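Which solution improves named entity recognition performance on noisy data: correcting the input or filtering the output? [Quelle solution pour améliorer les performances de la reconnaissance d’entités nommées sur des données bruitées, corriger l’entrée ou filtrer la sortie?]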

Corpus 26 (26) (2025)

Abstract

This paper reports on research into whether upstream OCR correction can significantly improve Named Entity Recognition (NER) results. The experiments were run on the ELTeC and Very Big Library (TGB) corpora. Our objective is to establish (i) a typology of the OCR contaminations produced by the Kraken and Tesseract tools and (ii) a typology of the errors introduced by automatic correction with the JamSpell tool. For the evaluation, we study, on the one hand, the intersections between the NEs that spaCy detects in the reference texts and in the two OCR versions and, on the other hand, the cosine similarity used to measure the textual distance between these versions. Our results show that automatic correction introduces biases into the textual data and that filtering the output NEs seems to be a more promising approach. Finally, we find that incorrect NEs are overwhelmingly short entities and hapaxes in the corpus, and that entity length and document frequency are discriminating criteria that allow candidate NEs to be excluded with a precision of over 98%.
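The output-filtering idea in the abstract can be illustrated with a minimal Python sketch. It assumes the candidate NEs have already been extracted as plain strings, one list per document. The function names and the thresholds `min_length` and `min_doc_freq` are hypothetical and chosen only for illustration; the abstract does not state the values used in the paper.

```python
from collections import Counter


def document_frequency(entities_per_doc):
    """Count, for each entity string, how many documents contain it."""
    df = Counter()
    for entities in entities_per_doc:
        # set() ensures an entity is counted once per document
        df.update(set(entities))
    return df


def filter_entities(entities_per_doc, min_length=4, min_doc_freq=2):
    """Keep only entities that are long enough and appear in enough documents.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    df = document_frequency(entities_per_doc)
    return [
        [e for e in entities if len(e) >= min_length and df[e] >= min_doc_freq]
        for entities in entities_per_doc
    ]


# Toy candidate NEs from three OCR'd documents (invented data)
docs = [
    ["Paris", "Pari5", "Victor Hugo"],
    ["Paris", "Victor Hugo", "M."],
    ["Paris", "lI"],
]
filtered = filter_entities(docs)
print(filtered)
```

In this toy input, the short candidates ("M.", "lI") and the candidate found in a single document ("Pari5") are discarded, while "Paris" and "Victor Hugo" are kept.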
