Corpus 26 (26) (2025)
Abstract
This paper presents research that aims to determine whether upstream OCR correction can significantly improve the results of the Named Entity Recognition (NER) task. The experiments were carried out on the ELTeC and Very Big Library (TGB) corpora. Our objective is to establish (i) a typology of the OCR contaminations produced by the Kraken and Tesseract tools and (ii) a typology of the errors in the automatic correction produced by the JamSpell tool. As part of our evaluation, we study, on the one hand, the intersections between the NEs detected by the spaCy tool in the reference texts and in the two OCR versions and, on the other hand, the cosine similarity used to measure the textual distance between these versions. Our results show that automatic correction introduces biases into the textual data and that filtering the output NEs seems to be a more promising approach. Finally, we find that incorrect NEs are overwhelmingly short entities and hapaxes in the corpus, and that entity length and document frequency are discriminating criteria that make it possible to exclude candidate NEs with a precision of over 98%.
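The three measurements named in the abstract (NE intersection, cosine similarity between text versions, and filtering candidate NEs by length and document frequency) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the bag-of-words vectors, and the thresholds `min_length=4` and `min_doc_freq=2` are all assumptions made for illustration, and the entity lists are assumed to have been extracted beforehand (for example with spaCy).

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors (assumed representation)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def entity_overlap(reference_entities, ocr_entities):
    """Share of distinct reference entities that are also found in the OCR version."""
    ref, hyp = set(reference_entities), set(ocr_entities)
    return len(ref & hyp) / len(ref) if ref else 0.0


def filter_entities(entities_per_doc, min_length=4, min_doc_freq=2):
    """Drop short entities and entities seen in too few documents.

    Both thresholds are hypothetical; the abstract does not state the values used.
    """
    doc_freq = Counter()
    for entities in entities_per_doc:
        doc_freq.update(set(entities))  # count each entity once per document
    return [
        [e for e in entities if len(e) >= min_length and doc_freq[e] >= min_doc_freq]
        for entities in entities_per_doc
    ]
```

With these assumed thresholds, a noisy candidate such as `"Lvon"` that appears in a single document is discarded, while `"Paris"` found in two documents is kept. Document frequency is used here as a simple proxy for the hapax criterion.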