Corpus 26 (26) (2025)
Abstract
As a concrete basis for discussion, we first report on a correspondence analysis applied to a morphosyntactically tagged corpus. The automatic tagging has been partially checked and contains errors. Knowing how factorial analysis works, and having identified the noisiest tags, we adjust the data submitted to the analysis so that they are both reliable and clear, and so that the results can be meaningfully interpreted.

This experiment calls for a three-fold discussion. First, regarding the evaluation of such a process: by bypassing errors (noise) rather than correcting them, we shift the analysis framework, which rules out the usual evaluation based on comparing results and measuring improvement. Second, we argue that adjusting the data can still be a sound scientific approach when the process is methodical and transparent. Third, we consider whether corpus size matters, in relation to the law of large numbers: statistics neutralize the noise of random fluctuations, but not that of repetitive bias.

Thus, we argue for “taming” noise: that is, knowing your corpus and your tools well enough to design quality analyses while managing residual noise, and freeing yourself from a prior, and probably illusory, requirement for data perfection.
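The contrast between random fluctuations and repetitive bias can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the procedure used in the study: the function `estimate_rate`, the true tag share of 0.20, and the systematic miss rate of 0.25 are all hypothetical values chosen for illustration.

```python
import random


def estimate_rate(n_tokens, true_rate=0.20, miss_rate=0.0, seed=0):
    """Estimate the relative frequency of one tag in a simulated corpus.

    true_rate: hypothetical true share of tokens that carry the tag.
    miss_rate: hypothetical share of those tokens that the tagger
               systematically labels as something else (repetitive bias).
    """
    rng = random.Random(seed)
    observed = 0
    for _ in range(n_tokens):
        has_tag = rng.random() < true_rate
        # A biased tagger misses a fixed share of the true occurrences.
        if has_tag and rng.random() >= miss_rate:
            observed += 1
    return observed / n_tokens


if __name__ == "__main__":
    for n in (1_000, 10_000, 200_000):
        unbiased = estimate_rate(n)
        biased = estimate_rate(n, miss_rate=0.25)
        print(f"n={n:>7}  unbiased={unbiased:.4f}  biased={biased:.4f}")
```

Under these assumptions, the unbiased estimate approaches 0.20 as the simulated corpus grows, while the biased estimate approaches 0.15 (that is, 0.20 × 0.75): enlarging the corpus reduces the sampling fluctuation but leaves the systematic error untouched.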