Corpus - PhilPapers

La question de la normalisation des écrits scolaires pour leur traitement automatique. Le cas de l’omission de mots. Martina Ponton Barletta - 2025 - Corpus 26 (26).
This paper addresses the treatment of noise caused by word omissions in a corpus of school writings, in order to facilitate their subsequent automatic processing. While a normalization step may ease the processing of these texts, certain expressions remain hard to interpret, particularly where the writer has omitted words. The contribution proposes three potential solutions to this problem, automatic and semi-automatic. The first method employs a "mask" token in the form of xxx. The second (...)

Managing Noise in Part-of-Speech Tagging for Extremely Low-Resource Languages: Comparing Strategies for Corpus Collection and Annotation in Dagur and Alsatian. Delphine Dolińska Bernhard - 2025 - Corpus 26 (26).
Although Dagur and Alsatian represent two typologically distant language families, they share several similarities: both languages are endangered, do not have a unified spelling system, and have few available digital corpora. Given these challenges, the main aim of this article is to compare the noise in corpora for these languages and its impact on part-of-speech (POS) annotation and tagging. We first discuss what strategies can be used to reduce the noise due to spelling inconsistencies observed during corpus collection, using Dagur (...)

Transcription automatique des interactions verbales. Limites observées et perspectives envisagées à partir d’un corpus de consultations médicales. Thomas Quellec Bertin - 2025 - Corpus 26 (26).
Speech-to-text applications have made dazzling progress in recent years (e.g. Whisper). However, since they are usually intended to generate texts that conform to written standards, they tend to blur markers of orality (e.g. repetitions, pauses in the stream of words, phatics like er…). Thus, even if such applications promise huge gains in working time and transcription accuracy, they remain inadequate for the analysis of verbal exchanges. Relying on a sample of transcripts from medical consultations anticipating an awake (...)

Extraire des données textuelles pour l’analyse du discours : le Détricoteur. Romuald Jordan Dalodiere - 2025 - Corpus 26 (26).
There are a number of tools for extracting textual content on the Internet. Many of these tools were designed by researchers in the field of natural language processing, and are available as “packages”, which are sets of code files that developers have no trouble using but that may prove out of reach for laymen. Independent software solutions are few, and unlikely to satisfy the needs of researchers in the field of discourse analysis. In this article, we introduce a semi-automated, textual (...)

Élaboration de corpus pour l’enseignement de la grammaire : le corpus du jeu PP L’archer de Alloprof. Isabelle Leblanc Gauvin - 2025 - Corpus 26 (26).
Corpora in the didactics of grammar are little studied, even though they appear to be central in supporting the teaching and, above all, the learning of grammatical notions. After clarifying their role in teaching and identifying the characteristics of a well-constructed corpus, we analyze the corpus of the popular digital game PP L’archer from the Alloprof platform. We find that it contains a limited number of sentences and that they lack diversity in their construction. We discuss the relevance of our (...)

Introduction. Elisa Pallanti Gugliotta - 2025 - Corpus 26 (26).
Context. This volume grew out of the rich discussions and exchanges that took place during the study day entitled « Bruit de fond ou valeur ajoutée? Gérer le bruit lors des traitements informatiques des corpus linguistiques » ("Background noise or added value? Managing noise in the computational processing of linguistic corpora"), held in Grenoble on 28 April 2023. The event, jointly organized by Université Grenoble Alpes and Sapienza University of Rome, brought together the LIDILEM (Univ. Grenoble Alpes) and ECP (Univ. Lumière Lyon 2) laboratories as well as...

Terminologie de la couleur bleue en diachronie longue selon Google Livres Ngram Viewer. Agnieszka K. Kaliska - 2025 - Corpus 26 (26).
The digitization of ancient texts and tools for processing written corpora now make it possible to refine linguistic research in long diachrony. The aim of this analysis is to use the Ngram Viewer to study changes in the frequency of blue and its shades (studied through various terms, names of colors, pigments and dyes, e.g. bleu céleste, bleu roi, bleu de Paris, safre) in the French sub-collection of the Google Books corpus, which includes works published between 1500 and 2022. As (...)

Navigating Noise: A Stratified Model for Scholarly Digital Editions of Arabic Manuscripts in Hebrew Script. Valentina B. Lanza - 2025 - Corpus 26 (26).
This article investigates the role of Scholarly Digital Editions in advancing textual scholarship, with a focus on Judeo-Arabic, a unique linguistic variety using Hebrew script to represent Arabic. It proposes a stratified model for Scholarly Digital Editions to handle Judeo-Arabic’s orthographic complexities, including transcription, script conversion, normalization, and metadata integration. The study highlights the critical role of managing noise in textual editing, particularly during data acquisition, where any distortion or omission of orthographic elements can compromise scholarly integrity. A case study (...)

Des bruits dans mon corpus : des données à réduire au silence, à atténuer ou à écouter attentivement? Loïc Liégeois - 2025 - Corpus 26 (26).
In the field of NLP, “noise” is a notion with many different meanings. This is also true in fields of linguistics in which the analysis of ecological data is central. In studies involving corpus linguistic methods, noise management is an essential process for data collection, data structuring and data analysis. Paradoxically, this step is rarely discussed in detail, or is ignored altogether. In this paper, we propose to focus on the management of noise during the various stages classically defined in the processing of an (...)

Le bruit dans la mesure de la composante cognitive de l’émotion pour l’évaluation de l’acceptabilité des innovations. Jonas Noblet - 2025 - Corpus 26 (26).
This article explores the notion of noise in emotion annotation, within the context of measuring the acceptability of innovations. For the task under study, using annotator disagreement as a measure of noise tends to overestimate its presence in the data. An alternative approach is proposed, which involves modeling the annotation process through a probabilistic model. This method accounts for the variability in annotations, though it does limit the ability to accurately assess the noise component.

Decoding Dominant Narratives on Montenegro in the German Press - A Corpus-Driven Analysis (2016-2023). Sabina Vanni Osmanovic - 2025 - Corpus 26 (26).
In the context of a revision of EU enlargement to the Balkans, media coverage of Montenegro, one of the main actors in this process, has become more prominent in the German press in recent years. This research aims to explore representations of Montenegro in four leading German newspapers. The contribution proposes a textual data analysis approach applied to a corpus of articles devoted to Montenegro between 2016 and 2023. Using statistical text methods such as (...)

Quelle solution pour améliorer les performances de la reconnaissance d’entités nommées sur des données bruitées, corriger l’entrée ou filtrer la sortie? Ljudmila Koudoro-Parfait Petkovic - 2025 - Corpus 26 (26).
This paper presents research that aims to determine whether upstream OCR correction can significantly improve the results of the Named Entity Recognition (NER) task. The experiments were conducted on the ELTeC and Very Big Library (TGB) corpora. Our objective is to establish (i) a typology of OCR contaminations from the Kraken and Tesseract tools and (ii) a typology of errors in the automatic correction produced by the JamSpell tool. As part of our evaluation, we (...)

Apprivoiser le « bruit » en linguistique de corpus : expérience d’une analyse factorielle et propositions. Bénédicte Pincemin - 2025 - Corpus 26 (26).
As a concrete basis for thought, we first share the experience of a correspondence analysis applied to a morphosyntactically tagged corpus. The automatic tagging has been partially checked and shows errors. Knowing how factorial analysis works, and having identified the noisiest tags, we adjust the data to be submitted to the analysis so that it is both reliable and clear, making sense for the interpretation of the results to be drawn. This experiment calls for a three-fold discussion. Firstly, about the evaluation (...)

Numériser le patrimoine linguistique québécois : l’exemple des fiches dialectologiques de Gaston Dulong. Wim Remysen - 2025 - Corpus 26 (26).
In this article, we describe the various steps that led to the digitization of the dialectological records compiled by Gaston Dulong based on surveys conducted throughout Québec since the late 1940s. Due to their heritage value and scientific interest, these records are currently being added to the Fonds de données linguistiques du Québec (FDLQ). However, preparing this corpus poses significant challenges due to the large number of documents to be processed, their physical condition, and the nature of the data (...)

À pas de loup dans la bergerie… La problématique du silence et du bruit dans l’étiquetage automatique du Subjonctif Présent en français parlé. Christian Surcouf - 2025 - Corpus 26 (26).
Corpus linguistics has benefited greatly from advances in data processing. However, for the ordinary linguist, the use of morphosyntactic taggers still carries a certain element of mystery and magic. Equipped with this tool, the linguist can work on massive data, but is mostly unaware of how the labels were assigned, and more fundamentally, what errors may have arisen during labeling. And yet, the quality of the labeling process will determine the quality of the results used for analysis. While noise is (...)

Quand le verre s’efface derrière le vin : une analyse discursive des représentations du verre à vin. Albin Marchal Wagener - 2025 - Corpus 26 (26).
While studies on wine tasting are abundant, particularly in sensory marketing, few analyses focus on the wine glass as a specific object. This object has been examined from various perspectives, including alcohol consumption (Banwell 1999), social practices (Lo Monaco et al. 2009), and even more intriguingly, the influence of the glass on the perception of wine aroma (Cliff 2001, Delwiche & Pelchat 2002). The purpose of this article, however, is to analyze the linguistic and communicational object of the "wine (...)
