Publication

Dialectal Filtering: Synthesizing Kurdish Corpora for Low-Resource Varieties by Utilizing “Noise” in Large Textual Data

Christian Schuler; Raman Ahmad; Ānrán Wáng; Daniil Gurgurov; Timo Baumann; Simon Ostermann; Josef van Genabith

In: TBA. International Conference on Language Resources and Evaluation (LREC-2026), May 11-16, Palma de Mallorca, Spain, TBA, 2026.

Abstract

This work introduces a dialect-aware text filtering framework to pre-process, clean, and enhance large text corpora, creating variety-specific sub-corpora for neglected language varieties. We apply our framework to Kurdish, a language with rich dialectal diversity that presents significant challenges for Natural Language Processing (NLP) due to its low-resource status and the noisy nature of available text corpora. Leveraging lexicographic features, we assign multi-language labels to text instances and synthesize over 130 new sub-corpora from pre-existing but noisy datasets. By “noisy” we mean datasets that contain a mix of Kurdish varieties in which none of the varieties is flagged or labeled as such. While the approach shows promise, challenges such as feature overlap and context ambiguity persist, and expansion beyond lexicographic features is left for future work. This work contributes to the foundations of language technology for low-resource languages, especially dialect-specific NLP applications. Specifically, we advance research on Kurdish languages by providing insights into the linguistic relationships among Kurdish varieties.
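
The labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the variety names, the marker tokens in `FEATURES`, and the functions `label_instance` and `build_subcorpora` are all hypothetical placeholders, and the markers are not real Kurdish features. The sketch assigns each text instance every variety whose marker tokens it contains, then groups instances by their label combination, so that each combination forms its own sub-corpus.

```python
from collections import defaultdict

# Hypothetical marker tokens per variety. These are toy placeholders,
# NOT real lexicographic features of any Kurdish variety.
FEATURES = {
    "variety_a": {"xa", "qo"},
    "variety_b": {"qo", "zu"},
}


def label_instance(text, features=FEATURES):
    """Return the sorted list of varieties whose marker tokens occur in text."""
    tokens = set(text.lower().split())
    return sorted(v for v, markers in features.items() if tokens & markers)


def build_subcorpora(lines, features=FEATURES):
    """Group lines by label combination; lines without a label go under ()."""
    subcorpora = defaultdict(list)
    for line in lines:
        key = tuple(label_instance(line, features))
        subcorpora[key].append(line)
    return dict(subcorpora)


if __name__ == "__main__":
    corpus = ["xa foo", "qo bar", "zu baz", "nothing here"]
    for labels, members in build_subcorpora(corpus).items():
        print(labels, members)
```

In this toy setup the shared marker `qo` yields an instance with two labels, which mirrors the feature overlap the abstract names as an open challenge; how the actual framework resolves such cases is not stated here.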