Multilinguality and Language Technology

E&E Group: Efficient and explainable NLP models

Despite their strong performance, modern NLP models and LLMs have specific flaws. First, they are black boxes: parameters of proprietary models are not accessible at all, and even non-proprietary models are largely opaque in the sense that it is unclear where exactly specific knowledge is encoded in potentially billions of parameters. Second, there is a tendency to keep increasing the size of LLMs and their training data to improve performance, which is especially problematic for domains or languages with fewer resources.

The E&E group of DFKI’s Research Department Multilinguality and Language Technology works on transparent and efficient NLP models. Our objective is to make the parameters and behaviour of LLMs more explainable and understandable to both end users and researchers. We try to improve LLMs with regard to data consumption, e.g. for domains or languages where data is scarce, by using structured data, new learning techniques, or other modalities; and in terms of model size, e.g. for settings where powerful hardware is not available.

We are involved in Twinning projects, where we provide knowledge transfer on both research topics and project management to newly established research institutions across Europe. We also take part in European procurement projects focusing on language resources, such as the European Language Resource Coordination and the Language Data Space.

Some current projects:

soofi - Sovereign Open Source Foundation Models

We are developing a larger AI language model that will be made available to business and society as open source. Based on a large language model (LLM), a so-called reasoning model will also be created using special procedures to increase the quality of the overall system and optimise resource consumption. In addition, initial use cases are to be implemented using AI agent technologies.

lorAI - Low Resource Artificial Intelligence

The main objective of the lorAI project is to upgrade the Kempelen Institute of Intelligent Technologies (KInIT) to a leading R&I institution in low resource artificial intelligence (LRAI) in Slovakia and Europe.

Fair Forward

Consulting services to Gesellschaft für Internationale Zusammenarbeit (GIZ) on technical aspects of AI in international cooperation including natural language processing (NLP), training data and data access for FAIR Forward – Artificial Intelligence for All. GIZ Project No. 19.2010.7-003.00

PERKS - Eliciting and Exploiting Procedural Knowledge in Industry 5.0

The PERKS project supports the holistic governance of industrial procedural knowledge (PK) across its entire life cycle, from elicitation to management and from access to exploitation.

TRAILS - Trustworthy and Inclusive Machines

Duration: 08/01/2024 - 07/31/2027
In TRAILS we focus on three main research directions: (i) inclusion of underrepresented languages and cultures through multilingual and culturally sensitive NLP, (ii) robustness and fairness with respect to long-tail phenomena and classes and "trustworthy content", and (iii) robust and efficient NLP models that enable training and deployment of models for (i) and (ii). We also partially address economic inequality by aiming for more efficient models (objective (iii)), which directly translates into a lower resource/cost footprint.

Selected recent publications

Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann
Accepted at the International Joint Conference on Natural Language Processing & Asia-Pacific Chapter of the Association for Computational Linguistics, 2025 (Main)
Modular Arithmetic: Language Models Solve Math Digit by Digit
Tanja Baeumel, Daniil Gurgurov, Yusser al Ghussin, Josef van Genabith, Simon Ostermann
Accepted at the International Joint Conference on Natural Language Processing & Asia-Pacific Chapter of the Association for Computational Linguistics, 2025 (Findings)
The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs
Tanja Baeumel, Josef van Genabith, Simon Ostermann
Accepted at the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages
Tatiana Anikina, Jan Cegin, Jakub Simko, Simon Ostermann
Accepted for EMNLP 2025 (Main Conference)
Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova
Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)
Soft Language Prompts for Language Transfer
Ivan Vykopal, Simon Ostermann, Marián Šimko
In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 10294–10313. 2025.
GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge
Daniil Gurgurov, Rishu Kumar, Simon Ostermann
In: Findings of the Association for Computational Linguistics: NAACL 2025, pages 1204–1221, Albuquerque, New Mexico. Association for Computational Linguistics. 2025.
HybridBERT - Making BERT Pretraining More Efficient Through Hybrid Mixture of Attention Mechanisms
Gokul Srinivasagan and Simon Ostermann
In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 285–291, Mexico City, Mexico. Association for Computational Linguistics. Runner-Up Best Paper Award.

E&E Members

Team Lead:
Dr. Simon Ostermann
simon.ostermann@dfki.de

Team Members:
Yusser al Ghussin
Tatiana Anikina
Tanja Bäumel
Daniil Gurgurov
Cennet Oguz
Stefania Racioppa

MSc Students and Research Assistants:
Katja Konermann
Lena Sophie Oberkircher
Amelie Seyfried
Gregory Charles Shook
Arushi Singhal
Mikhail Sonkin