Text Analytics
In today’s world, the amount of information available digitally on web pages, social media, and in documents grows significantly every day. Analyzing and using this information is a critical process in many application areas, including business intelligence, medical decision making, customer care, knowledge management, and cybercrime prevention. However, the vast majority of this information is conveyed as unstructured, written text, which cannot easily be analyzed automatically by a computer program.
The field of Text Analytics, a subfield of Natural Language Processing (NLP), aims to understand how humans use natural language to convey information and knowledge. It develops techniques and models that enable computer programs to extract information and knowledge from unstructured text documents and make them available in structured form for further processing by downstream applications. For example, discovering adverse drug reactions in patient forums can improve public health and patient safety in medication use, and automatically monitoring news related to a company’s supplier network can improve supply chain risk management and enable faster decision making.
A major challenge in Text Analytics is that human language use is implicit: it omits information. Filling this information gap requires contextual inference, background and commonsense knowledge, and reasoning over situational context. Language also evolves, i.e., it specializes and changes over time. Language understanding therefore also requires continuous and efficient adaptation to new languages and domains, as well as transfer to and between them. Current language understanding methods, however, focus on high-resource languages and domains, use little to no context, and assume static data, task, and target distributions.
The research of the DFKI Speech & Language Technology group aims to address these challenges. Our work in Text Analytics centers on core research in domain adaptation, learning in low-resource settings, reasoning over larger contexts, continual learning, and multilingual models, applied in domains such as health and medicine, industry, and mobility. We strive for a deeper understanding of human language and thinking, with the goal of developing novel methods for processing and generating human language text, speech, and knowledge.
To this end, we combine deep linguistic analysis with state-of-the-art machine learning and neural approaches to NLP.
Other important aspects of our work are the creation of novel language resources for training and evaluating NLP models, the (linguistic) evaluation of NLP datasets and tasks, and the explainability of (neural) models. We conduct basic and applied research in areas including Information Extraction & Knowledge Base Population, Sentiment Analysis, Text Classification, and Summarization. Many of our state-of-the-art research results are made freely available to the community on GitHub, and we also maintain a Hugging Face repository. More information on projects, code repositories, and datasets can be found below.
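As a concrete illustration of one of these tasks, the following minimal sketch shows how a pretrained model from the Hugging Face Hub can be applied to named entity recognition, a basic building block of Information Extraction. It assumes the public transformers library; the checkpoint name is an illustrative public model, not one of our own releases.

```python
# Minimal sketch: extracting named entities from unstructured text with the
# Hugging Face transformers library.
from transformers import pipeline

# Token-classification pipeline; the aggregation strategy merges subword
# pieces back into whole entity spans.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",  # illustrative public checkpoint; any NER model works
    aggregation_strategy="simple",
)

text = "DFKI is headquartered in Kaiserslautern, Germany."
for entity in ner(text):
    # Each result carries the entity type, the surface string, and a confidence score.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```

Run on the example sentence, this prints structured records (e.g., an organization and two locations) that downstream applications can store or query, which is exactly the unstructured-to-structured step described above.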
Selected Projects:
Code and Models