Smart Data & Knowledge Services

Topic field: Multimedia Analysis and Data Mining (MADM)

In the topic area MADM, we are developing machine learning and data mining techniques to analyze and combine information from multi-modal inputs (e.g., combinations of image, audio, video, text, knowledge). The focus of our team can be structured into the overlapping areas of Deep Learning, Multi-modality, Efficiency and Explainability.

Deep Learning

The majority of our current research can be assigned to the Deep Learning field, in which we are mostly interested in the fusion of multiple modalities (further detailed below). Because many advances in Deep Learning were pioneered in the visual domain, our research often starts with scientific challenges and approaches from Computer Vision (e.g., Image Classification, Captioning, Video Object Segmentation) and later focuses on transferring the findings to, and combining them with, other domains. Much of this research also involves understanding and analyzing common architectures and components (e.g., CNNs, RNNs), as well as generative approaches such as GANs and Variational Autoencoders (VAEs), and Reinforcement Learning (RL) approaches.


Multi-modality

Even today, most models are still applicable to only one task (e.g., segmentation) in one modality (e.g., images). One of the overarching themes of our research is the desire to transfer findings from one modality to another and to combine information and signals from different modalities. The modalities our team works on include images, audio, video, text and knowledge. Along these lines, we have, for example, successfully transferred findings from the visual to the auditory domain (e.g., our recent ESResNet), applied visually motivated Deep Learning techniques to financial transaction data for outlier detection, and combined satellite imagery with social media posts to analyze flooded areas. We are also focusing on tasks and challenges that can benefit from architectures using several modalities at the same time. One example of such multi-modality is the field of Visual Question Answering (VQA), in which systems need to extract relevant information from an image based on a natural language question. We are currently also investigating the opposite direction (Text-to-Image (T2I) generation based on GANs and VAEs), methods to include better NLP models (e.g., based on (hybrid) transformer architectures), and ways to include graph information, for example from knowledge bases.
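As a rough illustration of what combining signals from different modalities can mean in its simplest form, the following sketch averages per-modality class probabilities (often called late fusion). It is not the team's method; the function name, the modality names, the weights and the scores are all hypothetical and chosen only for illustration.

```python
def late_fusion(probs_by_modality, weights=None):
    """Fuse per-modality class probabilities by a weighted average.

    probs_by_modality: dict mapping a modality name to a list of class
    probabilities (same class order for every modality).
    weights: optional dict of non-negative weights per modality
    (assumed equal if omitted).
    """
    names = list(probs_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    num_classes = len(probs_by_modality[names[0]])

    fused = [0.0] * num_classes
    for name in names:
        probs = probs_by_modality[name]
        if len(probs) != num_classes:
            raise ValueError("all modalities must score the same classes")
        for i, p in enumerate(probs):
            # Normalize weights so the fused scores still sum to 1.
            fused[i] += weights[name] / total * p
    return fused


# Hypothetical scores for the classes ("flooded", "not flooded").
scores = {
    "satellite_image": [0.70, 0.30],
    "social_media_text": [0.40, 0.60],
}
fused = late_fusion(
    scores, weights={"satellite_image": 2.0, "social_media_text": 1.0}
)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Architectures for tasks such as VQA typically fuse learned intermediate representations rather than final scores; the score-level average above is only the simplest baseline for the idea.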


Efficiency

Combining modern Deep Learning approaches with multi-modal datasets leads to numerous challenges due to the structure and sheer size of the involved datasets and the training methodologies. This led to our early and ongoing work on improving the efficiency of Deep Learning training processes, which has already resulted in the publication of several high-performance open-source libraries such as datadings, crumpets, simplejpeg and augpy. Based on these experiences, our team is also strongly involved in DFKI's activities to centralize Deep Learning-oriented compute capabilities (GPU HPC). In these activities, we are especially interested in developing and providing a flexible and easy-to-use cluster compute environment that supports the wide range of our research scenarios, from single-GPU to scalable multi-GPU and multi-node training, while at the same time allowing efficient re-use and sharing of resources among DFKI researchers.

Explainability (XAI)

Despite the continued success of DL methods in recent years, it is often challenging to explain why such approaches have made certain decisions. This hinders their adoption in many areas. Hence, our team also focuses on explaining and interpreting the models and on shedding some light into the DL model black boxes. In this area, we are especially interested in the robustness of approaches, in protecting them against adversarial attacks, and in tracing decisions back to certain aspects of the training datasets.
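One simple, model-agnostic way to look into such a black box is occlusion: replace one input feature at a time with a baseline value and record how much the model's output changes. The sketch below illustrates this idea only; it is not a description of the team's methods, and the toy model, its weights and the baseline value are hypothetical.

```python
def occlusion_attribution(model, inputs, baseline=0.0):
    """Return, per feature, the drop in model output when that feature
    is replaced by `baseline` (larger magnitude = more influence)."""
    reference = model(inputs)
    attributions = []
    for i in range(len(inputs)):
        occluded = list(inputs)
        occluded[i] = baseline  # mask exactly one feature
        attributions.append(reference - model(occluded))
    return attributions


# Toy stand-in for a trained model: a fixed linear scorer.
# The weights are hypothetical and exist only for this example.
WEIGHTS = [0.5, -2.0, 0.0]


def toy_model(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))


attr = occlusion_attribution(toy_model, [1.0, 1.0, 1.0])
most_influential = max(range(len(attr)), key=lambda i: abs(attr[i]))
```

For a linear model with a zero baseline, the attributions equal weight times input, which makes the toy case easy to check by hand. For deep models, the result depends strongly on the chosen baseline and on interactions between features.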

  • Multimedia Analysis
  • Image Analysis & Captioning
  • Video Object Segmentation
  • Audio Classification
  • Fusion Strategies
  • Remote Sensing / Satellite Imagery
  • Deep Learning models (in general)
  • Interpretability, Explainability, XAI, Robustness
  • Efficiency, GPU Computing, GPU HPC
  • Meta-Learning, Self-supervision and Unsupervised Learning
  • Anomaly detection
  • Embeddings

Former Members:

  • Adrian Ulges
  • Armin Stahl
  • Christian Schulze
  • Damian Borth
  • Joost van Beusekom
  • Markus Goldstein
  • Matthias Reif
  • Philipp Blandfort
  • Thomas Breuel




Dr. Jörn Hees
Phone: +49 631 20575 1180

Deutsches Forschungszentrum für
Künstliche Intelligenz GmbH (DFKI)
Smarte Daten & Wissensdienste
Trippstadter Str. 122
67663 Kaiserslautern
