Automatic-Subtitling, Comparison on the Performance of Forced Alignment and Automatic Speech Recognition

Mino Lee Sasse; Stefan Schaffer; Aaron Ruß

In: 32. Konferenz Elektronische Sprachsignalverarbeitung. Elektronische Sprachsignalverarbeitung (ESSV-2021), March 3-5, Berlin/Virtual, Germany, TUDpress, Dresden, 2021.


This work focuses on the automatic generation of subtitles using tools that can be categorized as either Forced Aligners (FAs) or Automatic Speech Recognizers (ASRs). We compared the performance of FA and ASR on the task of generating same-language subtitles. The primary motivation was an earlier task: extracting sentence-level utterances from audio files using word timestamps. Three tools were used: aeneas [1], which is an FA, and Cerence [2] and Sonix [3], which are both ASRs. We conducted a technical evaluation and a subjective evaluation based on a case study. In this study, participants were presented with different stimuli, each using subtitles generated from the timing information provided by one of these tools. The data from the case study confirmed that Cerence performed better than aeneas.
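To illustrate how word timestamps can be turned into subtitles, the following is a minimal sketch, not the authors' implementation. It assumes each tool's output has already been converted into a list of words with start and end times in seconds; the `Word` class and the three helper functions are hypothetical names introduced here, and splitting sentences on trailing punctuation is a simplifying assumption. The output follows the common SubRip (SRT) cue layout.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One word with timing information (hypothetical input format)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_cues(words, sentence_end=".!?"):
    """Group words into sentence-level cues of the form (start, end, text).

    Assumption: a sentence ends at a word whose last character is one of
    the characters in sentence_end.
    """
    cues, current = [], []
    for word in words:
        current.append(word)
        if word.text and word.text[-1] in sentence_end:
            text = " ".join(w.text for w in current)
            cues.append((current[0].start, current[-1].end, text))
            current = []
    if current:
        # Keep trailing words that lack closing punctuation.
        text = " ".join(w.text for w in current)
        cues.append((current[0].start, current[-1].end, text))
    return cues


def cues_to_srt(cues) -> str:
    """Render cues as SRT text, numbering the cues from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    words = [
        Word("Hello", 0.0, 0.4),
        Word("world.", 0.5, 1.0),
        Word("Bye", 1.5, 1.8),
    ]
    print(cues_to_srt(words_to_cues(words)))
```

Under this scheme the subtitle quality depends directly on the accuracy of the word timestamps, which is the quantity on which a forced aligner and a speech recognizer can differ.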


Further Links

19_sasse.pdf (pdf, 374 KB)

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence