Publication
Prosodic and other Long-Term Features for Speaker Diarization
G. Friedland; O. Vinyals; Y. Huang; Christian Müller
In: IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 5, Pages 985-993, 2009.
Abstract
Speaker diarization is defined as the task of determining "who spoke when" given an audio track
and no other prior knowledge of any kind. The following article shows how a state-of-the-art speaker
diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic
and other long-term features. First, we present a framework to study the speaker discriminability of 70
different long-term features. Then, we show how the top-ranked long-term features can be combined
with short-term features to increase the accuracy of speaker diarization. Results measured on
standardized datasets (NIST RT06 and RT07) show a consistent relative improvement of about 30%
in diarization error rate compared to the best system presented at the NIST evaluation in 2007.
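
To make the "relative" figure concrete, the following minimal sketch computes a relative reduction in diarization error rate (DER). The function name and the two DER values are hypothetical, chosen only to illustrate the arithmetic; they are not results from the article.

```python
def relative_improvement(baseline_der: float, new_der: float) -> float:
    """Return the relative DER reduction as a fraction of the baseline DER."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - new_der) / baseline_der


# Hypothetical DER values in percent (illustration only, not from the article).
baseline = 20.0   # assumed DER of a baseline system
improved = 14.0   # assumed DER after adding long-term features

print(f"relative improvement: {relative_improvement(baseline, improved):.0%}")
```

Under these assumed numbers, a drop of 6 points absolute corresponds to a 30% relative improvement, which is why relative and absolute reductions should not be confused when comparing systems.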