Speaker diarization is the problem of determining "who spoke when" in an audio recording when the number and identities of the speakers are unknown. Motivated by applications in automatic speech recognition and audio indexing, speaker diarization has been studied extensively over the past decade, and a wide variety of approaches now exist – including both top-down and bottom-up unsupervised clustering methods. The contributions of this thesis are to provide a unified analysis of the current state of the art, to understand where and why mistakes occur, and to identify directions for improvement. In the first part of the thesis, we analyze the behavior of six state-of-the-art diarization systems, all evaluated on the National Institute of Standards and Technology (NIST) Rich Transcription 2009 evaluation dataset. While performance is typically assessed in terms of a single number – the diarization error rate (DER) – we further characterize the errors based on speech segment durations and their proximity to speaker change points. We show that, for all of the systems, performance degrades both as segment duration decreases and as segments lie closer to a speaker change point. Although short segments are problematic, their overall impact on the DER is small, since the majority of scored time occurs in segments longer than 2.5 seconds. By contrast, the amount of time near speaker change points is relatively high, and thus poor performance near these change points contributes significantly to the DER. For example, for all evaluated systems, over 33% and 40% of the errors occur within 0.5 seconds of a change point for the single distant microphone (SDM) and multiple distant microphone (MDM) conditions, respectively. In the next part of the thesis, we focus on the International Computer Science Institute (ICSI) speaker diarization system and explore the effects of various system modifications.
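The error characterization described above can be sketched as follows. This is a minimal illustration, not the thesis's scoring code: it buckets frame-level errors by their distance to the nearest reference speaker change point, with hypothetical function names, a hypothetical frame representation, and example bucket edges chosen to echo the 0.5- and 2.5-second thresholds mentioned above.

```python
# Illustrative sketch (not the thesis's evaluation code): count erroneous
# frames by distance to the nearest reference speaker change point.

def nearest_change_distance(t, change_points):
    """Distance in seconds from time t to the closest change point."""
    return min(abs(t - c) for c in change_points)

def bucket_errors(error_frames, change_points, edges=(0.5, 1.0, 2.5)):
    """Count erroneous frames per distance-to-change-point bucket.

    error_frames:  center times (s) of frames scored as errors
    change_points: reference speaker change times (s)
    edges:         upper bounds (s) of the first buckets; a final bucket
                   collects everything beyond the last edge
    """
    counts = [0] * (len(edges) + 1)
    for t in error_frames:
        d = nearest_change_distance(t, change_points)
        for i, edge in enumerate(edges):
            if d <= edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # farther than the last edge
    return counts
```

Summing the counts that fall near change points against the total gives the kind of proportion reported above (e.g., the share of errors within 0.5 seconds of a change point).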
This system consists of several steps, including speech activity detection, initialization, speaker segmentation, and speaker clustering. Inspired by our earlier analysis, we focus on modifications that improve performance near speaker change points. We first implement an alternative to the minimum duration constraint, which sets the minimum amount of time a speaker must speak before a speaker change can occur. This modification yields a 12% relative improvement in the speaker error rate for the MDM condition, with the largest gains occurring closest to the speaker change point, and a 3% relative improvement for the SDM condition. Next, we show that the difference between the largest and second largest log-likelihood scores provides valuable information for unsupervised clustering: it indicates which regions of the output are likely correct. Lastly, we explore the potential of applying speaker diarization methodologies to other applications. Specifically, we investigate the use of a diarization-based algorithm for the problem of duplication detection, where the goal is to determine whether a given query (e.g., a short audio clip) has been taken from a reference set (e.g., a large collection of copyrighted media). With minimal modifications to the ICSI diarization system, we obtain moderate performance. However, our approach is not competitive with existing approaches designed specifically for duplication detection, and the extent to which diarization-based approaches are useful for this application remains an open question.
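The log-likelihood margin idea above can be sketched in a few lines. This is an assumed, simplified form rather than the ICSI implementation: given one segment's per-cluster log-likelihood scores, it returns the best-scoring cluster and the gap to the runner-up, where a large gap suggests a confidently (and likely correctly) labeled region.

```python
# Illustrative sketch (not the ICSI system's code): confidence via the
# gap between the top two per-cluster log-likelihoods for a segment.

def confidence_margin(log_likelihoods):
    """Return (best_cluster_index, margin) for one segment.

    log_likelihoods: per-cluster log-likelihood scores for the segment.
    A large margin suggests the cluster assignment is likely correct;
    a small margin flags a region that may be mislabeled.
    """
    ranked = sorted(range(len(log_likelihoods)),
                    key=lambda i: log_likelihoods[i], reverse=True)
    best, second = ranked[0], ranked[1]
    return best, log_likelihoods[best] - log_likelihoods[second]
```

Thresholding this margin is one simple way to select the "likely correct" regions of the diarization output referred to above.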