A central challenge in multimedia content analysis is automatically understanding the content of audio-visual data, which would enable novel and useful applications for both the end-user and the content provider. In this project we focus on the analysis of audio data, and in particular on the people whose voices are recorded. Our aim is to create robust methods to identify who is speaking (referred to as speaker recognition), and when they are speaking (referred to as speaker diarization or tracking).
Note that current speech processing algorithms typically perform significantly better when given speech from a single speaker (speaker verification and identification), or when the different speakers have already been separated so that speaker adaptation can be performed (speech recognition).
Direct applications of speaker recognition technology include biometrics, i.e. verifying a person's identity by comparing their voice against the model of the person they claim to be; and identifying a speaker among a set of known speakers.
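The two tasks above can be illustrated with a minimal sketch. Assuming speakers are represented by fixed-length embedding vectors (the function names, the cosine-similarity scoring, and the threshold value are all illustrative assumptions, not the patented method described later):

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

def verify(claimed_model, test_embedding, threshold=0.7):
    """Verification: accept the identity claim if the test voice is
    similar enough to the model of the claimed speaker."""
    return cosine_score(claimed_model, test_embedding) >= threshold

def identify(test_embedding, enrolled):
    """Identification: return the enrolled speaker whose model
    scores highest against the test voice."""
    return max(enrolled,
               key=lambda name: cosine_score(enrolled[name], test_embedding))
```

In practice the embeddings would come from a trained speaker model and the threshold would be tuned on held-out data; the structure of the two decisions, however, is as shown.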
Speaker diarization algorithms focus on automatically labeling the different people who participate in a recorded conversation, indicating when each person has spoken. The task is usually performed without prior knowledge of the identity or the number of speakers. Current state-of-the-art systems achieve low diarization error rates, but most of them are rather slow, which is an important limitation in applications that require faster-than-real-time processing.
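The key constraint of diarization, that the number of speakers is unknown in advance, can be sketched with a simple greedy clustering of per-segment embeddings. This is a toy illustration of the problem setting, not the method developed in this project; the distance measure and threshold are assumed for the example:

```python
import math

def _dist(a, b):
    """Euclidean distance between two segment embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diarize(segment_embeddings, threshold=0.5):
    """Assign each speech segment to the nearest existing speaker
    cluster, or open a new cluster (a new speaker) when no cluster
    centroid is within the distance threshold."""
    centroids = []  # running mean embedding per discovered speaker
    counts = []     # segments assigned to each speaker so far
    labels = []
    for emb in segment_embeddings:
        best, best_d = None, None
        for i, c in enumerate(centroids):
            d = _dist(emb, c)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is None or best_d > threshold:
            # No close speaker: start a new one.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the matched speaker's running mean.
            counts[best] += 1
            centroids[best] = [c + (x - c) / counts[best]
                               for c, x in zip(centroids[best], emb)]
            labels.append(best)
    return labels
```

Real systems typically use more robust statistical models and iterative resegmentation, but they face the same trade-off hinted at here: every segment must be scored against every candidate speaker, which is where processing speed becomes critical.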
Finally, speaker tracking determines when a known speaker has spoken within a long recording.
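A common way to frame tracking is to score short sliding windows of the recording against the known speaker's model and merge consecutive above-threshold windows into time intervals. A minimal sketch under that assumption (the scoring itself is left abstract, since it depends on the speaker model used):

```python
def track_speaker(window_scores, threshold=0.7, window_sec=1.0):
    """Given one similarity score per fixed-length window, merge
    consecutive above-threshold windows into (start, end) intervals
    in seconds where the known speaker is detected."""
    intervals = []
    start = None
    for i, score in enumerate(window_scores):
        if score >= threshold:
            if start is None:
                start = i * window_sec  # interval opens here
        elif start is not None:
            intervals.append((start, i * window_sec))
            start = None
    if start is not None:  # close an interval running to the end
        intervals.append((start, len(window_scores) * window_sec))
    return intervals
```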
Through a collaboration between the research teams at Telefónica R&D and the University of Avignon (France), we have developed, and submitted for patenting, an algorithm that robustly models a speaker's unique audio characteristics and compares speaker models very efficiently (up to 10 times faster than state-of-the-art systems).