Situation
The Swiss Federal Parliament records all of its meetings on video and publishes them on its website. Each video is accompanied by a transcript created by the specialists of the parliamentary services. Searching the website for a specific term returns the correct recording and its corresponding transcript, but there is no way to find the position in the video at which the term is actually spoken. A first approach using pure voice recognition produced unsatisfactory results.[1]
The parliamentary services therefore wanted to know whether the videos could be indexed based on the existing transcripts.
Forced alignment
Relating the words of a transcript to the corresponding sounds in a recording is known as «forced alignment» in the literature.[2] Tools for forced alignment (aligners) exist primarily for the purpose of linguistic research. A selection of the available tools[3] was tested to find out how well forced alignment works in general. Because all of the tools work out of the box for English, a 13-minute speech by John F. Kennedy and its exact transcript were used as a test corpus. The results indicate very good alignment precision for this simple scenario. To determine what precision is actually required, a simple user test was conducted.
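To make the idea concrete: once an aligner has assigned a start time to each transcript word, a simple lookup table is enough to jump from a search term to a position in the recording. The following is a minimal sketch only. The `AlignedWord` type, the function names and the sample timestamps are illustrative assumptions and do not reflect the output format of any particular aligner.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedWord:
    """One transcript word with the time (in seconds) at which it is spoken."""
    text: str
    start: float


def normalize(word: str) -> str:
    """Lower-case a word and strip surrounding punctuation for matching."""
    return word.strip(".,;:!?\"'«»()").lower()


def build_index(words):
    """Map each normalized word to the sorted list of its start times."""
    index = defaultdict(list)
    for w in words:
        key = normalize(w.text)
        if key:
            index[key].append(w.start)
    for times in index.values():
        times.sort()
    return dict(index)


def find_term(index, term):
    """Return all start times for a single-word term (empty list if absent)."""
    return index.get(normalize(term), [])


# Hypothetical aligner output for a short excerpt (made-up timestamps).
aligned = [
    AlignedWord("The", 12.40),
    AlignedWord("council", 12.62),
    AlignedWord("approves", 13.05),
    AlignedWord("the", 13.60),
    AlignedWord("budget.", 13.75),
]
index = build_index(aligned)
print(find_term(index, "Budget"))  # [13.75]
print(find_term(index, "the"))     # [12.4, 13.6]
```

Matching on normalized single words keeps the sketch short; multi-word phrases would need an additional step that checks for consecutive positions.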
User test: Required precision
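To understand the required precision of the alignment, we let 21 people use a basic video-search interface.[4] The search results were shifted by a random delay of -9 to 2 seconds relative to the moment the term is spoken. The test showed that a delay of -3 to -4 seconds is acceptable in most cases, that is, playback may start up to 3 to 4 seconds before the term. Half of this window is needed as a margin to avoid positive shifts, where playback would start only after the term has been spoken. This leaves a tolerance of about ±2 seconds for the alignment.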
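The arithmetic behind this window can be sketched as follows. The sketch assumes an acceptable range of -4 to 0 seconds relative to the spoken term (the upper bound of 0 expresses that positive shifts should be avoided). The function names and the idea of seeking to a fixed lead before the aligned word are assumptions made for illustration.

```python
def target_and_tolerance(earliest: float, latest: float):
    """Return the midpoint of an acceptable shift range and the
    allowed alignment error around that midpoint."""
    target = (earliest + latest) / 2
    tolerance = (latest - earliest) / 2
    return target, tolerance


def playback_start(word_time: float, target_shift: float) -> float:
    """Seek position for a search hit, never before the start of the video."""
    return max(0.0, word_time + target_shift)


target, tolerance = target_and_tolerance(-4.0, 0.0)
print(target, tolerance)              # -2.0 2.0
print(playback_start(95.5, target))   # 93.5
print(playback_start(1.0, target))    # 0.0
```

Aiming for the middle of the acceptable range means an alignment error of up to 2 seconds in either direction still keeps the result inside that range.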
Meeting videos corpus
Candidates for implementation were chosen based on licensing terms, supported languages and technological aspects. These candidates were tested with a larger corpus representing different attributes of the meeting videos: the language, the similarity between transcript and audio, and the sound quality of the video. All aligners placed over 90% of the words within the available threshold of ±2 seconds.
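A metric of this kind can be computed as in the following sketch. It assumes that reference start times are available for the same words in the same order as the aligner output; how the reference was obtained is not described here, and the function name and sample values are hypothetical.

```python
def share_within_threshold(reference, predicted, threshold=2.0):
    """Fraction of words whose predicted start time lies within
    `threshold` seconds of the reference start time."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one word is required")
    hits = sum(1 for r, p in zip(reference, predicted) if abs(p - r) <= threshold)
    return hits / len(reference)


# Made-up timestamps in seconds; the third word is off by 3.5 s.
reference = [10.0, 11.2, 15.0, 20.4]
predicted = [10.3, 11.0, 18.5, 19.9]
print(f"{share_within_threshold(reference, predicted):.1%}")  # 75.0%
```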
Dissimilarity of transcript and audio
The dissimilarity between the recording and the transcript proved to be the biggest cause of shifts. When a corpus of videos with more than 10% dissimilarity to the transcript was compared with the rest, differences in alignment precision could be observed.
Aeneas suffers most from these differences, while MAUS compensates for them quite well.
Conclusions
Of the three available solutions, MAUS provides the highest precision. However, it is rather slow and requires the use of a web service.
MAUS | % of corpus | Matches <2s |
---|---|---|
Highly similar | 70–80[5] | 98.7% |
Dissimilar | 20–30[5] | 94.5% |
Although Aeneas showed the lowest precision, its open-source AGPL license[6] and high speed still make it a candidate for implementation.
Aeneas | % of corpus | Matches <2s |
---|---|---|
Highly similar | 70–80[5] | 96.9% |
Dissimilar | 20–30[5] | 85.5% |
[1] G. Szaszak, M. Cernak, P. N. Garner, P. Motlicek, A. Nanchen, and F. Tarsetti, “Automatic Speech Indexing System of Bilingual Video Parliament Interventions,” 2013.
[2] N. Pörner, “Development of an automatic chunksegmentation tool for long transcribed speech recordings,” Jul. 2016.
[3] “GitHub - pettarin/forced-alignment-tools: A collection of links and notes on forced alignment tools,” http://heig.ch/buyofu , Nov. 2017.
[4] Online example of the video search interface, http://heig.ch/dababe
[5] Estimated by the staff of the Official bulletin.
[6] “GNU Affero General Public License v3.0,” http://heig.ch/birona