Videx | jonasoesch

Client

Swiss Federal Parliament

Period

October 2016 until February 2017

Product

Study on the feasibility of an in-video search engine for parliamentary speeches

Contribution

Phonetics, speech recognition, prototyping, user research

Situation

The Swiss Federal Parliament records all of its meetings on video and publishes them on its website. The videos are accompanied by a transcript created by the specialists of the parliamentary services. Searching the website for a specific term returns the correct recording and its corresponding transcript. Yet there is no way to find the position in the video at which the term is actually spoken. A first approach using pure voice recognition produced unsatisfactory results.^[1]
The parliamentary services therefore wanted to know whether the videos could be indexed based on the existing transcripts.

Forced alignment

Relating the words of a transcript to the corresponding sounds in a recording is known as «forced alignment» in the literature.^[2] Tools for forced alignment (aligners) exist primarily for the purpose of linguistic research. A selection of the available tools^[3] was tested to find out how well forced alignment works in general. Because all tools work out of the box for English, a 13-minute speech by John F. Kennedy and an exact transcript were used as a test corpus. The results indicate very good alignment precision for this simple scenario. To determine what precision would actually be required, a simple user test was conducted.
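
To make the idea concrete, here is a minimal sketch of how the output of an aligner could feed an in-video search. It assumes the aligner emits one (word, start, end) entry per spoken word; the data format, the sample words, the timings and the function names are illustrative assumptions, not the output of any of the tested tools.

```python
from collections import defaultdict

# Hypothetical aligner output: (word, start_seconds, end_seconds) per spoken word.
# Words and timings are made up for illustration.
ALIGNMENT = [
    ("the", 0.00, 0.18),
    ("council", 0.18, 0.71),
    ("votes", 0.71, 1.15),
    ("on", 1.15, 1.30),
    ("the", 1.30, 1.42),
    ("budget", 1.42, 2.05),
]


def build_index(alignment):
    """Map each lower-cased word to the start times at which it is spoken."""
    index = defaultdict(list)
    for word, start, _end in alignment:
        index[word.lower()].append(start)
    return dict(index)


def find_positions(index, term):
    """Return all start times (in seconds) for a search term, or an empty list."""
    return index.get(term.lower(), [])


index = build_index(ALIGNMENT)
print(find_positions(index, "Budget"))
print(find_positions(index, "the"))
```

A search for a term then yields the timestamps to which the video player can jump, instead of only the recording as a whole.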

User test: Required precision

To understand the required precision of the alignment, we had 21 people use a basic video-search interface^[4]. The search results were shifted by a random offset of -9 to +2 seconds. The test showed that a delay of -3 to -4 seconds, meaning that playback starts 3 to 4 seconds before the term is spoken, is acceptable in most cases. Half of this window is needed to avoid positive shifts, where playback would start after the term. This leaves a tolerance of about ±2 seconds.
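
One way to read this reasoning, sketched under assumptions: if playback always starts 2 seconds before the aligned timestamp, an alignment error of up to ±2 seconds results in an effective offset between -4 and 0 seconds, which stays inside the range users accepted and never starts after the term. The constant and function names below are hypothetical.

```python
LEAD_SECONDS = 2.0  # assumption: half of the roughly 4-second acceptable window


def playback_start(aligned_time, lead=LEAD_SECONDS):
    """Start playback `lead` seconds before the aligned timestamp (never below 0)."""
    return max(0.0, aligned_time - lead)


def effective_offset(true_time, aligned_time, lead=LEAD_SECONDS):
    """Offset of the playback start relative to the moment the term is really spoken.

    Negative values mean playback starts before the term, positive values after it.
    """
    return playback_start(aligned_time, lead) - true_time


# The term is really spoken at 60.0 s; the aligner is off by +1.5 s and by -2.0 s.
print(effective_offset(60.0, 61.5))
print(effective_offset(60.0, 58.0))
```

In both cases playback starts before the term is spoken, and at most 4 seconds early.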

Meeting videos corpus

Candidates for implementation were chosen based on licensing terms, supported languages and technological aspects. These candidates were tested with a larger corpus representing different attributes of the meeting videos. These attributes included the language, the similarity between transcript and audio, and the sound quality of the video. All aligners placed over 90% of the words within the available threshold of ±2 seconds.
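
A metric of this kind can be sketched as follows, assuming that manually verified reference timestamps are available for the same words the aligner placed. The function name and the sample values are illustrative and are not taken from the study.

```python
def share_within_threshold(reference_starts, aligned_starts, threshold=2.0):
    """Fraction of words whose aligned start lies within `threshold` seconds
    of the reference start. Both lists must describe the same words in order."""
    if len(reference_starts) != len(aligned_starts):
        raise ValueError("reference and aligned lists must have the same length")
    if not reference_starts:
        return 0.0
    hits = sum(
        1
        for ref, aligned in zip(reference_starts, aligned_starts)
        if abs(aligned - ref) <= threshold
    )
    return hits / len(reference_starts)


# Made-up example: four words, one of them misplaced by 3.5 seconds.
reference = [1.0, 5.0, 9.0, 14.0]
aligned = [1.3, 4.1, 12.5, 14.2]
print(share_within_threshold(reference, aligned))
```

Here three of the four words fall inside the ±2 second threshold.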

Dissimilarity of transcript and audio

The dissimilarity between the recording and the transcript was shown to be the biggest cause of shifts. Comparing the videos whose audio differs from the transcript by more than 10% with the rest of the corpus revealed differences in alignment precision.
Aeneas suffers most from the differences, while MAUS compensates for them quite well.
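
The source does not state how dissimilarity was measured. As one possible approach, the sketch below compares the official transcript with a verbatim transcription of a sample on the word level, using Python's standard `difflib`. The 10% cut-off comes from the text above; the measure itself and the sample sentences are assumptions.

```python
from difflib import SequenceMatcher


def dissimilarity(transcript, spoken):
    """Word-level dissimilarity between 0.0 (identical) and 1.0 (nothing shared)."""
    a = transcript.lower().split()
    b = spoken.lower().split()
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_dissimilar(transcript, spoken, cutoff=0.10):
    """Classify a speech as dissimilar if it exceeds the 10% cut-off."""
    return dissimilarity(transcript, spoken) > cutoff


# Made-up example: the edited transcript drops filler words that were spoken.
official = "the council approves the budget"
verbatim = "well the council uh approves the budget"
print(round(dissimilarity(official, verbatim), 3))
print(is_dissimilar(official, verbatim))
```

Edited transcripts typically omit fillers and repetitions, which is the kind of difference a measure like this would pick up.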

Conclusions

Of the three available solutions, MAUS provides the highest precision. However, it is rather slow and requires the use of a web service.

MAUS	Share of corpus (%)	Matches within 2 s
Highly similar	70–80^[5]	98.7%
Dissimilar	20–30^[5]	94.5%

While Aeneas performed worst in precision, its open-source AGPL license^[6] and high speed still make it a candidate for implementation.

Aeneas	Share of corpus (%)	Matches within 2 s
Highly similar	70–80^[5]	96.9%
Dissimilar	20–30^[5]	85.5%
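
The two tables can be combined into a rough overall figure per aligner by weighting each row with its estimated share of the corpus. The sketch below does this for both ends of the estimated range (70% and 80% highly similar speeches). It assumes the second table belongs to Aeneas, as the paragraph preceding it suggests, and the weighting itself is an illustration rather than a calculation from the study.

```python
def overall_share(similar_pct, dissimilar_pct, similar_fraction):
    """Weighted share of words matched within 2 s across the whole corpus."""
    return similar_fraction * similar_pct + (1.0 - similar_fraction) * dissimilar_pct


# (highly similar, dissimilar) match rates in percent, taken from the tables above.
RESULTS = {
    "MAUS": (98.7, 94.5),
    "Aeneas": (96.9, 85.5),
}

for name, (similar_pct, dissimilar_pct) in RESULTS.items():
    low = overall_share(similar_pct, dissimilar_pct, 0.70)
    high = overall_share(similar_pct, dissimilar_pct, 0.80)
    print(f"{name}: {low:.1f}% to {high:.1f}%")
```

Under these assumptions MAUS lands at roughly 97.4% to 97.9% and Aeneas at roughly 93.5% to 94.6%, both above the 90% mark mentioned for the meeting videos corpus.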

[1] G. Szaszak, M. Cernak, P. N. Garner, P. Motlicek, A. Nanchen, and F. Tarsetti, “Automatic Speech Indexing System of Bilingual Video Parliament Interventions,” 2013.
[2] N. Pörner, “Development of an automatic chunksegmentation tool for long transcribed speech recordings,” Jul. 2016.
[3] “GitHub - pettarin/forced-alignment-tools: A collection of links and notes on forced alignment tools,” http://heig.ch/buyofu, Nov. 2017.
[4] Online example of the video search interface, http://heig.ch/dababe
[5] Estimated by the staff of the Official Bulletin.
[6] “GNU Affero General Public License v3.0,” http://heig.ch/birona