Videx

Indexing videos for the „official bulletin“ of the Swiss Federal Parliament

Situation

The Swiss Federal Parliament records all of its meetings on video and publishes them on its website. The videos are accompanied by a transcript created by the specialists of the parliamentary services. Searching the website for a specific term returns the correct recording and its corresponding transcript. However, there is no way to find the position in the video at which the term is actually spoken. A first approach using pure speech recognition produced unsatisfactory results [1].
The parliamentary services therefore wanted to know whether the videos could be indexed on the basis of the existing transcripts.

Forced alignment

Relating the words of a transcript to the corresponding sounds in a recording is known as «forced alignment» in the literature [2]. Tools for forced alignment (aligners) exist primarily for the purpose of linguistic research. A selection of the available tools [3] was tested to find out how well forced alignment works in general. Because all tools work out of the box for English, a 13-minute speech by John F. Kennedy and an exact transcript of it were used as a test corpus. The results indicate very good alignment precision for this simple scenario. To determine what precision would actually be required, a simple user test was conducted.
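To illustrate what the output of an aligner makes possible, the following minimal sketch builds a search index from word-level timestamps. The input format (a list of word and start-time pairs) and the names `AlignedWord`, `normalize` and `build_index` are assumptions made for illustration; each real aligner has its own output format, which would first have to be converted.

```python
from collections import defaultdict
from typing import NamedTuple


class AlignedWord(NamedTuple):
    """One transcript word with the time (in seconds) at which it is spoken."""
    text: str
    start: float


def normalize(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation."""
    return token.strip(".,;:!?\"'()«»").lower()


def build_index(words: list[AlignedWord]) -> dict[str, list[float]]:
    """Map each normalized term to the sorted start times of its occurrences."""
    index: dict[str, list[float]] = defaultdict(list)
    for word in words:
        term = normalize(word.text)
        if term:  # skip tokens that consist of punctuation only
            index[term].append(word.start)
    for times in index.values():
        times.sort()
    return dict(index)


# Hypothetical aligner output for a short passage (not real data).
aligned = [
    AlignedWord("The", 12.4),
    AlignedWord("council", 12.7),
    AlignedWord("approves", 13.3),
    AlignedWord("the", 13.9),
    AlignedWord("budget.", 14.1),
]
index = build_index(aligned)
print(index["budget"])  # [14.1]
print(index["the"])     # [12.4, 13.9]
```

A search for a term then reduces to a dictionary lookup that returns every position in the recording at which the term occurs.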

User test: Required precision

To understand the required precision of the alignment, we let 21 people use a basic video-search interface [4]. The search results were shifted by a random delay of -9 to 2 seconds. The test showed that a shift of -3 to -4 seconds, that is, playback starting 3 to 4 seconds before the term is spoken, is acceptable in most cases. Half of this window is needed as a margin against positive shifts, where playback would start only after the term. This leaves a tolerance of about ±2 seconds for the alignment.
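One way to read this result is that a search interface should start playback slightly before the aligned word time, so that an alignment error of up to about 2 seconds in either direction still produces a start inside the accepted range. The sketch below works through that arithmetic. The 2-second lead and the names `seek_position` and `perceived_shift` are assumptions for illustration, not part of the project.

```python
LEAD_SECONDS = 2.0  # assumed lead: centre of the accepted range of 0 to -4 s


def seek_position(word_start: float, lead: float = LEAD_SECONDS) -> float:
    """Return the playback start time for a search hit, never below 0."""
    return max(0.0, word_start - lead)


def perceived_shift(true_start: float, aligned_start: float,
                    lead: float = LEAD_SECONDS) -> float:
    """Shift experienced by the user; negative means playback starts early."""
    return seek_position(aligned_start, lead) - true_start


# The word is really spoken at 100 s; the aligner is off by +2 s or -2 s.
print(perceived_shift(100.0, 102.0))  # 0.0  (starts exactly on the word)
print(perceived_shift(100.0, 98.0))   # -4.0 (starts 4 s early)
```

With this lead, alignment errors within ±2 seconds map to perceived shifts between 0 and -4 seconds, so playback never starts after the term.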

Meeting videos corpus

Candidates for implementation were chosen based on licensing terms, supported languages and technological aspects. These candidates were tested with a larger corpus covering different attributes of the meeting videos: language, similarity between transcript and audio, and the sound quality of the video. All aligners placed over 90% of the words within the required tolerance of ±2 seconds.
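The share of words inside the tolerance can be computed as sketched below. This assumes that a reference time is available for each evaluated word (for example from manual annotation); how the reference times were obtained in the project is not stated here, and the function name `share_within_tolerance` and the sample values are hypothetical.

```python
def share_within_tolerance(reference: list[float], aligned: list[float],
                           tolerance: float = 2.0) -> float:
    """Fraction of words whose aligned time lies within `tolerance` seconds
    of the reference time."""
    if len(reference) != len(aligned):
        raise ValueError("reference and aligned must have the same length")
    if not reference:
        return 0.0
    hits = sum(1 for ref, ali in zip(reference, aligned)
               if abs(ali - ref) <= tolerance)
    return hits / len(reference)


# Hypothetical times in seconds; the third word is off by 2.5 s.
reference_times = [1.0, 5.0, 9.5, 14.0, 20.0]
aligned_times = [1.2, 4.1, 12.0, 14.3, 19.0]
print(f"{share_within_tolerance(reference_times, aligned_times):.0%}")  # 80%
```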

Dissimilarity of transcript and audio

The dissimilarity between the recording and the transcript proved to be the biggest cause of shifts. Comparing the videos whose audio differs from the transcript by more than 10% with the rest of the corpus reveals differences in alignment precision.
Aeneas suffers most from these differences, while MAUS compensates for them fairly well.
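The text does not define how dissimilarity is measured. Purely as an illustration, one possible word-level measure compares the published transcript with a verbatim transcription of the audio using Python's standard `difflib`. The choice of measure, the function name `dissimilarity` and the sample sentences are all assumptions.

```python
from difflib import SequenceMatcher


def dissimilarity(transcript: str, spoken: str) -> float:
    """Return 1 minus the word-level similarity ratio of two texts
    (0.0 = identical word sequences, 1.0 = no words in common)."""
    a = transcript.lower().split()
    b = spoken.lower().split()
    return 1.0 - SequenceMatcher(None, a, b).ratio()


# Hypothetical example: the published transcript was edited slightly.
published = "the council approves the budget"
verbatim = "the council has approved the budget"
score = dissimilarity(published, verbatim)
print(f"{score:.1%}")
print("dissimilar" if score > 0.10 else "highly similar")
```

A recording could then be assigned to the dissimilar group when its score exceeds the 10% threshold mentioned above.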

Conclusions

Of the three available solutions, MAUS provides the highest precision. However, it is rather slow and requires the use of a web service.

MAUS             % of corpus    Matches <2 s
Highly similar   70–80 [5]      98.7%
Dissimilar       20–30 [5]      94.5%

Although Aeneas performed worst in terms of precision, its open-source AGPL license [6] and high speed still make it a candidate for implementation.

Aeneas           % of corpus    Matches <2 s
Highly similar   70–80 [5]      96.9%
Dissimilar       20–30 [5]      85.5%

  1. G. Szaszak, M. Cernak, P. N. Garner, P. Motlicek, A. Nanchen, and F. Tarsetti, “Automatic Speech Indexing System of Bilingual Video Parliament Interventions,” 2013. 

  2. N. Pörner, “Development of an automatic chunksegmentation tool for long transcribed speech recordings,” Jul. 2016.

  3. “GitHub - pettarin/forced-alignment-tools: A collection of links and notes on forced alignment tools,” http://heig.ch/buyofu , Nov. 2017 

  4. Online example of the video search interface, http://heig.ch/dababe 

  5. Estimated by the staff of the Official Bulletin.

  6. “GNU Affero General Public License v3.0,” http://heig.ch/birona