Rev AI at INTERSPEECH 2019

Written by:

Miguel Jette

September 23, 2019

INTERSPEECH is always a fascinating conference, with really smart folks from all over the world working on a very broad range of applications and research topics. In our daily work, we are typically very focused on the topics that are relevant to us as a company, so it’s nice to take time to learn about new applications and new research areas. For example, I was amazed to learn about the new concerns over text-to-speech (TTS) fraud and the relatively new (to me at least) field dedicated to detecting spoofing (e.g., the ASVspoof Challenge).

This year, the Rev AI Speech R&D team attended the conference in Graz, Austria. The team and I attended oral and poster sessions, looking for new trends in speech research, with a focus on general ASR systems (both hybrid and E2E), Speaker Diarization, and Rich Transcription. Of course, we also took the opportunity to reconnect with old colleagues, enjoy the sights in Graz, and meet new people from industry and academia.

With this in mind, I wanted to share some of our favorite papers and posters.

ASR Systems

The first talk of the conference, given by Ralf Schlüter, presented a very good overview of modeling in ASR over the past few decades. From this presentation and many others during the week, it is clear that there has been tremendous progress in end-to-end (E2E) systems, but that those systems have not yet caught up with traditional hybrid systems. It still seems like we are 3 to 5 years away from state-of-the-art E2E systems being used in production, and, by that time, we hopefully won’t be calling them end-to-end anymore.

What we found most interesting (and relevant to us) were the more traditional ASR systems, along with some of the details tucked away in the posters and oral presentations.

Here is our selection of favorite papers/posters:

Paper / Poster	Part of the Abstract
LF-MMI Training of Bayesian and Gaussian Process Time Delay Neural Networks for Speech Recognition (PDF)	This paper investigates the use of Bayesian learning and Gaussian Process (GP) based hidden activations to replace the deterministic parameter estimates of standard lattice-free maximum mutual information (LF-MMI) criterion trained time-delay neural network (TDNN) acoustic models.
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention (PDF)	We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task.
Language Modeling with Deep Transformers (PDF)	We explore deep autoregressive Transformer models in language modeling for speech recognition.
Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models (PDF)	This paper proposes a new architecture to perform real-time one-pass decoding using LSTM language models. To make decoding efficient, the estimation of look-ahead scores was accelerated by precomputing static look-ahead tables.
Improving Keyword Spotting and Language Identification via Neural Architecture Search at Scale (PDF)	In this paper, we present a novel Neural Architecture Search (NAS) framework to improve keyword spotting and spoken language identification models.

Speaker Diarization

This year’s conference was also promising for advances in Speaker Diarization, with the second edition of the DIHARD challenge. The team came back with a few new ideas to test out from the different poster sessions on diarization.

Paper / Poster	Part of the Abstract
Speaker Diarization with Lexical Information	This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
The 2019 Inaugural Fearless Steps Challenge: A Giant Leap for Naturalistic Audio	The 2019 FEARLESS STEPS (FS-1) Challenge is an initial step to motivate a streamlined and collaborative effort from the speech and language community towards addressing massive naturalistic audio, the first of its kind.
End-to-End Neural Speaker Diarization with Permutation-Free Objectives	In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Our model has a single neural network that directly outputs speaker diarization results.
Joint Speech Recognition and Speaker Diarization via Sequence Transduction	Our approach utilizes both linguistic and acoustic cues to infer speaker roles, as opposed to typical SD systems, which only use acoustic cues. […] Compared to a competitive conventional baseline, our approach improves word-level diarization error rate from 15.8% to 2.2%.
ViVoLAB Speaker Diarization System for the DIHARD 2019 Challenge	Winner of the second DIHARD challenge. This paper presents the latest improvements in Speaker Diarization obtained by ViVoLAB research group for the 2019 DIHARD Diarization Challenge.

Rich Transcription

Lastly, the topic of creating full, rich transcriptions from any given audio is obviously very interesting for us. Again, the team came back with a lot of ideas to test out, and confirmation that we are indeed building state-of-the-art systems here at Rev AI.

Paper / Poster	Part of the Abstract
Exploring Methods for the Automatic Detection of Errors in Manual Transcription (PDF)	In this work, we propose a novel acoustic model-based approach, focusing on the phonetic sequence of speech. Both methods have been evaluated on a completely real dataset, which was originally transcribed with errors and strictly corrected manually afterwards.
Leveraging a character, word and prosody triplet for an ASR error robust and agglutination friendly punctuation approach (PDF)	In this work, we propose to consider character, word, and prosody-based features all at once to provide a robust and highly language-independent platform for punctuation recovery, which can also deal well with highly agglutinating languages with less constrained word order.
Automatic Compression of Subtitles with Neural Networks and its Effect on User Experience (PDF)	We propose a neural network model based on an encoder-decoder approach with the possibility of integrating the desired compression ratio.
The Althingi ASR System (PDF)	All performed speeches in the Icelandic parliament, Althingi, are transcribed and published. An automatic speech recognition system (ASR) has been developed to reduce the manual work involved.