INTERSPEECH is always a fascinating conference, bringing together really smart folks from all over the world who work on a very broad range of applications and research topics. In our daily work, we are typically focused on the topics that are relevant to us as a company, so it’s nice to take time to learn about new applications and new research areas. For example, I was amazed to learn about the new concerns over TTS fraud and the relatively new (to me at least) field dedicated to detecting spoofing (e.g., the ASVspoof Challenge).
This year, the Rev AI Speech R&D team attended the conference in Graz, Austria. The team and I attended oral and poster sessions, looking for new trends in speech research, with a focus on general ASR systems (both hybrid and E2E), Speaker Diarization, and Rich Transcription. Of course, we also took the opportunity to reconnect with old colleagues, enjoy the sights in Graz, and meet new people from industry and academia.
With this in mind, I wanted to share some of our favorite papers and posters.
ASR Systems
The first talk of the conference, given by Ralf Schlüter, presented a very good overview of ASR modeling over the past few decades. From this presentation and many others during the week, it is clear that there has been tremendous progress in end-to-end (E2E) systems, but that those systems have not yet caught up with traditional hybrid systems. It still seems like we are 3-5 years away from state-of-the-art E2E systems being used in production and, by that time, we hopefully won’t be calling them end-to-end anymore.
What we found most interesting (and relevant to us) were the more traditional ASR systems, along with some of the hidden details tucked into the different posters and oral presentations.
Here is our selection of favorite papers/posters:
Paper / Poster | Part of the abstract |
The JHU ASR System for VOiCES from a Distance Challenge 2019 (PDF) | This paper describes the system developed by the JHU team for automatic speech recognition (ASR) of the VOiCES from a Distance Challenge 2019, focusing on single-channel distant/far-field audio under noisy conditions. |
LF-MMI Training of Bayesian and Gaussian Process Time Delay Neural Networks for Speech Recognition (PDF) | This paper investigates the use of Bayesian learning and Gaussian Process (GP) based hidden activations to replace the deterministic parameter estimates of standard lattice-free maximum mutual information (LF-MMI) criterion trained time-delay neural network (TDNN) acoustic models. |
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention (PDF) | We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. |
Language Modeling with Deep Transformers (PDF) | We explore deep autoregressive Transformer models in language modeling for speech recognition. |
Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models (PDF) | This paper proposes a new architecture to perform real-time one-pass decoding using LSTM language models. To make decoding efficient, the estimation of look-ahead scores was accelerated by precomputing static look-ahead tables. |
Improving Keyword Spotting and Language Identification via Neural Architecture Search at Scale (PDF) | In this paper, we present a novel Neural Architecture Search (NAS) framework to improve keyword spotting and spoken language identification models. |
Speaker Diarization
This year’s conference was also promising for advances in Speaker Diarization, with the second edition of the DIHARD challenge. The team came back with a few new ideas to test out from the different poster sessions on diarization.
Paper / Poster | Part of the abstract |
Speaker Diarization with Lexical Information (PDF) | This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition. |
The 2019 Inaugural Fearless Steps Challenge: A Giant Leap for Naturalistic Audio (PDF) | The 2019 FEARLESS STEPS (FS-1) Challenge is an initial step to motivate a streamlined and collaborative effort from the speech and language community towards addressing massive naturalistic audio, the first of its kind. |
End-to-End Neural Speaker Diarization with Permutation-Free Objectives (PDF) | In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Our model has a single neural network that directly outputs speaker diarization results. |
Joint Speech Recognition and Speaker Diarization via Sequence Transduction (PDF) | Our approach utilizes both linguistic and acoustic cues to infer speaker roles, as opposed to typical SD systems, which only use acoustic cues. […] Compared to a competitive conventional baseline, our approach improves word-level diarization error rate from 15.8% to 2.2%. |
ViVoLAB Speaker Diarization System for the DIHARD 2019 Challenge (PDF) | Winner of the second DIHARD challenge. This paper presents the latest improvements in Speaker Diarization obtained by ViVoLAB research group for the 2019 DIHARD Diarization Challenge. |
Rich Transcription
Lastly, the topic of creating full rich transcriptions from any given audio is obviously very interesting for us. Again, the team came back with a lot of ideas to test out, and confirmation that we are indeed building state-of-the-art systems here for Rev AI.
Paper / Poster | Part of the abstract |
Exploring Methods for the Automatic Detection of Errors in Manual Transcription (PDF) | In this work, we propose a novel acoustic model based approach, focusing on the phonetic sequence of speech. Both methods have been evaluated on a completely real dataset, which was originally transcribed with errors and strictly corrected manually afterwards. |
Leveraging a character, word and prosody triplet for an ASR error robust and agglutination friendly punctuation approach (PDF) | In this work we propose to consider character, word and prosody based features all at once to provide a robust and highly language independent platform for punctuation recovery, which can deal also well with highly agglutinating languages with less constrained word order. |
Automatic Compression of Subtitles with Neural Networks and its Effect on User Experience (PDF) | We propose a neural network model based on an encoder-decoder approach with the possibility of integrating the desired compression ratio. |
The Althingi ASR System (PDF) | All performed speeches in the Icelandic parliament, Althingi, are transcribed and published. An automatic speech recognition system (ASR) has been developed to reduce the manual work involved. |
If there were any interesting papers that we missed, or if you want to chat about INTERSPEECH, send us a note at speech@rev.com.