Highlights of INTERSPEECH 2020
First off, the organisers of INTERSPEECH 2020 really outdid themselves with the first ever fully remote INTERSPEECH this year. Of course, we all would have loved a trip to Shanghai! But the format of this year’s presentations was fantastic. Each paper was accompanied by a 1.5-minute overview talk and a 15-minute in-depth presentation of the research. For the audiovisual learners among us, this was extremely helpful!
Let’s dive into some of our highlights this year. Our focus is on speech synthesis, our main area of research; and because we work with video translation, we’re particularly interested in anything multilingual, multimodal, translation-, emotion- or prosody-related, bringing us closer to our goal of mimicking the full range of human expression within our systems.
Finally, if you missed our own contributions (Mohan et al. 2020, Staib et al. 2020), we have previously shared blog posts describing our work on incremental TTS and zero-shot TTS.
It is probably also worth highlighting a few interesting presentation styles that we think could only happen in speech. While recording presentations with Zoom was recommended, it seems that people with access to a soundproof recording booth did not forgo the opportunity to use it. We also saw a few presentations with their own subtitles, which we found really considerate. Finally, we’re pretty sure that at least one presenter was bold enough to have their presentation read out by a TTS system.
While code-switching and multilingual TTS were still very novel areas of research in 2019, a lively community seems to have evolved around code-switched TTS in 2020 (e.g., Zhao et al., Liu and Mak, Fu et al.). We welcome this uptake, given it is something we have been researching since our inception. We were especially pleased to see the work by Federico et al. on automatic dubbing, continuing in line with earlier work presented at INTERSPEECH 2019 (Öktem et al.). Finally, proof points on (monolingual) prosody transfer, as presented by Karlapati et al., open up new avenues towards expressivity, intonation, emotion and style. We are very excited to see this field grow, and hope for continuations of work such as Tyagi et al., demonstrating how we can use information from real speech to create more varied, natural performances of a sentence.
Many of the other things we saw are continuations of previous years’ trends, for instance in the domain of neural vocoders and few-shot, multi-speaker TTS. It seems that in these areas, all “first time ever”-s, the “hero experiments” (borrowing the terminology of John R. Treichler at ICASSP 2019) have been done, and we are in the phase of toying and tinkering with different combinations of tried and tested ideas.
Last year, we left INTERSPEECH with a sense that neural vocoders would soon be “solved” once and for all, and that the speech community would move on from them. The reason was the sheer number of ideas and research groups working towards the same goal: a fast, efficient, universal neural vocoder that works well on generated text-to-speech (TTS) features. This year, the hamster wheel of neural vocoders continues: the goal is still to be faster, more compute-efficient, more natural, more robust, more universal or more adaptable to a new speaker with little additional data. Many papers exploited and combined existing ideas from last year: models improving on the GAN idea (such as VocGAN by Yang et al.), LPC-feature-based models (e.g., Kanagawa and Ijima), source-filter models (with Wang and Yamagishi continuing at the forefront), and flow-based models (e.g., WG-WaveNet by Hsu and Lee), along with other tricks such as weight pruning and knowledge distillation (e.g., Tian et al.). Most of these papers target speed or efficiency while maintaining the level of quality we get with slow, autoregressive neural vocoders. Another important goal is “universality”, or generality. To this end, Paul et al. trained a speaker-conditional WaveRNN, conditioned on a speaker embedding learned from a speaker verification task. This improved the quality of vocoded samples for both seen and unseen speakers.
An aspect of neural vocoding that we think deserves a little more attention in the future is training on generated versus ground-truth features: while most production environments require a neural vocoder to be trained for a specific TTS system (using teacher-forced outputs), being able to train on ground-truth features is highly desirable, as it allows untranscribed data to be used and needs fewer adjustments over time. Wu et al. took a stab in this direction by developing a CycleGAN-based post-filtering approach to create neural vocoder training data and enhance inference features.
As the number of applications for speech technology continues to grow, voice personalization has become increasingly desirable. Traditionally, data-driven voice conversion and adaptation techniques have grappled with the amount of target speaker data required, the need for parallel training corpora, and the best approaches for capturing speaker identity information. To that end, this year’s INTERSPEECH featured a host of new approaches that tackle those very challenges.
Polyak et al.’s unsupervised waveform-to-waveform approach eliminates the need for parallel training data entirely, instead using a pre-trained ASR model as an encoder to extract speech features that are then passed, together with a target speaker embedding, through a WaveNet decoder to directly generate speech in the target speaker’s voice. The speaker-invariant nature of the ASR features allows the same audio sample to be used as both input and output during training. An approach that allows for training in both parallel and non-parallel settings is proposed by Ishihara and Saito: they use a dual-encoder setup to extract content and speaker features from the source and reference respectively. We like that they’ve addressed the limitations of using fixed-length embeddings to represent speaker information, especially in pursuit of their goal of one-shot conversion — voice conversion given just a single sample from a target speaker unseen during training. An additional attention mechanism helps the model synthesise the encoder features into content-dependent speaker information to inform the generation process. While not addressing speaker conversion explicitly, Choi et al. approach voice personalization as a few-shot TTS problem — synthesising speech in the voice of a speaker unseen in training. Similarly motivated to go beyond the fixed-length speaker embedding, their Attentron TTS model augments Tacotron with both coarse- and fine-grained speech encoders, extracting a high-level speech embedding and variable-length style embeddings respectively from one or more reference audios.
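The trick that lets a voice conversion model like Polyak et al.’s train without parallel data can be pictured with a toy sketch. Everything below is our own stand-in, not their code: the real content encoder is a frozen pre-trained ASR network, not a mean-pool, and the real decoder is a WaveNet.

```python
import numpy as np

def content_features(audio, frame=4):
    """Stand-in for a frozen pre-trained ASR encoder: one feature per
    frame that depends on what is in the signal, not on who said it."""
    T = (len(audio) // frame) * frame
    return audio[:T].reshape(-1, frame).mean(axis=1)

def make_training_pair(audio, speaker_embedding):
    """Unsupervised pairing: because the content features are (ideally)
    speaker-invariant, the decoder's input is (content, speaker) and
    its target is the *same* audio clip, so no parallel corpus and no
    transcriptions are needed. At conversion time, you simply swap in a
    different speaker embedding."""
    return (content_features(audio), speaker_embedding), audio
```

The key design point is that the speaker embedding is the decoder's only route back to speaker identity, which is exactly what makes swapping it at inference time effective.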
As voice conversion and cloning techniques continue to move forward alongside other areas of speech, we’re excited to see how new trends in multi-speaker TTS modeling, like the use of Deep Gaussian Processes (DGPs) in Mitsui et al., might be applied in these same one- or few-shot fashions!
Naturalness has long been the metric that speech synthesis systems have sought to maximise. With the advent of TTS models like Tacotron 2 (amongst others), we have seen systems inching ever closer to the score obtained by human utterances. In parallel, however, there has been a push to go beyond ‘natural’-sounding speech towards improving expressivity. The nuance is subtle, but extremely pertinent, not least for an application such as ours. For example, a voice may sound extremely human-like, but if it is monotonous and single-paced, it makes watching a 10-minute video tedious work indeed! Here is our list of the most interesting contenders in this year’s quest for more expressivity:
Karlapati et al.’s goal is to transfer the prosody (very loosely speaking, aspects like intonation, rhythm, emphasis etc.) from a source sentence to a target. Given an utterance of Sentence 1 by Speaker A, they aim to get Speaker B to synthesise the same sentence with the same prosodic style. Using matching source and target texts, they encode the reference mel spectrogram into a temporal series of embeddings, encouraging the model to transfer prosodic information while discouraging any speaker information from leaking through.

Another popular route for improving expressivity was the incorporation of linguistic or semantic information. Hu et al. analysed the suitability of automatic ToBI markings (with a neat Python wrapper presented at INTERSPEECH 2019) for Dutch and proposed additional features that would enhance a Dutch synthesis system. Kenter et al. went down the language model route instead and augmented an English TTS system with a BERT model. They observed that a smaller BERT model actually worked better than a large one, and that fine-tuning this model was pivotal to obtaining a high level of performance from the overall system! Hono et al. also leverage a linguistic model but additionally model the latent space hierarchically: they learn word-level latents conditioned on (learnt) phrase-level latents, which are in turn conditioned on (learnt) utterance-level latents, arguing that the different aspects of prosody are better modelled at varying temporal scales. Further, Zhang et al. extract prosodic information at syllable level only and demonstrate an improvement in preference score over a phoneme-level approach.
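The hierarchical conditioning idea from Hono et al. can be pictured with a toy sampling sketch. This is entirely our own illustration, not their model: the real latents are learned by the network, and the dimensions and noise scales below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hierarchical_latents(n_phrases, words_per_phrase, dim=8):
    """Toy hierarchical prosody prior: each word-level latent is drawn
    around its phrase-level latent, which is in turn drawn around a
    single utterance-level latent. Slow-moving aspects of prosody are
    thus shared at the coarser scales, while fine variation lives at
    the word level."""
    utterance = rng.normal(size=dim)
    phrases = [utterance + 0.5 * rng.normal(size=dim)
               for _ in range(n_phrases)]
    words = [[p + 0.25 * rng.normal(size=dim)
              for _ in range(words_per_phrase)]
             for p in phrases]
    return utterance, phrases, words
```

The nesting is the point: two words in the same phrase share that phrase's latent, so their prosody is correlated by construction rather than being modelled independently.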
And that only touches on a handful of papers in this area, which goes to show the breadth of work still to be done on getting synthesised voices to sound more expressive! Exciting times indeed!
With the rapid adoption of machine learning models in production, interpretability and controllability have been high on the list of desiderata. In speech, a particularly promising aspect from our perspective is the quest for controllable TTS models. While a number of papers in the section on Expressive TTS claim to obtain varying levels of controllability as a corollary (Hono et al., Zhang et al., etc.), in this section we focus on papers for which this was a central theme.
Raitio et al. propose a ‘prosody encoder’ which learns (in a supervised manner) to predict a global acoustic feature vector from the phonemes. The upshot of this approach is that one can bias this acoustic feature vector to alter the characteristics of the synthesised output. Their samples page is well worth a look (or, more aptly, a listen) for those curious about how effective the approach is. Morrison et al. have developed a model which can take into account temporal prosody constraints specified by a user. They propose a pitch-shifting neural vocoder which enables them to synthesise a variety of pitch contours without a degradation in quality. Once again, their samples page might be well worth your time! In the interestingly named Hider-Finder-Combiner model, Webber et al. propose an architecture for modifying arbitrary parameters of a speech signal, and demonstrate the utility of their approach by specifically modifying the F0 of the synthesised speech. The Hider is tasked with generating an informative embedding while hiding all information about the control parameter, and is trained adversarially against the Finder. The goal of the Combiner is then to combine this embedding with the new, desired value of the control parameter.
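The adversarial tug-of-war between Hider and Finder can be sketched at the loss level. This is our own minimal illustration under assumed names; the weighting term `lam` and the mean-squared-error choice are our assumptions, not details from the paper.

```python
import numpy as np

def finder_loss(predicted_param, true_param):
    """The Finder tries to recover the control parameter (e.g. F0)
    from the Hider's embedding; it minimises this error."""
    return float(np.mean((predicted_param - true_param) ** 2))

def hider_loss(recon_loss, finder_mse, lam=0.5):
    """The Hider minimises reconstruction error while *maximising* the
    Finder's error (hence the adversarial minus sign), so its embedding
    stays informative about the speech but blind to the control
    parameter the Combiner will later re-inject."""
    return recon_loss - lam * finder_mse
```

Training alternates between the two objectives; at equilibrium the embedding carries as little F0 information as the Finder can exploit.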
As we begin to understand the nuances of some of our models better, we are sure to get further granularity of control in our synthesised speech. However, even at this stage, the approaches (and samples!) above allow us a glimpse at the massive landscape that is the arena of fully controllable TTS!
Even with the convenient virtual format, there was no way for our small team to stay on top of all the cool research being presented this year. Instead, we tried to make the most of it by focusing on some of the key areas highlighted above. That being said, we also saw a lot of different ideas we liked in the process, and they were nothing if not highlights in their own right. Here are just a few we happily stumbled upon:
- Towards Learning a Universal Non-Semantic Representation of Speech: Shor et al. not only propose a set of benchmark non-semantic tasks (think VTAB for vision and GLUE for natural language understanding) to assess speech representations, they also submit a triplet-loss objective for learning them. We appreciated how the loss function simply captures the intuition that non-semantic information like speaker identity and emotion changes slowly over time, and we’re excited to see what this new benchmark could mean for the assessment of effective transfer learning in the speech domain.
- An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets: Having observed for ourselves the varying effect of including and excluding specific speakers during training, it’s gratifying to see Oplustil et al. validate that more data isn’t always better. They compare models trained on different speaker subsets identified using k-means clustering of speaker representations and find that a good subset model outperforms the baseline trained on all speakers in a subjective listening test of naturalness. We’re definitely keen to see more research in the area of automating what constitutes a good dataset!
- Attention Forcing for Speech Synthesis: Dou et al. do a good job here of highlighting some of the existing issues around training autoregressive seq2seq models and giving an overview of approaches like scheduled sampling, teacher forcing, and professor forcing that have been proposed to address them. They propose another training approach, attention forcing, which neatly delegates the responsibilities of learning alignment and inferring output to two separate models: one model is trained with teacher forcing on ground-truth output to learn alignments, and the inference model is then trained with attention forcing using those learned alignments as reference, resulting in improvements in speech quality.
- A Pyramid Recurrent Network for Predicting Crowdsourced Speech-Quality Ratings of Real-World Signals: Dong and Williamson take an ambitious approach to automating the evaluation of speech quality without resorting to objective measures as a proxy for human judgement. They conduct a large-scale listening test using real audio data under different conditions and compile more than 180,000 human rating samples in order to train a model to predict MOS scores. While we find the deep learning spin on predicting human perception promising, we’re even more impressed by the dedication involved in making this happen!
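The triplet objective behind Shor et al.’s non-semantic representations is simple enough to sketch directly. This is a minimal numpy version of a generic triplet hinge loss; the margin value and the squared-L2 distance are our assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss over squared L2 distances: the anchor embedding
    should sit closer to the positive (a temporally nearby segment of
    the same clip) than to the negative (a segment from a different
    clip) by at least `margin`. This encodes the intuition that
    non-semantic traits like speaker identity and emotion change
    slowly over time."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)
```

No labels are needed: positives and negatives come for free from how the audio segments were sampled, which is what makes this a self-supervised objective.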
That’s it, folks, from our first ever virtual INTERSPEECH 2020 experience! Looking forward to seeing you all at INTERSPEECH 2021 in Brno!