26 August 2021/8 min read

Tutorial on ADEPT evaluation: a step-by-step guide

This how-to guide gives step-by-step recommendations for setting up an ADEPT evaluation using the ADEPT corpus. If you’d like to know more about the research behind ADEPT, see our paper on arXiv or our accompanying blog post.

ADEPT is meant to evaluate English TTS prosody transfer models (models that transfer prosody from an English reference to an English target). We do not yet know how useful the corpus is for evaluating cross-lingual prosody transfer, or prosody in general, but we hope these will be our next steps with ADEPT research.

Step 1: Determine which aspects of prosody you intend to transfer

[Skip to step 2 if you wish to set up all 12 ADEPT evaluation tasks.]

As discussed in the paper, we’ve identified six classes of speech that have a prosodic effect on F0, amplitude, duration, spectral tilt and/or segmental reduction. Previous research also suggests that some of these classes have a local effect, meaning the acoustic cues are affected differently at different points in the utterance, whilst others have a global effect, meaning the effect on acoustic cues is more consistent throughout the utterance. Though we did not test any of these claims directly in our research, they can still be a helpful guide for determining which classes of speech to test.

For example, if your model embeds the F0 range of the reference sample and transfers it onto the target, then it will probably perform better at transferring speech classes that have a global effect on F0 (emotion and interpersonal attitude). We’ve broken down the expected prosodic effects, based on previous research, in the table below.

	Emotion	Interpersonal attitude	Propositional attitude	Topical emphasis	Syntactic phrasing	Marked tonicity
Global F0	✅	✅
Global amplitude	✅	✅
Global duration	✅	✅
Global spectral tilt	✅	✅
Local F0			✅	✅		✅
Local amplitude			✅	✅		✅
Local duration			✅	✅	✅	✅
Local spectral tilt			✅
Local segmental reduction						✅

We do not claim that this list is exhaustive; classes may have additional prosodic effects that are not outlined in the table. However, if you do not wish to set up all 12 evaluation tasks (6 classes $\times$ 2 speakers [male and female]), the above table can guide you towards which tasks to select.
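As a rough illustration of using the table to pick tasks, the sketch below transcribes the table into a Python dictionary and returns the speaker/class tasks relevant to a set of acoustic cues. The names `EFFECTS`, `SPEAKERS` and `tasks_for_cues` are our own for this example and are not part of the ADEPT corpus or any ADEPT tooling.

```python
# Expected prosodic effects, transcribed from the table above.
# Keys are acoustic cues; values are the classes expected to affect them.
GLOBAL_CLASSES = {"emotion", "interpersonal attitude"}
EFFECTS = {
    "global F0": GLOBAL_CLASSES,
    "global amplitude": GLOBAL_CLASSES,
    "global duration": GLOBAL_CLASSES,
    "global spectral tilt": GLOBAL_CLASSES,
    "local F0": {"propositional attitude", "topical emphasis", "marked tonicity"},
    "local amplitude": {"propositional attitude", "topical emphasis", "marked tonicity"},
    "local duration": {
        "propositional attitude",
        "topical emphasis",
        "syntactic phrasing",
        "marked tonicity",
    },
    "local spectral tilt": {"propositional attitude"},
    "local segmental reduction": {"marked tonicity"},
}

SPEAKERS = ("female", "male")


def tasks_for_cues(cues):
    """Return sorted (speaker, class) tasks for the cues a model is expected to transfer."""
    classes = set()
    for cue in cues:
        classes |= EFFECTS[cue]
    return sorted((speaker, cls) for cls in classes for speaker in SPEAKERS)


# Example: a model that only transfers the global F0 range.
selected = tasks_for_cues(["global F0"])
```

Here `selected` contains four tasks: emotion and interpersonal attitude for each of the two speakers. Passing every cue in `EFFECTS` yields all 12 tasks.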

Step 2: Generate your samples

To evaluate a single model, you’ll need to synthesise samples in all of the subcategories or interpretations for each speaker and class you’re evaluating. Use the relevant ADEPT sample as the prosody reference, and synthesise the same text as that reference. For example, if you’re evaluating female syntactic phrasing, you should generate 10 samples whose prosody is transferred from, and whose text matches, each of the following references:
syntactic_phrasing/interpretation_1/ad00_0400.wav
syntactic_phrasing/interpretation_1/ad00_0401.wav
syntactic_phrasing/interpretation_1/ad00_0402.wav
syntactic_phrasing/interpretation_1/ad00_0403.wav
syntactic_phrasing/interpretation_1/ad00_0404.wav
syntactic_phrasing/interpretation_2/ad00_0400.wav
syntactic_phrasing/interpretation_2/ad00_0401.wav
syntactic_phrasing/interpretation_2/ad00_0402.wav
syntactic_phrasing/interpretation_2/ad00_0403.wav
syntactic_phrasing/interpretation_2/ad00_0404.wav

These 10 samples correspond to 2 interpretations for each of the 5 syntactic phrasing sentences.
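If you are scripting the synthesis, the reference list above can be built programmatically. This is a minimal sketch that only reproduces the 10 paths listed above; the helper name `reference_paths` is hypothetical, and the directory and file names for other classes and speakers should be taken from the corpus itself rather than inferred from this pattern.

```python
from itertools import product


def reference_paths(class_dir, subdirs, utterance_ids):
    """Build one reference path per (subcategory or interpretation, utterance) pair."""
    return [
        f"{class_dir}/{subdir}/{utt}.wav"
        for subdir, utt in product(subdirs, utterance_ids)
    ]


# Female syntactic phrasing: 2 interpretations x 5 sentences = 10 references.
paths = reference_paths(
    "syntactic_phrasing",
    ["interpretation_1", "interpretation_2"],
    [f"ad00_{n:04d}" for n in range(400, 405)],
)
```

Each path in `paths` is then used twice by your model: as the prosody reference, and as the source of the text to synthesise.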

The following table shows how many samples to generate for each speaker and class. Thus, to perform a full ADEPT evaluation (all 12 tasks), you need to generate 195 samples.

	Emotion	Interpersonal attitude	Propositional attitude	Topical emphasis	Syntactic phrasing	Marked tonicity	Total
Female	25	20	15	15	10	10	95
Male	25	20	20	15	10	10	100
Total	50	40	35	30	20	20	195
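If you only plan to run a subset of the tasks, a small lookup like the one below can tell you how many samples to generate. The counts are copied from the table above; the names `SAMPLES_PER_TASK` and `total_samples` are illustrative only.

```python
# Samples to generate per speaker and class, copied from the table above.
SAMPLES_PER_TASK = {
    "female": {
        "emotion": 25,
        "interpersonal attitude": 20,
        "propositional attitude": 15,
        "topical emphasis": 15,
        "syntactic phrasing": 10,
        "marked tonicity": 10,
    },
    "male": {
        "emotion": 25,
        "interpersonal attitude": 20,
        "propositional attitude": 20,
        "topical emphasis": 15,
        "syntactic phrasing": 10,
        "marked tonicity": 10,
    },
}


def total_samples(tasks):
    """Sum the samples needed for an iterable of (speaker, class) tasks."""
    return sum(SAMPLES_PER_TASK[speaker][cls] for speaker, cls in tasks)


# All 12 tasks.
all_tasks = [(s, c) for s, classes in SAMPLES_PER_TASK.items() for c in classes]
full_total = total_samples(all_tasks)
```

For the full evaluation, `full_total` matches the 195 in the table; for a subset, pass only the tasks you selected in step 1.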

Step 3: Make your evaluation tasks

Some classes use a multiple-stimulus design, in which listeners hear multiple audio samples per question, whilst other classes have only a single stimulus per question. For the three classes with a single-stimulus design (topical emphasis, syntactic phrasing, and marked tonicity), possible answer choices for each question are available on Zenodo within adept_prompts.json.

For all tasks, we advise having at least 30 participants. All samples you use in your own evaluation should be the samples you generated with your prosody transfer model, not our natural samples.

Emotion

Both the female and male emotion tasks use a multiple-stimulus design with 20 questions and 5 choices per question. The choices are the same sentence spoken in each of the four emotion subcategories, plus neutral. Each of the 5 sentences is presented 4 times (once per emotion subcategory, hence 20 questions in total), but each time the sentence is presented, the question and correct answer are different. The question asked is “In which of the following samples does the speaker sound most X?”, where X is one of fearful, sad, joyful, or angry, and the correct answer is the fearful, sad, joyful, or angry sample for that sentence respectively.
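The question layout described above can be sketched in a few lines of Python. This is an illustration under our own assumptions: the sentence identifiers are placeholders, the labels are the adjectives used in the question rather than the corpus directory names, and shuffling the choices and question order is a design choice of this sketch rather than something the ADEPT setup prescribes.

```python
import random

EMOTIONS = ("fearful", "sad", "joyful", "angry")
CHOICE_LABELS = EMOTIONS + ("neutral",)  # 5 choices per question


def build_emotion_questions(sentence_ids, seed=0):
    """Create one question per (sentence, target emotion) pair."""
    rng = random.Random(seed)
    questions = []
    for sentence in sentence_ids:
        for target in EMOTIONS:
            # Every choice is the same sentence, synthesised in a different style.
            choices = [(sentence, label) for label in CHOICE_LABELS]
            rng.shuffle(choices)  # assumption: randomise presentation order
            questions.append(
                {
                    "prompt": "In which of the following samples does the "
                    f"speaker sound most {target}?",
                    "choices": choices,
                    "answer": (sentence, target),
                }
            )
    rng.shuffle(questions)  # assumption: randomise question order
    return questions


# Placeholder identifiers for the 5 emotion sentences of one speaker.
questions = build_emotion_questions([f"sentence_{i}" for i in range(1, 6)])
```

With 5 sentences this yields 20 questions, each with 5 choices, and the 25 distinct (sentence, label) pairs correspond to the 25 emotion samples generated per speaker in step 2.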

As an example, see our male test below.