Tutorial on ADEPT evaluation: a step-by-step guide
This how-to guide gives step-by-step recommendations for setting up an ADEPT evaluation using the ADEPT corpus. If you’d like to know more about the research behind ADEPT, you can find our paper on arXiv here, or our accompanying blog post here.
ADEPT is meant to evaluate English TTS prosody transfer models (models that transfer prosody from an English reference to an English target). We do not yet know how useful the corpus is for evaluating cross-lingual prosody transfer, or prosody in general, but we hope those will be our next steps with ADEPT research.
Step 1: Determine which aspects of prosody you intend to transfer
[Skip to step 2 if you wish to set up all 12 ADEPT evaluation tasks.]
As discussed in the paper, we’ve identified six classes of speech that have a prosodic effect on F0, amplitude, duration, spectral tilt and/or segmental reduction. Previous research also suggests that some of these classes have a local effect, meaning the acoustic cues are affected differently throughout the utterance, whilst others have a global effect, meaning the effect on acoustic cues is more consistent throughout the utterance. Though we did not test any of these claims directly in our research, they can still be a helpful guide for determining which classes of speech to test.
For example, if your model is embedding the F0 range of the reference sample and transferring this onto the target, then your model will probably perform better at transferring speech classes that have global effect on F0 (emotion and interpersonal attitude). We’ve broken down the expected prosodic effects based on previous research in the table below.
Prosodic effect | Emotion | Interpersonal attitude | Propositional attitude | Topical emphasis | Syntactic phrasing | Marked tonicity |
---|---|---|---|---|---|---|
Global F0 | ✅ | ✅ | ||||
Global amplitude | ✅ | ✅ | ||||
Global duration | ✅ | ✅ | ||||
Global spectral tilt | ✅ | ✅ | ||||
Local F0 | ✅ | ✅ | ✅ | |||
Local amplitude | ✅ | ✅ | ✅ | |||
Local duration | ✅ | ✅ | ✅ | ✅ | ||
Local spectral tilt | ✅ | |||||
Local segmental reduction | ✅ |
We do not claim that this list is exhaustive; classes may have additional prosodic effects that are not outlined in the table. However, if you do not wish to set up all 12 evaluation tasks (6 classes × 2 speakers [male and female]), the above table can guide you towards which tasks to select.
Step 2: Generate your samples
To evaluate a single model, you’ll need to synthesise samples in all of the subcategories or interpretations for each speaker and class you’re evaluating. You should use the relevant ADEPT sample as the reference, and synthesise the text of that reference. For example, if you’re evaluating female syntactic phrasing, you should generate 10 samples, whose prosody is transferred from the following, and whose text is the same as the following:
syntactic_phrasing/interpretation_1/ad00_0400.wav
syntactic_phrasing/interpretation_1/ad00_0401.wav
syntactic_phrasing/interpretation_1/ad00_0402.wav
syntactic_phrasing/interpretation_1/ad00_0403.wav
syntactic_phrasing/interpretation_1/ad00_0404.wav
syntactic_phrasing/interpretation_2/ad00_0400.wav
syntactic_phrasing/interpretation_2/ad00_0401.wav
syntactic_phrasing/interpretation_2/ad00_0402.wav
syntactic_phrasing/interpretation_2/ad00_0403.wav
syntactic_phrasing/interpretation_2/ad00_0404.wav
These 10 samples correspond to 2 interpretations for each of the 5 syntactic phrasing sentences.
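The list above follows a regular pattern, so it can be generated rather than typed out. The sketch below assumes the directory layout and file naming shown in the list; the function name and its parameters are hypothetical and only illustrate the pattern.

```python
def syntactic_phrasing_references(first_id=400, n_sentences=5, n_interpretations=2):
    """Build the reference paths for the female syntactic phrasing task.

    Assumes the layout shown in the list above:
    syntactic_phrasing/interpretation_<k>/ad00_<4-digit id>.wav
    """
    paths = []
    for interpretation in range(1, n_interpretations + 1):
        for offset in range(n_sentences):
            sentence_id = first_id + offset
            paths.append(
                f"syntactic_phrasing/interpretation_{interpretation}/ad00_{sentence_id:04d}.wav"
            )
    return paths


references = syntactic_phrasing_references()
for path in references:
    print(path)
```

Each path is the reference for one sample to synthesise: the prosody comes from that file, and the text is the text spoken in it.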
The following table shows how many samples to generate for each speaker and class. Thus, to perform a full ADEPT evaluation (all 12 tasks), you need to generate 195 samples.
Speaker | Emotion | Interpersonal attitude | Propositional attitude | Topical emphasis | Syntactic phrasing | Marked tonicity | Total |
---|---|---|---|---|---|---|---|
Female | 25 | 20 | 15 | 15 | 10 | 10 | 95 |
Male | 25 | 20 | 20 | 15 | 10 | 10 | 100 |
Total | 50 | 40 | 35 | 30 | 20 | 20 | 195 |
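If you are only running a subset of the 12 tasks, the number of samples to generate can be read off the table. The snippet below is a small sketch that encodes the table's counts; the dictionary keys and the helper name `samples_needed` are hypothetical.

```python
# Sample counts per speaker and class, copied from the table above.
SAMPLES = {
    "female": {
        "emotion": 25,
        "interpersonal_attitude": 20,
        "propositional_attitude": 15,
        "topical_emphasis": 15,
        "syntactic_phrasing": 10,
        "marked_tonicity": 10,
    },
    "male": {
        "emotion": 25,
        "interpersonal_attitude": 20,
        "propositional_attitude": 20,
        "topical_emphasis": 15,
        "syntactic_phrasing": 10,
        "marked_tonicity": 10,
    },
}


def samples_needed(tasks):
    """Sum the samples required for an iterable of (speaker, class) tasks."""
    return sum(SAMPLES[speaker][class_name] for speaker, class_name in tasks)


all_tasks = [(speaker, c) for speaker in SAMPLES for c in SAMPLES[speaker]]
print(samples_needed(all_tasks))  # full evaluation: 195
print(samples_needed([("female", "emotion"), ("male", "emotion")]))  # 50
```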
Step 3: Make your evaluation tasks
Some classes use a multiple-stimulus design, in which listeners hear multiple audio samples per question, whilst other classes have only a single stimulus per question. For the three classes with a single-stimulus design (topical emphasis, syntactic phrasing, and marked tonicity), possible answer choices for each question are available on Zenodo within adept_prompts.json.
For all tasks, we advise having at least 30 participants. All samples you use in your own evaluation should be the samples you generated with your prosody transfer model, not our natural samples.
Emotion
Both the female and male emotion tasks use a multi-stimulus design with 20 questions and 5 choices per question. The choices are the same sentence said in each of the four subcategories, plus neutral. Each of the 5 sentences is presented 4 times (once per subcategory, hence 20 questions in total), but each time the sentence is presented, the question and correct answer are different. The question asked is “In which of the following samples does the speaker sound most X?”, where X is one of: fearful, sad, joyful, or angry; the correct answer is the fearful, sad, joyful, or angry sample for that sentence respectively.
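The question structure described above can be sketched in Python. This is a minimal illustration rather than the tooling used for the ADEPT tests: the sentence IDs, the dictionary layout, and the function name `make_emotion_questions` are all hypothetical, and shuffling the choice order is an assumption about good test design, not something the task requires.

```python
import random

EMOTIONS = ["fearful", "sad", "joyful", "angry"]


def make_emotion_questions(sentence_ids, seed=0):
    """Build one question per (sentence, target emotion) pair.

    Each question offers the same sentence in every emotion plus neutral,
    so there are len(EMOTIONS) + 1 choices and neutral is never correct.
    """
    rng = random.Random(seed)
    questions = []
    for sentence_id in sentence_ids:
        choices = [(sentence_id, label) for label in EMOTIONS + ["neutral"]]
        for target in EMOTIONS:
            questions.append({
                "prompt": f"In which of the following samples does the speaker sound most {target}?",
                # Shuffle a copy so the correct sample is not always in the same slot.
                "choices": rng.sample(choices, k=len(choices)),
                "answer": (sentence_id, target),
            })
    return questions


# Hypothetical sentence IDs; substitute the IDs of your generated samples.
questions = make_emotion_questions(["s1", "s2", "s3", "s4", "s5"])
print(len(questions))  # 5 sentences x 4 emotions
```

The same pattern applies to the other multi-stimulus classes by swapping the subcategory list and the prompt wording.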
As an example, see our male test below.
You must have 3rd party cookies enabled for the embedded test to work. If you do not wish to enable 3rd party cookies, you can instead visit the link directly here.
Interpersonal attitude
Similarly, both the female and male interpersonal attitude tasks use a multi-stimulus design with 15 questions and 4 choices per question (one per subcategory, plus neutral). Listeners are asked “Which of the following samples sounds most like X?”, where X is one of: an authoritative statement, a contemptuous question, or a polite statement. For example, see our female test below.
Propositional attitude
Propositional attitude also has a multi-stimulus design. The male task has 15 questions with 4 choices each, whilst the female task has only 10 questions with 3 choices each because it has one fewer subcategory. Listeners are asked Which sample fits best into the context: X, where X is one of:
“(Incredulous) Really? ___?”
“(Surprised) Wow! ___!”
“(Sarcastically) Well ___.” [male only]
Topical emphasis
Both the male and female topical emphasis tasks are a single-stimulus design, each with 15 questions and 3 choices per question. Listeners are asked “Which question is best answered by the sample?”, and the choices imply emphasis at the beginning, middle, or end of the sentence.
For example, there are three questions asking about sentence ad00_0100 (one with the beginning sample, one with the middle sample, and one with the end sample). For all three of these questions, the three choices are:
a. WHO eats avocado with toast?
b. Vegans eat WHAT with toast?
c. Vegans eat avocado with WHAT?
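A single-stimulus task like this one can be assembled from a mapping of sentence IDs to answer choices. The sketch below is illustrative only: the dictionary is written by hand and makes no claim about the actual structure of adept_prompts.json, the function name is hypothetical, and it assumes the choices are listed in beginning, middle, end order so that the sample with emphasis at a given position is best matched by the question at the same index.

```python
POSITIONS = ["beginning", "middle", "end"]
PROMPT = "Which question is best answered by the sample?"


def make_topical_emphasis_questions(choices_by_sentence):
    """Build one question per (sentence, emphasis position) pair.

    choices_by_sentence maps a sentence ID to its three answer choices,
    assumed to be ordered beginning, middle, end.
    """
    questions = []
    for sentence_id, choices in choices_by_sentence.items():
        for index, position in enumerate(POSITIONS):
            questions.append({
                "stimulus": (sentence_id, position),  # the single sample played
                "prompt": PROMPT,
                "choices": list(choices),
                "answer": choices[index],
            })
    return questions


# Hand-written example using the choices listed above.
example = {
    "ad00_0100": [
        "WHO eats avocado with toast?",
        "Vegans eat WHAT with toast?",
        "Vegans eat avocado with WHAT?",
    ],
}
for question in make_topical_emphasis_questions(example):
    print(question["stimulus"], "->", question["answer"])
```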
Syntactic phrasing
Similarly, both the male and female syntactic phrasing tasks are a single-stimulus design, each with 10 questions and 2 choices per question: paraphrases that imply the two possible interpretations of the sentence. Listeners are asked “Which is a better paraphrase of the sample?”
Marked tonicity
Lastly, both the male and female marked tonicity tasks are also a single-stimulus design, with 10 questions and 2 interpretations of the sentence per question.
Step 4: Evaluating your results
With this set-up, each question has only one right answer, and the neutral sample is never correct. To evaluate a given subcategory, consider the following example: if you have 30 participants perform the female topical emphasis task, you’ll have a total of 30 × 15 questions = 450 responses. Of those responses, 1/3 (150 responses) will have had “beginning” as the right answer. If 120 of those 150 responses were answered correctly by your participants, then for female beginning topical emphasis you have a recognition accuracy of 120 / 150 = 80%.
For the classes without subcategories (syntactic phrasing and marked tonicity), you can only evaluate the recognition accuracy of the class as a whole.
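The worked example above can be expressed as a short calculation. This is a sketch under the assumption that each response is recorded as a pair of (correct subcategory, chosen subcategory); that record format and the function name `recognition_accuracy` are hypothetical.

```python
from collections import defaultdict


def recognition_accuracy(responses):
    """Return per-subcategory recognition accuracy.

    responses is an iterable of (correct_subcategory, chosen_subcategory)
    pairs, one pair per participant answer.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for correct, chosen in responses:
        totals[correct] += 1
        if chosen == correct:
            hits[correct] += 1
    return {subcategory: hits[subcategory] / totals[subcategory] for subcategory in totals}


# The example from the text: 150 "beginning" responses, 120 of them correct.
responses = [("beginning", "beginning")] * 120 + [("beginning", "middle")] * 30
accuracy = recognition_accuracy(responses)
print(f"{accuracy['beginning']:.0%}")
```

For a class without subcategories, passing the same label as the correct subcategory for every response gives the accuracy of the class as a whole.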
Step 5: Share your results!
Hooray, you did it! I’d love to know if our proposed system is getting used, so please do share your results with the wider research community.