These are the speech samples we used for the user study.
[Regarding gender mismatch] Our speech samples preserve the speaker's gender, which can be inferred from the generated speech descriptions. However, the generated voice differs from that of the original speaker in the dialogue, which we acknowledge as a limitation in the Conclusion section.
In contrast, other TTS models, such as Parler-TTS, StyleTTS2, and HierSpeech++, do not explicitly incorporate gender information. Instead, they either use a default gender setting or rely on the characteristics of their officially pretrained models.
Additionally, we instructed evaluators not to consider the gender of the voice during the user study, as stated in the survey.
| Transcript | Ours | Parler-TTS (w.o. description) | StyleTTS2 | HierSpeech++ |