Building a controllable expressive speech synthesis system with multiple emotion strengths |
| |
Affiliation: | 1. Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India; 2. Department of Electronics and Electrical Engineering, Indian Institute of Technology Dharwad, Dharwad 580011, India |
| |
Abstract: | Emotion is considered an essential element of human-computer interaction. In expressive speech synthesis, it is important to generate emotional speech that reflects subtle and complex emotional states. However, there has been limited research on how to synthesize emotional speech with different levels of emotion strength under intuitive control, which is difficult to model effectively. In this paper, we explore an expressive speech synthesis model that can produce speech with multiple emotion strengths. Unlike previous studies that encoded emotions as discrete codes, we propose an embedding vector to continuously control the emotion strength, a data-driven method for synthesizing speech with fine control over emotion. Compared with models using a retraining technique or a one-hot vector, our proposed model using an embedding vector can explicitly learn the high-level emotion strength from low-level acoustic features. As a result, we can control the emotion strength of synthetic speech in a relatively predictable and globally consistent way. Objective and subjective evaluations show that our proposed model achieves state-of-the-art performance in terms of model flexibility and controllability. |
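The core idea in the abstract, replacing discrete (one-hot) emotion codes with an embedding vector scaled by a continuous strength value, can be illustrated with a minimal sketch. The paper does not publish code, so the names, dimensions, and the concatenation step below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hypothetical sketch: condition a TTS encoder's output on a continuous
# emotion-strength embedding. All names and dimensions are placeholders.

EMB_DIM = 8

# One learned embedding per emotion category (random stand-ins here;
# in a real system these would be trained jointly with the TTS model).
rng = np.random.default_rng(0)
emotion_table = {
    "happy": rng.standard_normal(EMB_DIM),
    "sad": rng.standard_normal(EMB_DIM),
}

def condition(encoder_out: np.ndarray, emotion: str, strength: float) -> np.ndarray:
    """Scale the emotion embedding by a continuous strength in [0, 1]
    and concatenate one copy onto every encoder frame."""
    vec = strength * emotion_table[emotion]            # continuous, not one-hot
    tiled = np.tile(vec, (encoder_out.shape[0], 1))    # one copy per frame
    return np.concatenate([encoder_out, tiled], axis=1)

frames = np.zeros((5, 16))              # 5 frames of 16-dim encoder output
out = condition(frames, "happy", 0.3)   # mild happiness
print(out.shape)                        # (5, 24)
```

Because `strength` is a scalar multiplier rather than a category switch, the same emotion can be rendered anywhere along a continuum (e.g. 0.1 vs. 0.9), which is the intuitive, globally consistent control the abstract claims over one-hot conditioning.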
| |
Keywords: | Expressive speech synthesis; Emotion strength; Text-to-speech; Emotion control; Neural networks |
Indexed in ScienceDirect and other databases.
|