Building a controllable expressive speech synthesis system with multiple emotion strengths
Affiliation: 1. Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India; 2. Department of Electronics and Electrical Engineering, Indian Institute of Technology Dharwad, Dharwad 580011, India
Abstract: Emotion is considered an essential element in human-computer interaction. In expressive speech synthesis, it is important to generate emotional speech that reflects subtle and complex emotional states. However, there has been limited research on how to synthesize emotional speech with different levels of emotion strength under intuitive control, since such strengths are difficult to model effectively. In this paper, we explore an expressive speech synthesis model that can produce speech with multiple emotion strengths. Unlike previous studies that encoded emotions as discrete codes, we propose an embedding vector that controls the emotion strength continuously, a data-driven method for synthesizing speech with fine-grained control over emotion. Compared with models that rely on retraining or a one-hot vector, our proposed model with an embedding vector can explicitly learn the high-level emotion strength from low-level acoustic features. As a result, we can control the emotion strength of synthetic speech in a relatively predictable and globally consistent way. Objective and subjective evaluations show that our proposed model achieves state-of-the-art performance in terms of model flexibility and controllability.
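To make the core idea concrete, below is a minimal sketch in PyTorch of how a continuous emotion-strength embedding might condition an acoustic model, in contrast to a discrete one-hot code. All names and design choices here (EmotionConditioner, TinyAcousticModel, a strength scalar in [0, 1] that scales a learned per-emotion embedding, the toy GRU decoder and its dimensions) are illustrative assumptions, not the authors' implementation.

    # Sketch only: continuous emotion-strength conditioning for a toy TTS
    # acoustic model. Module and parameter names are hypothetical.
    import torch
    import torch.nn as nn

    class EmotionConditioner(nn.Module):
        """Maps a discrete emotion category plus a continuous strength
        scalar to a single conditioning vector (vs. a fixed one-hot code)."""
        def __init__(self, num_emotions: int, emb_dim: int):
            super().__init__()
            # One learned embedding per emotion category (data-driven).
            self.emotion_table = nn.Embedding(num_emotions, emb_dim)

        def forward(self, emotion_id: torch.Tensor,
                    strength: torch.Tensor) -> torch.Tensor:
            # strength in [0, 1]: 0 -> near-neutral, 1 -> full strength.
            emb = self.emotion_table(emotion_id)         # (B, emb_dim)
            return emb * strength.unsqueeze(-1)          # scaled embedding

    class TinyAcousticModel(nn.Module):
        """Toy stand-in for a full sequence-to-sequence TTS model:
        text embeddings + broadcast emotion vector -> mel frames."""
        def __init__(self, vocab_size=64, text_dim=128, emo_dim=16, n_mels=80):
            super().__init__()
            self.text_emb = nn.Embedding(vocab_size, text_dim)
            self.conditioner = EmotionConditioner(num_emotions=4, emb_dim=emo_dim)
            self.rnn = nn.GRU(text_dim + emo_dim, 256, batch_first=True)
            self.proj = nn.Linear(256, n_mels)

        def forward(self, tokens, emotion_id, strength):
            x = self.text_emb(tokens)                    # (B, T, text_dim)
            e = self.conditioner(emotion_id, strength)   # (B, emo_dim)
            e = e.unsqueeze(1).expand(-1, x.size(1), -1) # broadcast over time
            h, _ = self.rnn(torch.cat([x, e], dim=-1))
            return self.proj(h)                          # (B, T, n_mels)

    if __name__ == "__main__":
        model = TinyAcousticModel()
        tokens = torch.randint(0, 64, (1, 20))           # dummy phoneme ids
        # Same text and emotion category at half vs. full strength:
        for s in (0.5, 1.0):
            mel = model(tokens, torch.tensor([1]), torch.tensor([s]))
            print(f"strength={s}: mel shape {tuple(mel.shape)}")

Because the conditioning vector scales continuously with the strength scalar, nearby strength values produce nearby conditioning inputs, which is one plausible way to obtain the relatively predictable, globally consistent control the abstract describes; a one-hot code, by contrast, can only switch between a few fixed points.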
Keywords: Expressive speech synthesis; Emotion strength; Text-to-speech; Emotion control; Neural networks