Speaker identification using multi-modal i-vector approach for varying length speech in voice interactive systems
Institution:
1. Department of Electronics and Communication Engineering, Visvesvaraya National Institute of Technology, South Ambazari Road, Nagpur 40010, India; 2. Department of Electronics and Communication Engineering, National Institute of Technology Campus Warangal, Telangana 506004, India; 3. Department of Instrumentation and Applied Physics, Indian Institute of Science, C V Raman Ave, Bengaluru 560012, India
1. Department of EIE, Dr. Mahalingam College of Engineering and Technology, Pollachi, Coimbatore, India; 2. Department of EEE, Dr. Mahalingam College of Engineering and Technology, Pollachi, Coimbatore, India
1. Department of Mathematics, Shaanxi University of Science & Technology, Xi’an 710021, China; 2. Department of Mathematics, Shanghai Maritime University, Shanghai 201306, China; 3. Department of Mathematics, University of New Mexico, Gallup, NM 87301, USA; 4. Department of Mathematics, Obafemi Awolowo University, Ile Ife 220005, Nigeria
1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China; 2. National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, MG, Brazil; 3. Instituto de Telecomunicações, Portugal; 4. University of Fortaleza (UNIFOR), Fortaleza, CE, Brazil; 5. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
1. Department of Information Systems, Faculty of Commerce & Business Administration, Helwan University, Cairo, Egypt; 2. Department of Computer Science, Faculty of Computers and Informatics, Sharqiyah, Cairo, Egypt
Abstract: Advances in smart-device interfaces have led to voice interactive systems. A further step in this direction is to enable the devices to recognize the speaker. This is a challenging task because the interaction involves short-duration speech utterances. Traditional Gaussian mixture model (GMM) based systems achieve satisfactory speaker recognition results only when the speech is sufficiently long. The current state-of-the-art method uses an i-vector approach built on a GMM-based universal background model (GMM-UBM). It prepares an i-vector speaker model from a speaker’s enrollment data and uses it to recognize any new test speech. In this work, we propose a multi-model i-vector system for short speech lengths. We use the open database THUYG-20 to analyze and develop a short-speech speaker verification and identification system. By using an optimum set of mel-frequency cepstrum coefficient (MFCC) based features, we achieve an equal error rate (EER) of 3.21%, compared with the previous benchmark EER of 4.01% on the THUYG-20 database. Experiments are conducted for speech lengths as short as 0.25 s and the results are presented. The proposed method improves on the current i-vector approach for shorter speech lengths, with an improvement of around 28% even for 0.25 s speech samples. We also prepared our own database of 2500 English speech recordings, consisting of actual short speech commands used in voice interactive systems, and tested the proposed approach on it.
Keywords: Gaussian mixture models; i-Vectors; Mel-frequency cepstrum coefficients; Speaker verification; Speaker identification; Short speech; Voice interactive systems
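The equal error rate quoted in the abstract is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a figure can be estimated from verification scores. It is not the authors' evaluation code: the function names, the threshold sweep over observed scores, and the toy score lists are all assumptions made for illustration. The second helper only restates the arithmetic on the two EER values reported above.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Hypothetical helper for illustration, not the paper's scoring code.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: genuine-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - proposed) / baseline


# Toy scores (assumed data, not from THUYG-20).
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, impostors))

# EER values reported in the abstract: 4.01% (benchmark) and 3.21% (proposed).
print(round(relative_reduction(4.01, 3.21), 3))
```

With the reported figures, the drop from 4.01% to 3.21% corresponds to a relative EER reduction of roughly 20%. Real evaluations usually interpolate between thresholds; the sweep above simply picks the nearest observed one.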
This article is indexed in ScienceDirect and other databases.