Similar Articles

20 similar articles found (search time: 15 ms).
1.
The perception of the distinction between /r/ and /l/ by native speakers of American English and of Japanese was studied using natural and synthetic speech. The American subjects were all nearly perfect at recognizing the natural speech sounds, whereas there was substantial variation among the Japanese subjects in their accuracy of recognizing /r/ and /l/ except in syllable-final position. A logit model, which additively combined the acoustic information conveyed by F1-transition duration and by F3-onset frequency, provided a good fit to the perception of synthetic /r/ and /l/ by the American subjects. There was substantial variation among the Japanese subjects in whether the F1 and F3 cues had a significant effect on their classifications of the synthetic speech. This variation was related to variation in accuracy of recognizing natural /r/ and /l/, such that greater use of both the F1 cue and the F3 cue in classifying the synthetic speech sounds was positively related to accuracy in recognizing the natural sounds. However, multiple regression showed that use of the F1 cue did not account for significant variance in natural speech performance beyond that accounted for by the F3 cue, indicating that the F3 cue is more important than the F1 cue for Japanese speakers learning English. The relation between performance on natural and synthetic speech also provides external validation of the logit model by showing that it predicts performance outside of the domain of data to which it was fit.
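An additive logit model of this kind combines the two acoustic cues linearly on the log-odds scale. A minimal sketch follows; the coefficient values, sign conventions, and cue ranges are hypothetical illustrations, not the fitted parameters from the study:

```python
import math

def p_l_response(f1_transition_ms, f3_onset_hz,
                 b0=-14.0, b1=-0.02, b2=0.006):
    """Additive logit model: the log-odds of an /l/ response are a linear
    combination of the two acoustic cues. Higher F3 onsets cue /l/, lower
    F3 onsets cue /r/; coefficients here are illustrative stand-ins."""
    logit = b0 + b1 * f1_transition_ms + b2 * f3_onset_hz
    return 1.0 / (1.0 + math.exp(-logit))

# A high F3 onset (~2800 Hz) yields mostly /l/ responses,
# a low F3 onset (~1600 Hz) yields mostly /r/ responses.
p_high_f3 = p_l_response(f1_transition_ms=50, f3_onset_hz=2800)
p_low_f3 = p_l_response(f1_transition_ms=50, f3_onset_hz=1600)
```

Because the cues combine additively in log-odds, each cue shifts the identification function without interacting with the other, which is what makes the model's fit testable cue by cue.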

3.
The performance of Spanish-English bilinguals in two perception tasks, using a synthetic speech continuum varying in voice onset time, was compared with the performance of Spanish and English monolinguals. Voice onset time in speech production was also compared between these groups. The bilinguals' perception results differed from those of both monolingual groups. The bilinguals' production results in their two languages conformed to the results obtained from each monolingual group. The perceptual results are interpreted in terms of differences in the use of available acoustic cues by bilingual and monolingual listeners of English and Spanish.

4.
Emotion is considered to be an essential element in the performance of human-computer interactions. In expressive speech synthesis, it is important to generate emotional speech that reflects subtle and complex emotional states. However, there has been limited research on how to synthesize emotional speech with different levels of emotion strength under intuitive control, which is difficult to model effectively. In this paper, we explore an expressive speech synthesis model that can be used to produce speech with multiple emotion strengths. Unlike previous studies that encoded emotions into discrete codes, we propose an embedding vector to continuously control the emotion strength, which is a data-driven method to synthesize speech with fine control over the emotions. Compared with models using the retraining technique or a one-hot vector, our proposed model using an embedding vector can explicitly learn the high-level emotion strength from the low-level acoustic features. As a result, we can control the emotion strength of synthetic speech in a relatively predictable and globally consistent way. The objective and subjective evaluation tests show that our proposed model achieves state-of-the-art performance in terms of model flexibility and controllability.

5.
6.
Candidate brain regions constituting a neural network for preattentive phonetic perception were identified with fMRI and multivariate multiple regression of imaging data. Stimuli contrasted along speech/nonspeech, acoustic, or phonetic complexity (three levels each) and natural/synthetic dimensions. Seven distributed brain regions' activity correlated with speech and speech complexity dimensions, including five left-sided foci [posterior superior temporal gyrus (STG), angular gyrus, ventral occipitotemporal cortex, inferior/posterior supramarginal gyrus, and middle frontal gyrus (MFG)] and two right-sided foci (posterior STG and anterior insula). Only the left MFG discriminated natural and synthetic speech. The data also supported a parallel rather than serial model of auditory speech and nonspeech perception.

7.
We trained unilingual adult Canadian francophone listeners to identify the English voiceless and voiced linguadental ("th") fricatives, /θ/ and /ð/, using synthetic exemplars of each phoneme. Identification training with feedback improved listeners' abilities to identify both natural and synthetic tokens. These results show that training with appropriately selected prototype stimuli can produce a linguistically meaningful improvement in a listener's ability to identify new, nonnative speech sounds, both natural and synthetic. However, it is not yet clear whether such training with a single prototype can improve performance as effectively as the fading technique used by Jamieson and Morosan (1986), since prototype training produced only a moderate improvement in listeners' identifications of synthetic stimuli containing brief frication. Differences between the techniques may reflect the need for listeners to experience the appropriate types of intraphonemic and interphonemic variability during training. Such variability may help to define the category prototype by desensitizing the subjects to differences between exemplars of the same category and by sharpening sensitivity to differences between categories.

8.
Consistent Truth     
Hartley Slater 《Ratio》2014,27(3):247-261
Modern logic has generated a lot of problems for itself through inattention to natural forms of speech. In particular, it has had difficulties with a large group of 'logical paradoxes' through its preoccupation with the Predicate Calculus and related structures, to the exclusion of other formal structures that represent natural language more fully and thereby escape these paradoxes. In natural speech, the unrecognized forms involved are principally individual referring terms with a non-specific or fictional reference. For, under the influence of the Logical Empiricists, the focus in formal logic has been on individual terms with a specific, factual reference. But a similar point arises with certain predicates, and indeed even the proper concept of a predicate has been lost, producing the most notorious of the standard puzzles: Russell's Paradox. The paper inevitably makes reference to contemporary logical symbolism in places, but its purpose is to show that the only way forward is to use a symbolism that is abbreviatory only, i.e., one that maps directly onto natural speech.

9.
A patient with a rather pure word deafness showed extreme suppression of right-ear signals under dichotic conditions, suggesting that speech signals were being processed in the right hemisphere. Systematic errors in the identification and discrimination of natural and synthetic stop consonants further indicated that speech sounds were not being processed in the normal manner. Auditory comprehension improved considerably, however, when the range of speech stimuli was limited by contextual constraints. Possible implications for the mechanism of word deafness are discussed.

10.
In five experiments with synthetic and natural speech syllables, a rating task was used to study the effects of differences in vowels, consonants, and segment order on judged syllable similarity. The results of Experiments I-IV support neither a purely phonemic model of speech representation, in which vowel, consonant, and order are represented independently, nor a purely syllabic model, in which the three factors are integrated. Instead, the data indicate that subjects compare representations in which adjacent vowel and consonant are independent of one another but are not independent of their positions in the syllable. Experiment V provided no support for the hypothesis that this position-sensitive coding is due to acoustic differences in formant transitions.

11.
Phonological fusion occurs when the phonemes of two different speech stimuli are combined into a new percept that is longer and linguistically more complex than either of the two inputs. For example, when PAY is presented to one ear and LAY to the other, the subject often perceives PLAY. The present article is an investigation of the conditions necessary and sufficient for fusion to occur. The rules governing phonological fusion appear to be the same for synthetic and natural speech, but synthetic stimuli fuse more readily. Fusion occurs considerably more often in dichotic stimulus presentation than in binaural presentation. The phenomenon is remarkably tolerant of differences in relative onset time between the to-be-fused stimuli and of relative differences in fundamental frequency, intensity, and vocal tract configuration. Although phonological fusion is insensitive to such nonlinguistic stimulus parameters, it is sensitive to linguistic variations at the semantic, phonemic, and acoustic levels.

12.
Certain attributes of a syllable-final liquid can influence the perceived place of articulation of a following stop consonant. To demonstrate this perceptual context effect, the CV portions of natural tokens of [al-da], [al-ga], [ar-da], [ar-ga] were excised and replaced with closely matched synthetic stimuli drawn from a [da]-[ga] continuum. The resulting hybrid disyllables were then presented to listeners who labeled both liquids and stops. The natural CV portions had two different effects on perception of the synthetic CVs. First, there was an effect of liquid category: Listeners perceived “g” more often in the context of [al] than in that of [ar]. Second, there was an effect due to tokens of [al] and [ar] having been produced before [da] or [ga]: More “g” percepts occurred when stops followed liquids that had been produced before [g]. A hypothesis that each of these perceptual effects finds a parallel in speech production is supported by spectrograms of the original utterances. Here, it seems, is another instance in which findings in speech perception reflect compensation for coarticulation during speech production.

13.
Speech perception can be viewed in terms of the listener’s integration of two sources of information: the acoustic features transduced by the auditory receptor system and the context of the linguistic message. The present research asked how these sources were evaluated and integrated in the identification of synthetic speech. A speech continuum between the glide-vowel syllables /ri/ and /li/ was generated by varying the onset frequency of the third formant. Each sound along the continuum was placed in a consonant-cluster vowel syllable after an initial consonant /p/, /t/, /s/, and /v/. In English, both /r/ and /l/ are phonologically admissible following /p/ but are not admissible following /v/. Only /l/ is admissible following /s/ and only /r/ is admissible following /t/. A third experiment used synthetic consonant-cluster vowel syllables in which the first consonant varied between /b/ and /d/ and the second consonant varied between /l/ and /r/. Identification of synthetic speech varying in both acoustic featural information and phonological context allowed quantitative tests of various models of how these two sources of information are evaluated and integrated in speech perception.
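One family of models tested in designs like this integrates independent support from each source multiplicatively, in the style of a fuzzy-logical model of perception. A minimal sketch with made-up support values (the abstract does not state which model won, so this is only an illustration of the model class):

```python
def integrate_identify(acoustic_support_r, context_support_r):
    """Multiplicative integration of two independent sources of support
    for an /r/ response. Each argument is the degree of support, in
    [0, 1], that a source gives to /r/; support for /l/ is taken as the
    complement. The final response probability follows a relative
    goodness rule. All values here are illustrative."""
    r = acoustic_support_r * context_support_r
    l = (1.0 - acoustic_support_r) * (1.0 - context_support_r)
    return r / (r + l)

# Ambiguous acoustics (0.5) after /t/, where only /r/ is admissible,
# so the phonological context strongly supports /r/.
p_after_t = integrate_identify(acoustic_support_r=0.5, context_support_r=0.95)

# The same ambiguous acoustics after /s/, where only /l/ is admissible.
p_after_s = integrate_identify(acoustic_support_r=0.5, context_support_r=0.05)
```

The signature prediction of multiplicative integration is that context has its largest effect exactly where the acoustic evidence is ambiguous, and progressively less effect toward the continuum endpoints.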

14.
Identification and discrimination of two-formant [bae-dae-gae] and [pae-tae-kae] synthetic speech stimuli and discrimination of corresponding isolated second formant transitions (chirps) were performed by six subjects. Stimuli were presented at several intensity levels such that the intensity of the F2 transition was equated between speech and nonspeech stimuli, or the overall intensity of the stimulus was equated. At higher intensity (92 dB), b-d-g and p-t-k identification and between-category discrimination performance declined and bilabial-alveolar phonetic boundaries shifted in location on the continuum towards the F2 steady-state frequency. Between-category discrimination improved from performance at 92 dB when 92-dB speech stimuli were simultaneously masked by 60-dB speech noise; alveolar-velar boundaries shifted to a higher frequency location in the 92-dB-plus-noise condition. Chirps were discriminated categorically when presented at 58 dB, but discrimination peaks declined at higher intensities. Perceptual performance for chirps and p-t-k stimuli was very similar, and slightly inferior to performance for b-d-g stimuli, where simultaneous masking by F1 resulted in a lower effective intensity of F2. The results were related to a suggested model involving pitch comparison and transitional quality perceptual strategies.

15.
This three-part study demonstrates that perceptual order can influence the integration of acoustic speech cues. In Experiment 1, the subjects labeled the [s] and [ʃ] in natural FV and VF syllables in which the frication was replaced with synthetic stimuli. Responses to these "hybrid" stimuli were influenced by cues in the vocalic segment as well as by the synthetic frication. However, the influence of the preceding vocalic cues was considerably weaker than was that of the following vocalic cues. Experiment 2 examined the acoustic bases for this asymmetry and consisted of analyses revealing that FV and VF syllables are similar in terms of the acoustic structures thought to underlie the vocalic context effects. Experiment 3 examined the perceptual bases for the asymmetry. A subset of the hybrid FV and VF stimuli were presented in reverse, such that the acoustic and perceptual bases for the asymmetry were pitted against each other in the listening task. The perceptual bases (i.e., the perceived order of the frication and vocalic cues) proved to be the determining factor. Current auditory processing models, such as backward recognition masking, preperceptual auditory storage, or models based on linguistic factors, do not adequately account for the observed asymmetries.

16.
Children often talk themselves through their activities, producing private speech that is internalized to form inner speech. This study assessed the effect of articulatory suppression (which suppresses private and inner speech) on Tower of London performance in 7- to 10-year-olds, relative to performance in a control condition with a nonverbal secondary task. Experiment 1 showed no effect of articulatory suppression on performance with the standard Tower of London procedure; we interpret this in terms of a lack of planning in our sample. Experiment 2 used a modified procedure in which participants were forced to plan ahead. Performance in the articulatory suppression condition was lower than that in the control condition, consistent with a role for self-directed (private and inner) speech in planning. On problems of intermediate difficulty, participants producing more private speech in the control condition showed greater susceptibility to interference from articulatory suppression than their peers, suggesting that articulatory suppression interfered with performance by blocking self-directed (private and inner) speech.

18.
Twenty-five Chinese-speaking dyslexic children with phonological awareness deficits, 25 typically developing children, and 25 adults were tested to examine whether children with phonological dyslexia exhibit a speech perception deficit. The speech perception task used a categorical perception paradigm, requiring participants to identify stimuli from synthetic or natural speech-category continua. The results showed that the children with phonological dyslexia exhibited categorical perception deficits for both synthetic and natural stimuli, identifying within-category stimuli inconsistently. Individual analyses showed that most of the children with phonological dyslexia had shallower identification-function slopes, and regression analyses indicated that speech perception skill influences the development of reading ability through the mediation of phonological awareness.

19.
The performance of 14 poor readers on an audiovisual speech perception task was compared with that of 14 normal subjects matched on chronological age (CA) and 14 subjects matched on reading age (RA). The task consisted of identifying synthetic speech varying in place of articulation on an acoustic 9-point continuum between /ba/ and /da/ (Massaro & Cohen, 1983). The acoustic speech events were factorially combined with the visual articulation of /ba/, /da/, or none. In addition, the visual-only articulation of /ba/ or /da/ was presented. The results showed (1) that poor readers were less categorical than the CA and RA groups in identifying the auditory speech events and (2) that they were worse at speechreading. This convergence between the deficits clearly suggests that the auditory speech processing difficulty of poor readers is speech specific and relates to the processing of phonological information.

20.
The experiments reported examine the effects of two highly related variables, word frequency and age of acquisition, on short-term memory span. Short-term memory span and speech rate were measured for sets of words that independently manipulated frequency and age of acquisition. It was found that frequency had a considerable effect on short-term memory span, which was not mediated by speech rate differences, although frequency did affect speech rate in one experiment. For age of acquisition, this situation was reversed; there was a small but significant effect of age of acquisition on speech rate, but no effect on memory span. This occurred despite results confirming that the stimuli used in the experiments produce an effect of age of acquisition on word naming. The results are discussed in terms of a two-component view of performance on short-term memory tasks.


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号