Similar Literature
Found 20 similar documents (search time: 31 ms)
1.
Synthetic continua of two minimal pairs, BAIT-DATE and DATE-GATE, closely modeled on natural utterances by a female speaker, were presented to a group of 16 listeners for identification in full-cue and reduced-cue conditions. Grouped results showed that categorization curves for full- and reduced-cue conditions differed significantly in both contrasts. However, an averaging of results obscures marked variability in labeling behavior. Some listeners showed large changes in categorization between the full- and reduced-cue conditions, whereas others showed relatively small or no changes. In a follow-up study, perception of the BAIT-DATE contrast was compared with the perception of a highly stylized BA-DA continuum. A smaller degree of intersubject and between-condition variability was found for these less complex synthetic stimuli. The amount of variability found in the labeling of speech contrasts may be dependent on cue salience, which will be determined by the speech pattern complexity of the stimuli and by the vowel environment.

2.
Most theories of categorization emphasize how continuous perceptual information is mapped to categories. However, equally important are the informational assumptions of a model, the type of information subserving this mapping. This is crucial in speech perception, where the signal is variable and context dependent. This study assessed the informational assumptions of several models of speech categorization, in particular, the number of cues that are the basis of categorization and whether these cues represent the input veridically or have undergone compensation. We collected a corpus of 2,880 fricative productions (Jongman, Wayland, & Wong, 2000) spanning many talker and vowel contexts and measured 24 cues for each. A subset was also presented to listeners in an 8AFC phoneme categorization task. We then trained a common classification model based on logistic regression to categorize the fricative from the cue values and manipulated the information in the training set to contrast (a) models based on a small number of invariant cues, (b) models using all cues without compensation, and (c) models in which cues underwent compensation for contextual factors. Compensation was modeled by computing cues relative to expectations (C-CuRE), a new approach to compensation that preserves fine-grained detail in the signal. Only the compensation model achieved accuracy similar to that of listeners and showed the same effects of context. Thus, even simple categorization metrics can overcome the variability in speech when sufficient information is available and compensation schemes like C-CuRE are employed.
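The compensation idea this entry describes, recoding each cue relative to a contextual expectation (C-CuRE) while keeping the fine-grained residual, can be sketched on toy data. All cue values below are invented for illustration, and a simple threshold classifier stands in for the paper's logistic regression; the per-talker mean serves as the "expectation".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: one spectral cue (Hz) for two fricative categories, with a
# talker-dependent shift. All values are invented for illustration.
n = 200
talker = rng.integers(0, 2, size=n)       # two talkers
category = rng.integers(0, 2, size=n)     # e.g., /s/ vs. /sh/
talker_offset = np.where(talker == 0, 0.0, 800.0)
cue = 4000.0 + 1500.0 * category + talker_offset + rng.normal(0.0, 300.0, n)

# C-CuRE-style compensation: recode each cue value relative to the
# expectation given the context (here, the talker's mean), keeping the
# fine-grained residual rather than normalizing it away.
expected = np.array([cue[talker == t].mean() for t in (0, 1)])[talker]
residual = cue - expected

def accuracy(x, y):
    """Threshold classifier at the midpoint of the two class means."""
    thresh = (x[y == 0].mean() + x[y == 1].mean()) / 2.0
    return float(((x > thresh).astype(int) == y).mean())

print("raw cue:", accuracy(cue, category))
print("compensated:", accuracy(residual, category))
```

With the talker shift removed from the cue, the same simple classifier separates the categories more cleanly, which is the pattern the corpus study reports at much larger scale.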

3.
During speech perception, listeners make judgments about the phonological category of sounds by taking advantage of multiple acoustic cues for each phonological contrast. Perceptual experiments have shown that listeners weight these cues differently. How do listeners weight and combine acoustic cues to arrive at an overall estimate of the category for a speech sound? Here, we present several simulations using a mixture of Gaussians models that learn cue weights and combine cues on the basis of their distributional statistics. We show that a cue-weighting metric in which cues receive weight as a function of their reliability at distinguishing phonological categories provides a good fit to the perceptual data obtained from human listeners, but only when these weights emerge through the dynamics of learning. These results suggest that cue weights can be readily extracted from the speech signal through unsupervised learning processes.
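The reliability-based weighting scheme described above can be sketched with a d'-style separability measure computed from per-category Gaussians. The parameter values below are hypothetical, not the fitted mixture-of-Gaussians values from the simulations, and the one-shot computation ignores the learning dynamics the paper argues are essential.

```python
import numpy as np

# One Gaussian per category per cue; a cue's weight grows with how reliably
# it separates the categories. Parameter values are hypothetical.
def separability(mu_a, sd_a, mu_b, sd_b):
    # d'-style reliability: mean separation over pooled spread.
    return abs(mu_a - mu_b) / np.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)

# Two cues to a /b/-/p/ style voicing contrast: VOT (ms) separates the
# categories well; f0 at vowel onset (Hz) only weakly.
cues = {
    "VOT": dict(mu_a=10.0, sd_a=8.0, mu_b=50.0, sd_b=10.0),
    "f0": dict(mu_a=110.0, sd_a=20.0, mu_b=120.0, sd_b=20.0),
}

raw = {name: separability(**params) for name, params in cues.items()}
total = sum(raw.values())
weights = {name: value / total for name, value in raw.items()}
print(weights)  # VOT receives most of the weight
```

The normalized weights reproduce the qualitative finding that the more reliable cue dominates categorization.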

4.
Speech perception requires listeners to integrate multiple cues that each contribute to judgments about a phonetic category. Classic studies of trading relations assessed the weights attached to each cue but did not explore the time course of cue integration. Here, we provide the first direct evidence that asynchronous cues to voicing (/b/ vs. /p/) and manner (/b/ vs. /w/) contrasts become available to the listener at different times during spoken word recognition. Using the visual world paradigm, we show that the probability of eye movements to pictures of target and of competitor objects diverge at different points in time after the onset of the target word. These points of divergence correspond to the availability of early (voice onset time or formant transition slope) and late (vowel length) cues to voicing and manner contrasts. These results support a model of cue integration in which phonetic cues are used for lexical access as soon as they are available.

5.
The reported research investigates how listeners recognize coarticulated phonemes. First, 2 data sets from experiments on the recognition of coarticulated phonemes published by D. H. Whalen (1989) are reanalyzed. The analyses indicate that listeners used categorization strategies involving a hierarchical dependency. Two new experiments are reported investigating the production and perception of fricative-vowel syllables. On the basis of measurements of acoustic cues on a large set of natural utterances, it was predicted that listeners would use categorization strategies involving a dependency of the fricative categorization on the perceived vowel. The predictions were tested in a perception experiment using a 2-dimensional synthetic fricative-vowel continuum. Model analyses of the results pooled across listeners confirmed the predictions. Individual analyses revealed some variability in the categorization dependencies used by different participants.

6.
Lim, S. J., & Holt, L. L. (2011). Cognitive Science, 35(7), 1390–1405.
Although speech categories are defined by multiple acoustic dimensions, some are perceptually weighted more than others, and there are residual effects of native-language weightings in non-native speech perception. Recent research on nonlinguistic sound category learning suggests that the distribution characteristics of experienced sounds influence perceptual cue weights: Increasing variability across a dimension leads listeners to rely upon it less in subsequent category learning (Holt & Lotto, 2006). The present experiment investigated the implications of this among native Japanese listeners learning English /r/-/l/ categories. Training was accomplished using a videogame paradigm that emphasizes associations among sound categories, visual information, and players' responses to videogame characters rather than overt categorization or explicit feedback. Subjects who played the game for 2.5 h across 5 days exhibited improvements in /r/-/l/ perception on par with 2-4 weeks of explicit categorization training in previous research and exhibited a shift toward more native-like perceptual cue weights.

7.
Speech unfolds over time, and the cues for even a single phoneme are rarely available simultaneously. Consequently, to recognize a single phoneme, listeners must integrate material over several hundred milliseconds. Prior work contrasts two accounts: (a) a memory buffer account in which listeners accumulate auditory information in memory and only access higher level representations (i.e., lexical representations) when sufficient information has arrived; and (b) an immediate integration scheme in which lexical representations can be partially activated on the basis of early cues and then updated when more information arises. These studies have uniformly shown evidence for immediate integration for a variety of phonetic distinctions. We attempted to extend this to fricatives, a class of speech sounds that requires not only temporal integration of asynchronous cues (the frication, followed by the formant transitions 150–350 ms later), but also integration across different frequency bands and compensation for contextual factors like coarticulation. Eye movements in the visual world paradigm showed clear evidence for a memory buffer. Results were replicated in five experiments, ruling out methodological factors and tying the release of the buffer to the onset of the vowel. These findings support a general auditory account for speech by suggesting that the acoustic nature of particular speech sounds may have large effects on how they are processed. They also have major implications for theories of auditory and speech perception by raising the possibility of an encapsulated memory buffer in early auditory processing.

8.
Listeners are able to accurately recognize speech despite variation in acoustic cues across contexts, such as different speaking rates. Previous work has suggested that listeners use rate information (indicated by vowel length; VL) to modify their use of context-dependent acoustic cues, like voice-onset time (VOT), a primary cue to voicing. We present several experiments and simulations that offer an alternative explanation: that listeners treat VL as a phonetic cue rather than as an indicator of speaking rate, and that they rely on general cue-integration principles to combine information from VOT and VL. We demonstrate that listeners use the two cues independently, that VL is used in both naturally produced and synthetic speech, and that the effects of stimulus naturalness can be explained by a cue-integration model. Together, these results suggest that listeners do not interpret VOT relative to rate information provided by VL and that the effects of speaking rate can be explained by more general cue-integration principles.
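The contrast this entry draws, VL acting as an independent phonetic cue combined with VOT under general cue-integration principles rather than a rate signal that re-scales the interpretation of VOT, can be sketched as an additive log-odds model. All weights and boundary values below are illustrative assumptions, not the paper's fitted parameters.

```python
import math

# Each cue contributes an independent log-odds term toward the voiced/
# voiceless decision. The two cues combine without an interaction term:
# vowel length (VL) adds the same log-odds shift at every VOT, unlike a
# rate-normalization account in which VL re-scales the mapping from VOT
# to voicing. Weights and boundaries are illustrative, not fitted.
def p_voiceless(vot_ms, vowel_ms,
                w_vot=0.3, vot_boundary=25.0,
                w_vl=-0.02, vl_boundary=150.0):
    # Longer VOT favors voiceless; a longer vowel (slightly) favors voiced.
    z = w_vot * (vot_ms - vot_boundary) + w_vl * (vowel_ms - vl_boundary)
    return 1.0 / (1.0 + math.exp(-z))

# The same VOT token is judged less voiceless when the vowel is longer.
print(p_voiceless(40, 150), p_voiceless(40, 250))
```

Because the two terms are separable, the model predicts parallel shifts of the identification function with VL, which is the independence signature the experiments test for.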

9.
We examined the effect of perceptual training on a well-established hemispheric asymmetry in speech processing. Eighteen listeners were trained to use a within-category difference in voice onset time (VOT) to cue talker identity. Successful learners (n=8) showed faster response times for stimuli presented only to the left ear than for those presented only to the right. The development of a left-ear/right-hemisphere advantage for processing a prototypically phonetic cue supports a model of speech perception in which lateralization is driven by functional demands (talker identification vs. phonetic categorization) rather than by acoustic stimulus properties alone.

10.
This article is concerned with the question of how listeners recognize coarticulated phonemes. The problem is approached from a pattern classification perspective. First, the potential acoustical effects of coarticulation are defined in terms of the patterns that form the input to a classifier. Next, a categorization model called HICAT is introduced that incorporates hierarchical dependencies to optimally deal with this input. The model allows the position, orientation, and steepness of one phoneme boundary to depend on the perceived value of a neighboring phoneme. It is argued that, if listeners do behave like statistical pattern recognizers, they may use the categorization strategies incorporated in the model. The HICAT model is compared with existing categorization models, among which are the fuzzy-logical model of perception and Nearey's diphone-biased secondary-cue model. Finally, a method is presented by which categorization strategies that are likely to be used by listeners can be predicted from distributions of acoustical cues as they occur in natural speech.
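The hierarchical dependency HICAT formalizes, in which one phoneme's boundary depends on the perceived value of its neighbor, can be sketched as a two-stage classifier. The boundary and formant values below are hypothetical, chosen only to show the dependency pattern, and the hard thresholds stand in for the model's graded boundaries.

```python
# Two-stage categorization with a hierarchical dependency: the fricative
# boundary is conditioned on which vowel the listener perceived.
# All Hz values are hypothetical.
def categorize(fric_peak_hz, f2_hz):
    # Stage 1: categorize the vowel from F2 (higher F2 -> front vowel /i/).
    vowel = "i" if f2_hz > 1800 else "u"
    # Stage 2: the /s/-/sh/ spectral boundary shifts with the perceived
    # vowel, compensating for coarticulatory lowering before rounded /u/.
    boundary = 4000 if vowel == "i" else 3400
    fricative = "s" if fric_peak_hz > boundary else "sh"
    return fricative + vowel

# The same fricative token flips category depending on the vowel context.
print(categorize(3700, 2200), categorize(3700, 1200))
```

An ambiguous fricative is heard as /sh/ before /i/ but as /s/ before /u/, which is the kind of context-dependent boundary placement the model is built to capture.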

12.
Previous cross-language research has indicated that some speech contrasts present greater perceptual difficulty for adult non-native listeners than others do. It has been hypothesized that phonemic, phonetic, and acoustic factors contribute to this variability. Two experiments were conducted to evaluate systematically the role of phonemic status and phonetic familiarity in the perception of non-native speech contrasts and to test predictions derived from a model proposed by Best, McRoberts, and Sithole (1988). Experiment 1 showed that perception of an unfamiliar phonetic contrast was not less difficult for subjects who had experience with an analogous phonemic distinction in their native language than for subjects without such analogous experience. These results suggest that substantive phonetic experience influences the perception of non-native contrasts, and thus should contribute to a conceptualization of native language-processing skills. In Experiment 2, English listeners' perception of two related nonphonemic place contrasts was not consistently different as had been expected on the basis of phonetic familiarity. A clear order effect in the perceptual data suggests that interactions between different perceptual assimilation patterns or acoustic properties of the two contrasts, or interactions involving both of these factors, underlie the perception of the two contrasts in this experiment. It was concluded that both phonetic familiarity and acoustic factors are potentially important to the explanation of variability in perception of nonphonemic contrasts. The explanation of how linguistic experience shapes speech perception will require characterizing the relative contribution of these factors, as well as other factors, including individual differences and variables that influence a listener's orientation to speech stimuli.

13.
Visual information conveyed by iconic hand gestures and visible speech can enhance speech comprehension under adverse listening conditions for both native and non-native listeners. However, how a listener allocates visual attention to these articulators during speech comprehension is unknown. We used eye-tracking to investigate whether and how native and highly proficient non-native listeners of Dutch allocated overt eye gaze to visible speech and gestures during clear and degraded speech comprehension. Participants watched video clips of an actress uttering a clear or degraded (6-band noise-vocoded) action verb while performing a gesture or not, and were asked to indicate the word they heard in a cued-recall task. Gestural enhancement (i.e., a relative reduction in reaction-time cost) was largest when speech was degraded for all listeners, but it was stronger for native listeners. Both native and non-native listeners mostly gazed at the face during comprehension, but non-native listeners gazed more often at gestures than native listeners. However, only native but not non-native listeners' gaze allocation to gestures predicted gestural benefit during degraded speech comprehension. We conclude that non-native listeners might gaze at gestures more because it is more challenging for them to resolve the degraded auditory cues and couple those cues to the phonological information conveyed by visible speech. This diminished phonological knowledge might hinder the use of the semantic information conveyed by gestures for non-native compared to native listeners. Our results demonstrate that the degree of language experience impacts overt visual attention to visual articulators, resulting in different visual benefits for native versus non-native listeners.

14.
This study describes the utilization of acoustic cues in communication of emotions in music performance. Three professional guitarists were asked to perform 3 short melodies to communicate anger, sadness, happiness, and fear to listeners. The resulting performances were analyzed with respect to 5 acoustic cues and judged by 30 listeners on adjective scales. Multiple regression analysis was applied to the relationships between (a) the performer's intention and the cues and (b) the listeners' judgments and the cues. The analyses of performers and listeners were related using C. J. Hursch, K. R. Hammond, and J. L. Hursch's (1964) lens model equation. The results indicated that (a) performers were successful at communicating emotions to listeners, (b) performers' cue utilization was well matched to listeners' cue utilization, and (c) cue utilization was more consistent across different melodies than across different performers. Because of the redundancy of the cues, 2 performers could communicate equally well despite differences in cue utilization.
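The lens-model logic this entry applies, relating intention-to-cue encoding on the performer's side to cue-to-judgment decoding on the listener's side, can be sketched on simulated data. The cue, weights, and noise levels below are invented for illustration, and plain correlations stand in for the study's multiple-regression analyses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated lens-model data (invented values): a performer encodes intended
# sadness into one acoustic cue (tempo), and a listener decodes the cue
# back into a sadness judgment.
n = 60
intention = rng.normal(size=n)                      # intended sadness
tempo = -0.8 * intention + rng.normal(0.0, 0.6, n)  # sadder -> slower tempo
judgment = -0.9 * tempo + rng.normal(0.0, 0.5, n)   # slower -> judged sadder

r_encode = np.corrcoef(intention, tempo)[0, 1]      # performer's cue use
r_decode = np.corrcoef(tempo, judgment)[0, 1]       # listener's cue use
achievement = np.corrcoef(intention, judgment)[0, 1]  # intention ~ judgment
print(round(r_encode, 2), round(r_decode, 2), round(achievement, 2))
```

When the listener's cue utilization mirrors the performer's cue use, communication succeeds: achievement (the intention-judgment correlation) is high even though each side only sees the cue.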

15.
In noisy situations, visual information plays a critical role in the success of speech communication: listeners are better able to understand speech when they can see the speaker. Visual influence on auditory speech perception is also observed in the McGurk effect, in which discrepant visual information alters listeners' auditory perception of a spoken syllable. When hearing /ba/ while seeing a person saying /ga/, for example, listeners may report hearing /da/. Because these two phenomena have been assumed to arise from a common integration mechanism, the McGurk effect has often been used as a measure of audiovisual integration in speech perception. In this study, we test whether this assumed relationship exists within individual listeners. We measured participants' susceptibility to the McGurk illusion as well as their ability to identify sentences in noise across a range of signal-to-noise ratios in audio-only and audiovisual modalities. Our results do not show a relationship between listeners' McGurk susceptibility and their ability to use visual cues to understand spoken sentences in noise, suggesting that McGurk susceptibility may not be a valid measure of audiovisual integration in everyday speech processing.

19.
In an eye movement experiment, we assessed the performance of young (18–30 years) and older (65+ years) adult readers when sentences contained conventional interword spaces, when interword spaces were removed, or when interword spaces were replaced by nonlinguistic symbols. The replacement symbol was either a closed square (■) that provided a salient (low-spatial-frequency) cue to word boundaries, or an open square (□) that provided a less salient cue and included features (vertical and horizontal lines) similar to those found in letters. Removing or replacing interword spaces slowed reading times and impaired normal eye movement behavior for both age groups. However, this disruption was greater for the older readers, particularly when the replacement symbol did not provide a salient cue as to word boundaries. Specific influences of this manipulation on word identification during reading were assessed by examining eye movements for a high- or low-frequency target word in each sentence. Standard word frequency effects were obtained for both age groups when text was spaced normally, and although the word frequency effect was larger when spaces were removed or filled, the increases were similar across age groups. Therefore, whereas older adults' normal eye movements were substantially disrupted when text lacked conventional interword spaces, the process of lexical access associated with the word frequency effect was no more difficult for older than for young adults. The indication, therefore, is that although older adults struggle with the loss of conventional cues to word boundaries, this is not due to additional difficulties in word recognition.

20.
Two talkers' productions of the same phoneme may be quite different acoustically, whereas their productions of different speech sounds may be virtually identical. Despite this lack of invariance in the relationship between the speech signal and linguistic categories, listeners experience phonetic constancy across a wide range of talkers, speaking styles, linguistic contexts, and acoustic environments. The authors present evidence that perceptual sensitivity to talker variability involves an active cognitive mechanism: Listeners expecting to hear 2 different talkers differing only slightly in average pitch showed performance costs typical of adjusting to talker variability, whereas listeners hearing the same materials but expecting a single talker or given no special instructions did not show these performance costs. The authors discuss the implications for understanding phonetic constancy despite variability between talkers (and other sources of variability) and for theories of speech perception. The results provide further evidence for active, controlled processing in real-time speech perception and are consistent with a model of talker normalization that involves contextual tuning.
