Similar Articles
The query returned 20 similar articles.
1.
Corpus-based word frequencies are one of the most important predictors in language processing tasks. Frequencies based on conversational corpora (such as movie subtitles) have been shown to capture the variance in lexical decision tasks better than traditional corpora. In this study, we show that frequencies computed from social media are currently the best frequency-based estimators of lexical decision reaction times (up to a 3.6% increase in explained variance). The results are robust (observed for Twitter- and Facebook-based frequencies on American English and British English datasets) and remain substantial when we control for corpus size.
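A minimal sketch (with a hypothetical input file and column names) of how such gains in explained variance are typically quantified: regress lexical decision RTs on log-transformed frequency from each source and compare the resulting R² values.

```python
# Sketch: compare how much variance in lexical decision RTs two frequency
# measures explain. The CSV file and its column names are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("lexical_decision.csv")  # columns: word, rt, freq_subtitle, freq_twitter

def r_squared(freq_counts, rts):
    """R^2 of a simple regression of RT on log10 frequency (add 1 to avoid log of 0)."""
    log_freq = np.log10(freq_counts + 1)
    slope, intercept, r, p, se = stats.linregress(log_freq, rts)
    return r ** 2

r2_subtitle = r_squared(df["freq_subtitle"], df["rt"])
r2_twitter = r_squared(df["freq_twitter"], df["rt"])
print(f"Subtitle R^2: {r2_subtitle:.3f}  Twitter R^2: {r2_twitter:.3f}")
print(f"Gain in explained variance: {100 * (r2_twitter - r2_subtitle):.1f}%")
```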

2.
We present a new database of Dutch word frequencies based on film and television subtitles, and we validate it with a lexical decision study involving 14,000 monosyllabic and disyllabic Dutch words. The new SUBTLEX frequencies explain up to 10% more variance in the accuracies and reaction times (RTs) of the lexical decision task than the existing CELEX word frequency norms, which are based largely on edited texts. As is the case for English, an accessibility measure based on contextual diversity explains more of the variance in accuracy and RT than do raw frequency-of-occurrence counts. The database is freely available for research purposes and may be downloaded from the authors' university site at http://crr.ugent.be/subtlex-nl or from http://brm.psychonomic-journals.org/content/supplemental.

3.
Word frequency is the most important variable in research on word processing and memory. Yet the main criterion for selecting word frequency norms has been the availability of the measure rather than its quality. As a result, much research is still based on the old Kučera and Francis frequency norms. Using the lexical decision times of recently published megastudies, we show how poor this measure is and what must be done to improve it. In particular, we investigated the size of the corpus, the language register on which the corpus is based, and the definition of the frequency measure. We observed that corpus size is of practical importance for small sizes (depending on the frequency of the word), but not for sizes above 16–30 million words. As for the language register, we found that frequencies based on television and film subtitles are better than frequencies based on written sources, certainly for the monosyllabic and bisyllabic words used in psycholinguistic research. Finally, we found that lemma frequencies are not superior to word form frequencies in English, and that a measure of contextual diversity is better than a measure based on raw frequency of occurrence. Part of the superiority of the contextual diversity measure is due to words that are frequently used as names. A new frequency norm assembled on the basis of these considerations turned out to predict word processing times much better than the existing norms (including Kučera & Francis and CELEX). The new SUBTL frequency norms from the SUBTLEX-US corpus are freely available for research purposes from http://brm.psychonomic-journals.org/content/supplemental, as well as from the University of Ghent and Lexique websites.
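The contextual diversity measure favored here is usually defined as the number of documents (e.g., films or TV episodes) in which a word occurs, rather than its total token count. A minimal sketch, assuming the corpus is available as a list of tokenized documents:

```python
# Sketch: raw frequency vs. contextual diversity from a corpus split into
# documents (e.g., one subtitle file per film). Tokenization is simplified.
from collections import Counter

def frequency_and_diversity(documents):
    """documents: list of token lists. Returns per-word raw counts and
    contextual diversity (number of documents containing the word)."""
    raw = Counter()
    diversity = Counter()
    for tokens in documents:
        raw.update(tokens)
        diversity.update(set(tokens))  # each document counts at most once per word
    return raw, diversity

docs = [["the", "dog", "ran"], ["the", "dog", "ran", "ran"], ["the", "cat", "sat"]]
raw, cd = frequency_and_diversity(docs)
print(raw["ran"], cd["ran"])  # 3 occurrences, but diversity 2
```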

4.
In a critical review of the heuristics used to deal with zero word frequencies, we show that four are suboptimal, one is good, and one may be acceptable. The four suboptimal strategies are discarding words with zero frequencies, giving words with zero frequencies a very low frequency, adding 1 to the frequency per million, and making use of the Good–Turing algorithm. The good algorithm is the Laplace transformation, which consists of adding 1 to each frequency count and increasing the total corpus size by the number of word types observed. A strategy that may be acceptable is to guess the frequency of absent words on the basis of other corpora and then to increase the total corpus size by the estimated summed frequency of the missing words. A comparison with the lexical decision times of the English Lexicon Project and the British Lexicon Project suggests that the Laplace transformation gives the most useful estimates (in addition to being easy to calculate). We therefore recommend it to researchers.
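The recommended Laplace transformation is fully specified above: add 1 to each frequency count and increase the corpus size by the number of observed word types. A short sketch with hypothetical corpus figures:

```python
# Sketch of the Laplace transformation described above: add 1 to every
# frequency count (including zero counts for absent words) and enlarge the
# corpus size by the number of observed word types.
def laplace_frequency_per_million(count, corpus_size, n_types):
    """Smoothed frequency per million words; count may be 0."""
    return (count + 1) / (corpus_size + n_types) * 1_000_000

# Hypothetical corpus of 51 million tokens with 300,000 word types:
print(laplace_frequency_per_million(0, 51_000_000, 300_000))     # unseen word: ~0.02 pm
print(laplace_frequency_per_million(1020, 51_000_000, 300_000))  # ~19.9 per million
```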

5.
6.
Recent research on anagram solution has produced two original findings. First, it has shown that a new bigram frequency measure called top rank, which is based on a comparison of summed bigram frequencies, is an important predictor of anagram difficulty. Second, it has suggested that measures from a type count are better than token measures at predicting anagram difficulty. Testing these hypotheses has been difficult because the required bigram statistics are laborious to compute. We present a program that calculates bigram measures for two- to nine-letter words. We then show how the program can be used to compare the contribution of top rank and other bigram frequency measures derived from both a token and a type count. Contrary to previous research, we report that type measures are not better at predicting anagram solution times and that top rank is not the best predictor of anagram difficulty. Lastly, we use this program to show that type bigram frequencies are not as good as token bigram frequencies at predicting word identification reaction time.

7.
In this article, we present a new lexical database for French: Lexique. In addition to classical word information such as gender, number, and grammatical category, Lexique includes a series of interesting new characteristics. First, word frequencies are based on two cues: a contemporary corpus of texts and the number of Web pages containing the word. Second, the database is split into a graphemic table with all the relevant frequencies, a table structured around lemmas (particularly interesting for the study of the inflectional family), and a table about surface frequency cues. Third, Lexique is distributed under a GNU-like license, allowing people to contribute to it. Finally, a metasearch engine, Open Lexique, has been developed so that new databases can be added very easily to the existing ones. Lexique can either be downloaded or interrogated freely from http://www.lexique.org.

8.
The SUBTLEX-US corpus has been parsed with the CLAWS tagger, so that researchers have information about the possible word classes (parts of speech, or PoSs) of the entries. Five new columns have been added to the SUBTLEX-US word frequency list: the dominant (most frequent) PoS for the entry, the frequency of the dominant PoS, the frequency of the dominant PoS relative to the entry's total frequency, all PoSs observed for the entry, and the respective frequencies of these PoSs. Because the current definition of lemma frequency does not seem to provide word recognition researchers with useful information (as illustrated by a comparison of the lemma frequencies and the word form frequencies from the Corpus of Contemporary American English), we have not provided a column with this variable. Instead, we hope that the full list of PoS frequencies will help researchers to collectively determine which combination of frequencies is the most informative.
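A minimal sketch of how the five new columns can be derived from a per-entry tally of PoS frequencies (the tally below is hypothetical, not an actual SUBTLEX-US value):

```python
# Sketch: derive the five SUBTLEX-US PoS columns from one entry's tally of
# part-of-speech frequencies.
def pos_columns(pos_freqs):
    """pos_freqs: dict mapping PoS tag -> frequency for one entry."""
    total = sum(pos_freqs.values())
    dom_pos = max(pos_freqs, key=pos_freqs.get)
    return {
        "dom_pos": dom_pos,                                  # dominant PoS
        "dom_pos_freq": pos_freqs[dom_pos],                  # its frequency
        "dom_pos_rel_freq": pos_freqs[dom_pos] / total,      # share of total
        "all_pos": sorted(pos_freqs, key=pos_freqs.get, reverse=True),
        "all_pos_freqs": sorted(pos_freqs.values(), reverse=True),
    }

print(pos_columns({"noun": 8105, "verb": 2320}))  # hypothetical tally for one entry
```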

9.
Written word frequency (e.g., Francis & Kučera, 1982; Kučera & Francis, 1967) constitutes a popular measure of word familiarity, which is highly predictive of word recognition. Far less often, researchers employ spoken frequency counts in their studies. This discrepancy can be attributed most readily to the conspicuous absence of a sizeable spoken frequency count for American English. The present article reports the construction of a 1.6-million-word spoken frequency database derived from the Michigan Corpus of Academic Spoken English (Simpson, Swales, & Briggs, 2002). We generated spoken frequency counts for 34,922 words and extracted speaker attributes from the source material to generate relative frequencies of words spoken by each speaker category. We assess the predictive validity of these counts and discuss some possible applications outside of word recognition studies.
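A minimal sketch of the two-step counting described, overall spoken frequencies plus relative frequencies by speaker category; the data structure and values are hypothetical stand-ins for the transcripts:

```python
# Sketch: spoken frequency counts from transcribed utterances, plus
# per-speaker-category relative frequencies.
from collections import Counter, defaultdict

utterances = [  # (speaker_category, tokens) -- toy stand-ins for transcripts
    ("faculty", ["the", "data", "suggest"]),
    ("student", ["the", "data", "are", "weird"]),
]

overall = Counter()
by_category = defaultdict(Counter)
for category, tokens in utterances:
    overall.update(tokens)
    by_category[category].update(tokens)

def relative_frequency(word, category):
    """Share of a word's occurrences produced by one speaker category."""
    return by_category[category][word] / overall[word]

print(overall["data"], relative_frequency("data", "faculty"))  # 2 0.5
```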

10.
WORD FREQUENCY AND WORD DIFFICULTY
This article compares word counts made using four different collections of text, including one based on collections of electronic text. For each of the collections, standard word frequency indices were computed and compared with a carefully developed list of words ranked in order of difficulty as determined by vocabulary tests. Correlations between the word frequency indices and word difficulty ranks show that word frequencies for all four corpora are highly correlated with word difficulty. Despite these high correlations, the results also show that the difficulty of some words is not estimated accurately by word frequency. The reasons for disparities between word frequency and word difficulty are not clear. The high correlations obtained for the corpus based on electronic text suggest that this method of text sampling has potential, but that caution is advisable in conducting such collections.
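Rank correlations of this kind can be computed with Spearman's rho; a sketch with hypothetical frequency counts and difficulty ranks:

```python
# Sketch: correlating a word frequency index with test-based difficulty
# ranks, as in the comparison above. All values are hypothetical.
from scipy.stats import spearmanr

words      = ["the", "house", "reluctant", "obsequious"]
frequency  = [69971, 591, 33, 2]   # counts from some corpus
difficulty = [1, 2, 3, 4]          # rank from vocabulary tests (1 = easiest)

rho, p = spearmanr(frequency, difficulty)
print(f"Spearman rho = {rho:.2f}")  # strongly negative: frequent words are easier
```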

11.
In recent years, a considerable number of studies have tried to establish which characteristics of objects and their names predict the responses of patients with Alzheimer's disease (AD) in the picture-naming task. The frequency of use of words and their age of acquisition (AoA) have been implicated as two of the most influential variables, with naming being best preserved for objects with high-frequency, early-acquired names. The present study takes a fresh look at the predictors of naming success in Spanish and English AD patients using a range of measures of word frequency and AoA along with visual complexity, imageability, and word length as predictors. Analyses using generalized linear mixed modelling found that naming accuracy was better predicted by AoA ratings taken from older adults than conventional ratings from young adults. Older frequency measures based on written language samples predicted accuracy better than more modern measures based on the frequencies of words in film subtitles. Replacing adult frequency with an estimate of cumulative (lifespan) frequency did not reduce the impact of AoA. Semantic error rates were predicted by both written word frequency and senior AoA while null response errors were only predicted by frequency. Visual complexity, imageability, and word length did not predict naming accuracy or errors.

12.
Prior research has shown that people can learn many nouns (i.e., word–object mappings) from a short series of ambiguous situations containing multiple words and objects. For successful cross-situational learning, people must approximately track which words and referents co-occur most frequently. This study investigates the effects of allowing some word-referent pairs to appear more frequently than others, as is true in real-world learning environments. Surprisingly, high-frequency pairs are not always learned better, but can also boost learning of other pairs. Using a recent associative model (Kachergis, Yu, & Shiffrin, 2012), we explain how mixing pairs of different frequencies can bootstrap late learning of the low-frequency pairs based on early learning of higher frequency pairs. We also manipulate contextual diversity, the number of pairs a given pair appears with across training, since it is naturalistically confounded with frequency. The associative model has competing familiarity and uncertainty biases, and their interaction is able to capture the individual and combined effects of frequency and contextual diversity on human learning. Two other recent word-learning models do not account for the behavioral findings.
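As a baseline illustration of the co-occurrence tracking that cross-situational learning requires, here is a minimal tally-and-guess sketch; note this is only the co-occurrence baseline, not the Kachergis, Yu, and Shiffrin (2012) associative model with its familiarity and uncertainty biases:

```python
# Sketch: track word-object co-occurrence counts across ambiguous trials and
# guess the referent each word co-occurred with most often. Words and objects
# are made-up examples.
from collections import defaultdict

cooc = defaultdict(lambda: defaultdict(int))

def observe(words, objects):
    """One ambiguous trial: every word co-occurs with every object present."""
    for w in words:
        for o in objects:
            cooc[w][o] += 1

def guess(word):
    """Pick the most frequently co-occurring object for a word."""
    return max(cooc[word], key=cooc[word].get)

observe(["bosa", "gasser"], ["BALL", "DOG"])
observe(["bosa", "manu"], ["BALL", "CAT"])
print(guess("bosa"))  # "BALL": co-occurred twice, all others once
```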

13.
Researchers often require subjects to make judgments that call upon their knowledge of the orthographic structure of English words. Such knowledge is relevant in experiments on, for example, reading, lexical decision, and anagram solution. One common measure of orthographic structure is the sum of the frequencies of consecutive bigrams in the word. Traditionally, researchers have relied on token-based norms of bigram frequencies. These norms confound bigram frequency with word frequency because each instance (i.e., token) of a particular word in a corpus of running text increments the frequencies of the bigrams that it contains. In this article, the authors report a set of type-based bigram frequencies in which each word (i.e., type) contributes only once, thereby unconfounding bigram frequency from word frequency. The authors show that type-based bigram frequency is a better predictor of the difficulty of anagram solution than is token-based frequency. These norms can be downloaded from www.psychonomic.org/archive/.
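A minimal sketch of both norms and of the summed-bigram measure, using a three-word toy lexicon with illustrative (not actual) corpus frequencies:

```python
# Sketch: token-based vs. type-based bigram norms. In a token count, each
# occurrence of a word increments its bigrams (confounding bigram frequency
# with word frequency); in a type count, each word contributes once.
from collections import Counter

def bigram_norms(word_freqs):
    """word_freqs: dict mapping word -> corpus frequency."""
    token_norms, type_norms = Counter(), Counter()
    for word, freq in word_freqs.items():
        for bigram in zip(word, word[1:]):
            token_norms[bigram] += freq  # weighted by word frequency
            type_norms[bigram] += 1      # each word type counts once
    return token_norms, type_norms

def summed_bigram_frequency(word, norms):
    """Orthographic-structure measure: sum of consecutive bigram frequencies."""
    return sum(norms[bg] for bg in zip(word, word[1:]))

# Illustrative counts: the token norm for "hen" is dominated by "the".
token_n, type_n = bigram_norms({"the": 69971, "then": 3366, "hen": 50})
print(summed_bigram_frequency("hen", token_n))  # 76803
print(summed_bigram_frequency("hen", type_n))   # 5
```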

14.
In this article, we present a new lexical database for French: Lexique. In addition to classical word information such as gender, number, and grammatical category, Lexique includes a series of interesting new characteristics. First, word frequencies are based on two cues: a contemporary corpus of texts and the number of Web pages containing the word. Second, the database is split into a graphemic table with all the relevant frequencies, a table structured around lemmas (particularly interesting for the study of the inflectional family), and a table about surface frequency cues. Third, Lexique is distributed under a GNU-like license, allowing people to contribute to it. Finally, a metasearch engine, Open Lexique, has been developed so that new databases can be added very easily to the existing ones. Lexique can either be downloaded or interrogated freely from http://www.lexique.org.

15.
The word frequency effect refers to the phenomenon in language production whereby people process high-frequency words faster and more accurately than low-frequency words; it may arise at different stages of language production. Comparing the characteristics and processing mechanisms of the word frequency effect in younger and older adults makes it possible to examine the cognitive aging mechanisms of language production. Theories of language production can be used to predict how the word frequency effect ages; we propose that the effect remains relatively stable across individual development and aging, and we analyze the aging-related changes in its neural bases and processing time course. Future research could further separate the influences of the word frequency effect and the age-of-acquisition effect on the aging of language production, and extend this work to patients with neurodegenerative diseases.

16.
In this article, we present a new lexical database for Modern Standard Arabic: Aralex. Based on a contemporary text corpus of 40 million words, Aralex provides information about (1) the token frequencies of roots and word patterns, (2) the type frequency, or family size, of roots and word patterns, and (3) the frequency of bigrams and trigrams in orthographic forms, roots, and word patterns. Aralex will be a useful tool for studying the cognitive processing of Arabic through the selection of stimuli on the basis of precise frequency counts. Researchers can use it as a source of information on natural language processing, and it may serve an educational purpose by providing basic vocabulary lists. Aralex is distributed under a GNU-like license, allowing people to interrogate it freely online or to download it from www.mrc-cbu.cam.ac.uk:8081/aralex.online/login.jsp.

17.
Background: When constructing stimuli for experimental investigations of cognitive processes in early reading development, researchers have had to rely on adult or American children's word frequency counts, as no such counts exist for English children. Aim: The present paper introduces a database of children's early reading vocabulary, for use by researchers and teachers. Sample: Texts from 685 books from reading schemes and storybooks read by 5- to 7-year-old children were used in the construction of the database. Method: All words from the 685 books were typed or scanned into an Oracle database. Results: The resulting up-to-date word frequency list of early print exposure in the UK is available in two forms from the website address given in this paper: one list of the words ordered alphabetically and one list ordered by frequency. We also briefly address some fundamental issues underlying early reading vocabulary (e.g., that it is heavily skewed towards low frequencies) and discuss other characteristics of the vocabulary. Conclusions: We hope the word frequency lists will be of use to researchers seeking to control word frequency, and to teachers interested in the vocabulary to which young children are exposed in their reading material.

18.
Word stem completion tasks involve showing participants a number of words and later asking them to complete word stems to make full words. If a stem is completed with one of the studied words, this indicates memory. The test is widely used to assess both implicit and explicit forms of memory. An important aspect of stimulus selection is that target words should not frequently be generated spontaneously from the word stem, to ensure that production of the word really reflects memory. In this article, we present a database of spontaneous stem completion rates for 395 stems from a group of 80 British undergraduate psychology students. It includes information on other characteristics of the words (word frequency, concreteness, imageability, age of acquisition, common part of speech, and number of letters) and, as such, can be used to select suitable words for a stem completion task. Supplemental materials for this article may be downloaded from http://brm.psychonomic-journals.org/content/supplemental.

19.
Researchers often require subjects to make judgments that call upon their knowledge of the orthographic structure of English words. Such knowledge is relevant in experiments on, for example, reading, lexical decision, and anagram solution. One common measure of orthographic structure is the sum of the frequencies of consecutive bigrams in the word. Traditionally, researchers have relied on token-based norms of bigram frequencies. These norms confound bigram frequency with word frequency because each instance (i.e., token) of a particular word in a corpus of running text increments the frequencies of the bigrams that it contains. In this article, the authors report a set of type-based bigram frequencies in which each word (i.e., type) contributes only once, thereby unconfounding bigram frequency from word frequency. The authors show that type-based bigram frequency is a better predictor of the difficulty of anagram solution than is token-based frequency. These norms can be downloaded from www.psychonomic.org/archive/.

20.
In this article, we present Procura-PALavras (P-PAL), a Web-based interface for a new European Portuguese (EP) lexical database. Based on a contemporary printed corpus of over 227 million words, P-PAL provides a broad range of word attributes and statistics, including several measures of word frequency (e.g., raw counts, per-million word frequency, logarithmic Zipf scale), morpho-syntactic information (e.g., parts of speech [PoSs], grammatical gender and number, dominant PoS, and frequency and relative frequency of the dominant PoS), as well as several lexical and sublexical orthographic (e.g., number of letters; consonant–vowel orthographic structure; density and frequency of orthographic neighbors; orthographic Levenshtein distance; orthographic uniqueness point; orthographic syllabification; and trigram, bigram, and letter type and token frequencies), and phonological measures (e.g., pronunciation, number of phonemes, stress, density and frequency of phonological neighbors, transposed and phonographic neighbors, syllabification, and biphone and phone type and token frequencies) for ~53,000 lemmatized and ~208,000 nonlemmatized EP word forms. To obtain these metrics, researchers can choose between two word queries in the application: (i) analyze words previously selected for specific attributes and/or lexical and sublexical characteristics, or (ii) generate word lists that meet word requirements defined by the user in the menu of analyses. For the measures it provides and the flexibility it allows, P-PAL will be a key resource to support research in all cognitive areas that use EP verbal stimuli. P-PAL is freely available at http://p-pal.di.uminho.pt/tools.
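Two of the frequency measures listed, per-million frequency and the Zipf scale, are simple transformations of raw counts; the Zipf scale is commonly defined as log10 of the frequency per billion words, i.e., log10(frequency per million) + 3 (van Heuven et al., 2014). A sketch with hypothetical counts:

```python
# Sketch: per-million frequency and the Zipf scale from raw counts. The word
# counts below are hypothetical; the corpus size matches the abstract.
import math

def per_million(count, corpus_size):
    return count / corpus_size * 1_000_000

def zipf(count, corpus_size):
    """Zipf scale: log10 of the frequency per billion words."""
    return math.log10(per_million(count, corpus_size)) + 3

corpus_size = 227_000_000  # P-PAL's printed corpus of over 227 million words
print(per_million(45, corpus_size))    # ~0.2 per million
print(zipf(45, corpus_size))           # Zipf ~2.3: a low-frequency word
print(zipf(227_000, corpus_size))      # Zipf 6.0: a very high-frequency word
```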
