Similar articles
 20 similar articles found; search took 15 ms
1.
In this article, we present a new lexical database for Modern Standard Arabic: Aralex. Based on a contemporary text corpus of 40 million words, Aralex provides information about (1) the token frequencies of roots and word patterns, (2) the type frequency, or family size, of roots and word patterns, and (3) the frequency of bigrams and trigrams in orthographic forms, roots, and word patterns. Aralex will be a useful tool for studying the cognitive processing of Arabic through the selection of stimuli on the basis of precise frequency counts. Researchers can use it as a source of information on natural language processing, and it may serve an educational purpose by providing basic vocabulary lists. Aralex is distributed under a GNU-like license, allowing people to interrogate it freely online or to download it from www.mrc-cbu.cam.ac.uk:8081/aralex.online/login.jsp.
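
The three kinds of counts listed above can be illustrated with a minimal sketch. The toy corpus, the transliterated forms, and the function names are hypothetical and are not taken from Aralex; the sketch assumes each token has already been analyzed into a surface form, a root, and a word pattern.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: one (surface form, root, word pattern) triple per token.
CORPUS = [
    ("kataba", "ktb", "faEala"),
    ("kitaab", "ktb", "fiEaal"),
    ("kataba", "ktb", "faEala"),
    ("kaatib", "ktb", "faaEil"),
    ("darasa", "drs", "faEala"),
]


def root_frequencies(tokens):
    """Return (token frequency, family size) per root.

    Token frequency counts every occurrence of a root; family size
    (type frequency) counts the distinct surface forms sharing it.
    """
    token_freq = Counter(root for _, root, _ in tokens)
    forms = defaultdict(set)
    for form, root, _ in tokens:
        forms[root].add(form)
    family_size = {root: len(members) for root, members in forms.items()}
    return token_freq, family_size


def bigram_counts(strings):
    """Count adjacent character pairs over all given strings."""
    counts = Counter()
    for s in strings:
        counts.update(s[i:i + 2] for i in range(len(s) - 1))
    return counts


token_freq, family_size = root_frequencies(CORPUS)
root_bigrams = bigram_counts(root for _, root, _ in CORPUS)
```

The same two functions could be applied to word patterns instead of roots by swapping which field of the triple is used as the key.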

2.
3.
To conduct experimental investigations into the orthographic processing of Modern Greek, information is needed about the lexical properties known to influence visual word recognition. In this article we introduce GreekLex, a lexical database for Modern Greek, which presents collectively for the first time a series of orthographic measures that can be used for psycholinguistic research. GreekLex consists of 35,304 Modern Greek words ranging in length from 1 to 22 letters, and for each word includes the following statistical information: word length, word-form frequency, lemma frequency, neighborhood density and frequency, transposition neighbors, and addition and deletion neighbors. Furthermore, type and token frequency measures of single letters and bigrams derived from the database are also available. The complete database can be accessed and downloaded freely from www.psychology.nottingham.ac.uk/GreekLex.
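
As a rough illustration of the neighbor measures named above, the sketch below computes substitution, transposition, deletion, and addition neighbors over a small lexicon. The English toy lexicon and the function names are assumptions made for readability; GreekLex's own definitions may differ in detail (for example, in whether only adjacent letters count as transpositions, as assumed here).

```python
def substitution_neighbors(word, lexicon):
    """Words of the same length that differ in exactly one letter position."""
    return {w for w in lexicon
            if len(w) == len(word)
            and sum(a != b for a, b in zip(w, word)) == 1}


def transposition_neighbors(word, lexicon):
    """Words obtained by swapping two adjacent letters of `word`."""
    swaps = {word[:i] + word[i + 1] + word[i] + word[i + 2:]
             for i in range(len(word) - 1)}
    return (swaps & set(lexicon)) - {word}


def _deletions(word):
    """All strings obtained by removing one letter from `word`."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def deletion_neighbors(word, lexicon):
    """Lexicon words obtained by deleting one letter from `word`."""
    return _deletions(word) & set(lexicon)


def addition_neighbors(word, lexicon):
    """Lexicon words obtained by adding one letter to `word`."""
    return {w for w in lexicon
            if len(w) == len(word) + 1 and word in _deletions(w)}


# Hypothetical toy lexicon (English, for illustration only).
LEXICON = {"train", "trail", "trial", "brain", "grain", "rain", "trains"}
```

With this lexicon, "train" has three substitution neighbors ("trail", "brain", "grain"), one deletion neighbor ("rain"), and one addition neighbor ("trains"), while "trail" and "trial" are transposition neighbors of each other. Neighborhood density is then simply the size of the substitution set.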

4.
In this article, we present a new lexical database for French: Lexique. In addition to classical word information such as gender, number, and grammatical category, Lexique includes a series of interesting new characteristics. First, word frequencies are based on two cues: a contemporary corpus of texts and the number of Web pages containing the word. Second, the database is split into a graphemic table with all the relevant frequencies, a table structured around lemmas (particularly interesting for the study of the inflectional family), and a table about surface frequency cues. Third, Lexique is distributed under a GNU-like license, allowing people to contribute to it. Finally, a metasearch engine, Open Lexique, has been developed so that new databases can be added very easily to the existing ones. Lexique can either be downloaded or interrogated freely from http://www.lexique.org.

5.
In an auditory lexical decision experiment, 5541 spoken content words and pseudowords were presented to 20 native speakers of Dutch. The words vary in phonological make-up and in number of syllables and stress pattern, and are further representative of the native Dutch vocabulary in that most are morphologically complex, comprising two stems or one stem plus derivational and inflectional suffixes, with inflections representing both regular and irregular paradigms; the pseudowords were matched in these respects to the real words. The BALDEY (“biggest auditory lexical decision experiment yet”) data file includes response times and accuracy rates, together with morphological information for each item plus phonological and acoustic information derived from automatic phonemic segmentation of the stimuli. Two initial analyses illustrate how this data set can be used. First, we discuss several measures of the point at which a word has no further neighbours and compare the degree to which each measure predicts our lexical decision response outcomes. Second, we investigate how well four different measures of frequency of occurrence (from written corpora, spoken corpora, subtitles, and frequency ratings by 75 participants) predict the same outcomes. These analyses motivate general conclusions about the auditory lexical decision task. The (publicly available) BALDEY database lends itself to many further analyses.
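
One simple way to operationalize "the point at which a word has no further neighbours" is a uniqueness point: the first position at which a word's onset is shared with no other lexicon entry. The sketch below is only one such measure, not necessarily any of those compared in the BALDEY analyses, and it works on letter strings for simplicity, whereas an auditory study would use phoneme sequences. The function name and toy lexicon are hypothetical.

```python
def uniqueness_point(word, lexicon):
    """Return the 1-based position at which `word` diverges from all other entries.

    Returns None when the whole word is still the onset of a longer entry,
    so that it never becomes unique before its end.
    """
    others = [w for w in lexicon if w != word]
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if not any(w.startswith(prefix) for w in others):
            return i
    return None


# Hypothetical toy lexicon.
TOY_LEXICON = {"cat", "cattle", "car", "dog"}
```

Here "dog" is unique from its first segment, "car" only at its third, and "cat" never becomes unique because "cattle" continues it.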

6.
7.
Malay, a language spoken by 250 million people, has a shallow alphabetic orthography, simple syllable structures, and transparent affixation, characteristics that contrast sharply with those of English. In the present article, we first compare the letter-phoneme and letter-syllable ratios for a sample of alphabetic orthographies to highlight the importance of separating language-specific from language-universal reading processes. Then, in order to develop a better understanding of word recognition in orthographies with more consistent mappings to phonology than English, we compiled a database of lexical variables (letter length, syllable length, phoneme length, morpheme length, word frequency, orthographic and phonological neighborhood sizes, and orthographic and phonological Levenshtein distances) for 9,592 Malay words. Separate hierarchical regression analyses for Malay and English revealed how the consistency of orthography-phonology mappings selectively modulates the effects of different lexical variables on lexical decision and speeded pronunciation performance. The database of lexical and behavioral measures for Malay is available at http://brm.psychonomic-journals.org/content/supplemental.
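
The Levenshtein distance measures mentioned above can be sketched as follows. The edit-distance function is the standard dynamic-programming algorithm; summarizing it as the mean distance to the n closest lexicon entries (n = 20 is the conventional choice in the psycholinguistic literature) is an assumption about how the database defines its measure. The function names and toy strings are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def mean_closest_distance(word, lexicon, n=20):
    """Mean Levenshtein distance from `word` to its n closest lexicon entries."""
    distances = sorted(levenshtein(word, w) for w in lexicon if w != word)
    closest = distances[:n]
    if not closest:
        raise ValueError("lexicon contains no entry other than the target word")
    return sum(closest) / len(closest)
```

Applying the same functions to phoneme strings instead of letter strings would give the phonological counterpart.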

8.
This article presents MANULEX, a Web-accessible database that provides grade-level word frequency lists of nonlemmatized and lemmatized words (48,886 and 23,812 entries, respectively) computed from the 1.9 million words taken from 54 French elementary school readers. Word frequencies are provided for four levels: first grade (G1), second grade (G2), third to fifth grades (G3-5), and all grades (G1-5). The frequencies were computed following the methods described by Carroll, Davies, and Richman (1971) and Zeno, Ivens, Millard, and Duvvuri (1995), with four statistics at each level (F, overall word frequency; D, index of dispersion across the selected readers; U, estimated frequency per million words; and SFI, standard frequency index). The database also provides the number of letters in the word and syntactic category information. MANULEX is intended to be a useful tool for studying language development through the selection of stimuli based on precise frequency norms. Researchers in artificial intelligence can also use it as a source of information on natural language processing to simulate written language acquisition in children. Finally, it may serve an educational purpose by providing basic vocabulary lists.
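
The four statistics can be sketched as below. The abstract does not give the formulas; the ones used here are the forms usually attributed to Carroll, Davies, and Richman (1971), stated as an assumption that should be checked against the cited sources: D is the normalized entropy of a word's frequencies across readers, U weights the raw frequency F by D and rescales to one million words, and SFI = 10(log10 U + 4). Using raw per-reader frequencies as the p_i in D is a further simplification. All function names are hypothetical.

```python
import math


def dispersion(freqs):
    """D: normalized entropy of a word's frequencies across n readers (0 to 1)."""
    n = len(freqs)
    total = sum(freqs)
    if n < 2 or total == 0:
        return 0.0
    entropy = math.log(total) - sum(f * math.log(f) for f in freqs if f > 0) / total
    return entropy / math.log(n)


def estimated_per_million(freqs, sizes):
    """U: dispersion-adjusted frequency per million words (assumed formula).

    freqs[i] is the word's frequency in reader i; sizes[i] is that reader's
    total number of running words.
    """
    n_total = sum(sizes)                     # N, corpus size in tokens
    f_total = sum(freqs)                     # F, overall word frequency
    d = dispersion(freqs)
    f_min = sum(f * s for f, s in zip(freqs, sizes)) / n_total
    return (1_000_000 / n_total) * (f_total * d + (1 - d) * f_min)


def standard_frequency_index(u):
    """SFI: logarithmic transform of U, so that U = 1 per million gives SFI = 40."""
    return 10 * (math.log10(u) + 4)
```

A word spread evenly over all readers has D = 1 and U equal to its plain per-million frequency; a word confined to a single reader has D = 0 and a correspondingly reduced U.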

9.
The LEXIN database offers psycholinguistic indexes of the 13,184 different words (types) computed from 178,839 occurrences of these words (tokens) contained in a corpus of 134 beginning readers widely used in Spain. This database provides four statistical indicators: F (overall word frequency), D (index of dispersion across selected readers), U (estimated frequency per million words), and SFI (standard frequency index). It also gives information about the number of letters, syntactic category, and syllabic structure of the words included. To facilitate comparisons, LEXIN provides data from LEXESP’s (Sebastián-Gallés, Martí, Cuetos, & Carreiras, 2000), Alameda and Cuetos’s (1995), and Martínez and García’s (2004) Spanish adult psycholinguistic frequency databases. Access to the LEXIN database is facilitated by a computer program. The LEXIN program allows for the creation of word lists by letting the user specify searching criteria. LEXIN can be useful for researchers in cognitive psychology, particularly in the areas of psycholinguistics and education.

10.
In this article, we present Procura-PALavras (P-PAL), a Web-based interface for a new European Portuguese (EP) lexical database. Based on a contemporary printed corpus of over 227 million words, P-PAL provides a broad range of word attributes and statistics, including several measures of word frequency (e.g., raw counts, per-million word frequency, logarithmic Zipf scale), morpho-syntactic information (e.g., parts of speech [PoSs], grammatical gender and number, dominant PoS, and frequency and relative frequency of the dominant PoS), as well as several lexical and sublexical orthographic (e.g., number of letters; consonant–vowel orthographic structure; density and frequency of orthographic neighbors; orthographic Levenshtein distance; orthographic uniqueness point; orthographic syllabification; and trigram, bigram, and letter type and token frequencies), and phonological measures (e.g., pronunciation, number of phonemes, stress, density and frequency of phonological neighbors, transposed and phonographic neighbors, syllabification, and biphone and phone type and token frequencies) for ~53,000 lemmatized and ~208,000 nonlemmatized EP word forms. To obtain these metrics, researchers can choose between two word queries in the application: (i) analyze words previously selected for specific attributes and/or lexical and sublexical characteristics, or (ii) generate word lists that meet word requirements defined by the user in the menu of analyses. For the measures it provides and the flexibility it allows, P-PAL will be a key resource to support research in all cognitive areas that use EP verbal stimuli. P-PAL is freely available at http://p-pal.di.uminho.pt/tools.
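
The three frequency measures named above are related by simple arithmetic, sketched here. The Zipf scale is taken in its basic form, log10 of the per-million frequency plus 3; whether P-PAL applies any smoothing to the raw counts is not stated in the abstract, so none is applied. The corpus size of exactly 227,000,000 tokens is an illustrative round figure, and the function names are hypothetical.

```python
import math


def per_million(raw_count, corpus_size):
    """Occurrences per million tokens."""
    return raw_count * 1_000_000 / corpus_size


def zipf(raw_count, corpus_size):
    """Zipf value: log10 of frequency per billion, i.e. log10(per million) + 3."""
    return math.log10(per_million(raw_count, corpus_size)) + 3


# Illustrative corpus size only; the abstract says "over 227 million words".
CORPUS_SIZE = 227_000_000
```

On this scale a word occurring once per million tokens has a Zipf value of 3, and each tenfold increase in frequency adds one unit.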

12.
In this article, we introduce ESCOLEX, the first European Portuguese children’s lexical database with grade-level-adjusted word frequency statistics. Computed from a 3.2-million-word corpus, ESCOLEX provides 48,381 word forms extracted from 171 elementary and middle school textbooks for 6- to 11-year-old children attending the first six grades in the Portuguese educational system. Like other children’s grade-level databases (e.g., Carroll, Davies, & Richman, 1971; Corral, Ferrero, & Goikoetxea, Behavior Research Methods, 41, 1009–1017, 2009; Lété, Sprenger-Charolles, & Colé, Behavior Research Methods, Instruments, & Computers, 36, 156–166, 2004; Zeno, Ivens, Millard, & Duvvuri, 1995), ESCOLEX provides four frequency indices for each grade: overall word frequency (F), index of dispersion across the selected textbooks (D), estimated frequency per million words (U), and standard frequency index (SFI). It also provides a new measure, contextual diversity (CD). In addition, the number of letters in the word and its part(s) of speech, number of syllables, syllable structure, and adult frequencies taken from P-PAL (a European Portuguese corpus-based lexical database; Soares, Comesaña, Iriarte, Almeida, Simões, Costa, …, Machado, 2010; Soares, Iriarte, Almeida, Simões, Costa, França, …, Comesaña, in press) are provided. ESCOLEX will be a useful tool both for researchers interested in language processing and development and for professionals in need of verbal materials adjusted to children’s developmental stages. ESCOLEX can be downloaded along with this article or from http://p-pal.di.uminho.pt/about/databases.
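
Contextual diversity is commonly operationalized as the number of documents in which a word occurs at least once, regardless of how often it occurs within each. The sketch below assumes that definition with one tokenized textbook per document; the abstract does not spell out ESCOLEX's exact computation, and the function name and toy data are hypothetical.

```python
from collections import Counter


def contextual_diversity(documents):
    """Count, for each word, the number of documents that contain it."""
    cd = Counter()
    for tokens in documents:
        cd.update(set(tokens))  # each document contributes at most 1 per word
    return cd


# Hypothetical toy data: three tokenized "textbooks".
TEXTBOOKS = [
    ["a", "b", "a"],
    ["a", "c"],
    ["b", "b"],
]
cd = contextual_diversity(TEXTBOOKS)
```

Note that "a" and "b" both reach a CD of 2 even though their raw frequencies are 3 in each case and are distributed differently, which is what distinguishes CD from F.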

13.
In this article, we present StimulStat, a lexical database for the Russian language in the form of a web application. The database contains more than 52,000 of the most frequent Russian lemmas and more than 1.7 million word forms derived from them. These lemmas and forms are characterized according to more than 70 properties that were demonstrated to be relevant for psycholinguistic research, including frequency, length, phonological and grammatical properties, orthographic and phonological neighborhood frequency and size, grammatical ambiguity, homonymy, and polysemy. Some properties were retrieved from various dictionaries and are presented collectively in a searchable form for the first time; the others were computed specifically for the database. The database can be accessed freely at http://stimul.cognitivestudies.ru.

14.
16.
In this article, we present a database of orthographic neighbors for words that Spanish children read during elementary education. The reference dictionary for lexical entries and frequencies (which had its origin in Martínez & García, 2004) comprises approximately 100,000 words and is the result of accumulating the words read by a sample of children from first to sixth grades. Using the criterion for orthographic neighbors described by Coltheart, Davelaar, Jonasson, and Besner (1977), we present basic statistics related to neighborhood size as a function of the positions of divergent letters, the cumulative frequency of the neighbors, and the numbers of neighbors of higher, lower, and equal frequency. We also attempt to illustrate and unravel the nature of the relationships among the variables neighborhood size, length, and frequency in the distribution of neighbors. The database described in this article is available at www.psychonomic.org/archive.
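
The statistics described above can be sketched for a single target word. Under the Coltheart et al. (1977) criterion, a neighbor is a word of the same length that differs from the target in exactly one letter position. The frequency values and function names below are hypothetical and are not drawn from the database.

```python
def neighbors_by_position(word, lexicon):
    """Map each letter position to the neighbors that differ only at that position."""
    result = {i: [] for i in range(len(word))}
    for cand in lexicon:
        if len(cand) != len(word):
            continue
        diffs = [i for i, (a, b) in enumerate(zip(word, cand)) if a != b]
        if len(diffs) == 1:
            result[diffs[0]].append(cand)
    return result


def neighborhood_profile(word, freq):
    """Summarize neighborhood size and neighbor frequencies for `word`.

    `freq` maps every lexicon word (including `word`) to its frequency.
    """
    own = freq[word]
    by_pos = neighbors_by_position(word, freq)
    neighbors = [w for group in by_pos.values() for w in group]
    return {
        "size": len(neighbors),
        "size_by_position": {i: len(group) for i, group in by_pos.items()},
        "cumulative_frequency": sum(freq[w] for w in neighbors),
        "higher": sum(freq[w] > own for w in neighbors),
        "lower": sum(freq[w] < own for w in neighbors),
        "equal": sum(freq[w] == own for w in neighbors),
    }


# Hypothetical frequency counts, for illustration only.
FREQ = {"casa": 120, "cosa": 200, "cama": 40, "caso": 120, "masa": 15, "perro": 90}
profile = neighborhood_profile("casa", FREQ)
```

For "casa" this yields one neighbor at each of the four letter positions, with one neighbor of higher frequency, two of lower frequency, and one of equal frequency.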

17.
18.
In this study we present a self-organizing connectionist model of early lexical development. We call this model DevLex-II, based on the earlier DevLex model. DevLex-II can simulate a variety of empirical patterns in children's acquisition of words. These include a clear vocabulary spurt, effects of word frequency and length on age of acquisition, and individual differences as a function of phonological short-term memory and associative capacity. Further results from lesioned models indicate developmental plasticity in the network's recovery from damage, in a non-monotonic fashion. We attribute the network's abilities in accounting for lexical development to interactive dynamics in the learning process. In particular, variations displayed by the model in the rate and size of early vocabulary development are modulated by (a) input characteristics, such as word frequency and word length, (b) consolidation of lexical-semantic representation, meaning-form association, and phonological short-term memory, and (c) delayed processes due to interactions among timing, severity, and recoverability of lesion. Together, DevLex and DevLex-II provide an accurate computational account of early lexical development.

19.
In this article, we introduce HelexKids, an online written-word database for Greek-speaking children in primary education (Grades 1 to 6). The database is organized on a grade-by-grade basis, and on a cumulative basis by combining Grade 1 with Grades 2 to 6. It provides values for Zipf, frequency per million, dispersion, estimated word frequency per million, standard word frequency, contextual diversity, orthographic Levenshtein distance, and lemma frequency. These values are derived from 116 textbooks used in primary education in Greece and Cyprus, producing a total of 68,692 different word types. HelexKids was developed to assist researchers in studying language development, educators in selecting age-appropriate items for teaching, as well as writers and authors of educational books for Greek/Cypriot children. The database is open access and can be searched online at www.helexkids.org.
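
The grade-by-grade and cumulative organization can be sketched as a running sum over per-grade counts. Reading "cumulative" as Grade 1, Grades 1 to 2, and so on up to Grades 1 to 6 is an assumption about how HelexKids combines grades; the function name and toy counts are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(per_grade):
    """Turn per-grade word counts (ordered from Grade 1) into running totals."""
    return list(accumulate(per_grade))  # Counter + Counter adds counts per word


# Hypothetical per-grade counts for two words across three grades.
PER_GRADE = [
    Counter({"a": 2}),
    Counter({"a": 1, "b": 3}),
    Counter({"b": 1}),
]
cumulative = cumulative_counts(PER_GRADE)
```

Any of the per-grade statistics that are simple counts could be accumulated in the same way, whereas measures such as dispersion would need to be recomputed over the combined set of textbooks.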

20.