Similar literature
1.
This paper discusses the influence of test difficulty on the correlation between test items and between tests. The greater the difference in difficulty between two test items or between two tests, the smaller the maximum correlation between them. In general, the greater the number of degrees of difficulty among the items in a test or among the tests in a battery, the higher the rank of the matrix of intercorrelations; that is, differences in difficulty are represented in the factorial configuration as additional factors. The suggestion is made that if all tests included in a battery are roughly homogeneous with respect to difficulty, existing hierarchies will be more clearly defined and a meaningful psychological interpretation of the factors more readily attained.
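
For dichotomous items, the ceiling can be made concrete: once the proportions passing are fixed, the phi coefficient (Pearson r on 0/1 scores) cannot exceed a bound that shrinks as the two difficulties move apart. The abstract does not name the coefficient, so this minimal Python sketch assumes phi; `phi_max` is a hypothetical helper name.

```python
import math

def phi_max(p_i: float, p_j: float) -> float:
    """Largest attainable phi between two 0/1 items with pass rates p_i, p_j.

    The bound is reached when everyone who passes the harder item
    also passes the easier one.
    """
    if not (0 < p_i < 1 and 0 < p_j < 1):
        raise ValueError("proportions must lie strictly between 0 and 1")
    lo, hi = sorted((p_i, p_j))  # lo = harder item (fewer passes)
    return math.sqrt((lo * (1 - hi)) / (hi * (1 - lo)))

for p_j in (0.5, 0.6, 0.7, 0.9):
    print(f"p_i=0.5, p_j={p_j}: phi_max={phi_max(0.5, p_j):.3f}")
```

With one item fixed at p = .50, the ceiling falls from 1.000 (equal difficulty) through 0.816 and 0.655 to 0.333 when the other item has p = .90.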

2.
An equation is derived for predicting the effect of chance success, relative to item difficulty, on item-test correlation. The values predicted by this equation, and by equations derived by Guilford and Carroll for predicting the effect of chance success on item difficulty and test reliability, are compared with empirical values in an experiment that used identical test items in multiple-choice and answer-only form.
Condensation of a dissertation presented in partial fulfillment of the requirements for the Ph.D. degree to the University of Chicago. Grateful acknowledgment is made to Professor Harold Gulliksen for his guidance as thesis advisor and to Professor L. L. Thurstone and Dr. D. W. Fiske of the University of Chicago, who served as members of the thesis committee. The author is also indebted to Professor S. S. Wilks for review of the derivations and development of statistical tests used in the thesis, to Dr. L. R. Tucker for technical advice, and to Dr. W. G. Mollenkopf for critical comments on the derivations and interpretations. The writer expresses appreciation to the Educational Testing Service for making available its technical facilities, and to the University of Chicago for the flexible administrative arrangement that made this thesis possible.

3.
Liu Yue, Liu Hongyun. Journal of Psychological Science (心理科学), 2015, (6): 1504-1512
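This study explores the use of the constructed-anchor-test method to equate test scores when no anchor items are available. Studies 1 and 2 examined, respectively, how errors in the ordering of item difficulty and differences in anchor-item difficulty affect the constructed-anchor-test method. The results showed that: (1) under equivalent-groups conditions, the equating error of the constructed-anchor-test method increased steadily as the proportion of erroneous anchor items, the degree of difficulty-ordering error, and the anchor-item difficulty difference increased, whereas the equating error of the random-groups method remained relatively stable; under nonequivalent-groups conditions, the constructed-anchor-test method always produced smaller equating errors than the random-groups method; (2) for the constructed-anchor-test method under nonequivalent-groups conditions, the shorter the anchor test, the larger the equating error.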
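
The abstract does not state which equating function either method uses. As a reference point for the random-groups baseline only, here is a minimal Python sketch of linear equating under a random-groups design, which maps a Form X score to the Form Y score with the same z-score. The function name and score lists are hypothetical, and the constructed-anchor-test procedure itself is not reproduced.

```python
from statistics import mean, pstdev

def linear_equate(x_scores, y_scores):
    """Return a function mapping Form X raw scores onto the Form Y scale.

    Assumes the two forms were taken by randomly equivalent groups, so
    matching the mean and SD of the two score distributions is justified.
    """
    mx, sx = mean(x_scores), pstdev(x_scores)
    my, sy = mean(y_scores), pstdev(y_scores)
    if sx == 0:
        raise ValueError("Form X scores have zero variance")
    return lambda x: my + (sy / sx) * (x - mx)

# Hypothetical score samples from two random groups
to_y = linear_equate([10, 12, 14, 16, 18], [20, 24, 28, 32, 36])
print(to_y(14), to_y(16))
```

Equipercentile equating is the usual alternative when the two score distributions differ in shape, not just in location and spread.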

4.
Under assumptions that will hold for the usual test situation, it is proved that test reliability and variance increase (a) as the average inter-item correlation increases, and (b) as the variance of the item difficulty distribution decreases. As the average item variance increases, the test variance will increase, but the test reliability will not be affected. (It is noted that as the average item variance increases, the average item difficulty approaches .50.) In this development, no account is taken of the effect of chance success or of the possible effect of different item difficulty distributions on student attitude. To maximize the reliability and variance of a test, the items should have high intercorrelations, all items should be of the same difficulty level, and that level should be as near to 50% as possible.
The desirability of determining this relationship has been indicated by previous writers. Work on the present paper arose out of some problems raised by Dr. Herbert S. Conrad in connection with an analysis of aptitude tests. On leave for Government war research from the Psychology Department, University of Chicago.

5.
Analyses at the level of individual items were conducted on 11 data sets representing various combinations of participant samples and tests of reasoning. The magnitude of the relations between age and solution accuracy did not vary systematically across a wide range of item difficulty, although there was some evidence for independent age-related influences on the more difficult items. The results were tentatively interpreted as reflecting the operation of at least two types of age-related effects on tests of reasoning: one common to all items and one sensitive to the greater processing demands associated with more difficult items.

6.
Horst P. Psychometrika, 1951, 16(2): 189-202
Given a fixed amount of total testing time, it is important to know how long each test in the battery should be so that the correlation of the battery with the criterion will be a maximum. The precise solution for the test lengths depends on the particular set of conditions specified. The writer has previously presented solutions for two sets of conditions; this article presents the solution for a third set: (1) the total number of items or testing time is fixed; (2) the score is the total number of items answered correctly; (3) the test lengths are determined so that the correlation of total score with the criterion is a maximum. The solutions for the two previous sets of conditions, together with the current set, are summarized. A set of experimental data is submitted to each solution, and the three sets of results are compared.
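
Horst's solution is analytic and is not given in the abstract. The same three conditions can be illustrated with a brute-force search in place of his derivation: split a fixed number of time units among the tests, score the battery as an unweighted total, and keep the split with the highest correlation with the criterion. This minimal Python sketch assumes the units within each test are parallel (Spearman-Brown model); all function names and numbers are hypothetical.

```python
import itertools
import math

def battery_validity(lengths, unit_var, unit_rel, unit_cov, unit_cov_c, var_c):
    """Correlation of the unweighted total score with the criterion.

    lengths[i]     : number of base units (e.g. blocks of items) given to test i
    unit_var[i]    : variance of one unit of test i
    unit_rel[i]    : correlation between two parallel units of test i
    unit_cov[i][j] : covariance of a unit of test i with a unit of test j
                     (diagonal entries are ignored)
    unit_cov_c[i]  : covariance of a unit of test i with the criterion
    """
    k = len(lengths)
    total_var = 0.0
    for i in range(k):
        n_i = lengths[i]
        # n_i unit variances plus n_i*(n_i-1) covariances between parallel units
        total_var += n_i * unit_var[i] + n_i * (n_i - 1) * unit_rel[i] * unit_var[i]
        for j in range(k):
            if j != i:
                total_var += n_i * lengths[j] * unit_cov[i][j]
    cov_with_c = sum(n * c for n, c in zip(lengths, unit_cov_c))
    return cov_with_c / math.sqrt(total_var * var_c)

def best_allocation(total_units, unit_var, unit_rel, unit_cov, unit_cov_c, var_c):
    """Exhaustively try every integer split of total_units across the tests."""
    best_split, best_r = None, -math.inf
    for split in itertools.product(range(total_units + 1), repeat=len(unit_var)):
        if sum(split) != total_units:
            continue
        r = battery_validity(split, unit_var, unit_rel, unit_cov, unit_cov_c, var_c)
        if r > best_r:
            best_split, best_r = split, r
    return best_split, best_r

# Hypothetical two-test battery with 12 units of testing time in total
split, r = best_allocation(12, [1.0, 1.0], [0.5, 0.3],
                           [[0.0, 0.2], [0.2, 0.0]], [0.30, 0.25], 1.0)
print(split, round(r, 3))
```

Exhaustive search is only practical for a few tests and a coarse time unit; an analytic solution such as Horst's avoids the combinatorial cost.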

7.
A dilemma was created for factor analysts by Ferguson (Psychometrika, 1941, 6, 323–329) when he demonstrated that test items or sub-tests of varying difficulty will yield a correlation matrix of rank greater than 1 even though the material from which the items or sub-tests are drawn is homogeneous, whereas factor analysts had defined homogeneity of such material operationally as having a correlation matrix of rank 1. This dilemma has been resolved as a case of ambiguity, which lay in (1) failure to specify whether homogeneity was to apply to content, difficulty, or both, and (2) failure to state explicitly the kind of correlation to be used in obtaining the matrix. It is demonstrated that (1) if the material [...] but (2) if content is homogeneous but difficulty is not, the homogeneity of the content can be demonstrated only by using the tetrachoric correlation coefficient in deriving the matrix, and that the use of the phi coefficient (Pearsonian r) will disclose only the nonhomogeneity of the difficulty and lead to a series of constant error factors as contrasted with content factors. Since varying difficulty of items (and possibly of sub-tests) is desirable as well as practically unavoidable, it is recommended that all factor analysis problems be carried out with tetrachoric correlations. While no one would want to obtain the constant error factors by factor analysis (difficulty being more easily obtained by counting passes), their importance for test construction is pointed out.
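
The contrast between the phi and tetrachoric coefficients can be seen on a single 2 × 2 table. This minimal Python sketch uses the cosine-pi formula, a classical closed-form approximation to the tetrachoric coefficient, in place of a full maximum-likelihood estimate. The table is hypothetical: no one passes the hard item while failing the easy one, so the items are perfectly consistent despite differing in difficulty.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi (Pearson r on 0/1 scores) for a 2x2 table.

    a = pass both, b = pass item 1 only, c = pass item 2 only, d = fail both.
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    return (a * d - b * c) / denom

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation."""
    if b * c == 0:
        return 1.0   # no discordant cases in one direction: perfect association
    if a * d == 0:
        return -1.0
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))

# Hypothetical table: item 1 passed by 70%, item 2 by 30%, and nobody
# passes item 2 while failing item 1 (cell c = 0).
table = (30, 40, 0, 30)
print(round(phi_coefficient(*table), 3))   # 0.429
print(tetrachoric_cos_pi(*table))          # 1.0
```

Phi here equals the largest value possible for pass rates of .70 and .30, so the shortfall from 1 reflects only the difficulty difference; the tetrachoric estimate recovers the perfect underlying association, which is the behaviour the abstract describes.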

8.
The relation between item difficulty distributions and the validity and reliability of tests is computed through the use of normal correlation surfaces for varying numbers of items and varying degrees of item intercorrelation. Optimal or near-optimal item difficulty distributions are thus identified for various possible item difficulty distributions. The results indicate that, if a test is of conventional length, is homogeneous as to content, and has a symmetrical distribution of item difficulties, its correlation with a normally distributed perfect measure of the attribute common to the items does not vary appreciably with variation in the item difficulty distribution. Greater variation was evident in the correlation with a second, duplicate test (reliability). The general implications of these findings and their particular significance for evaluating techniques aimed at increasing reliability are considered.

9.
Paul Horst. Psychometrika, 1954, 19(4): 291-296
A formula is derived which gives the maximum expected correlation between two multiple-choice tests as a function of the distributions of proportions correct for the items in the two tests and the probability of chance success. The formula is similar to one derived by Carroll based on true item difficulties. A numerical example is provided.

10.
Wendy M. Yen. Psychometrika, 1985, 50(4): 399-410
When the three-parameter logistic model is applied to tests covering a broad range of difficulty, there frequently is an increase in mean item discrimination and a decrease in variance of item difficulties and traits as the tests become more difficult. To examine the hypothesis that this unexpected scale shrinkage effect occurs because the items increase in complexity as they increase in difficulty, an approximate relationship is derived between the unidimensional model used in data analysis and a multidimensional model hypothesized to be generating the item responses. Scale shrinkage is successfully predicted for several sets of simulated data.
The author is grateful to Robert Mislevy for kindly providing a copy of his computer program, RESOLVE.
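
For reference, the three-parameter logistic (3PL) model named in the abstract gives the probability of a correct response as c + (1 − c) / (1 + exp(−D·a·(θ − b))), where a is discrimination, b is difficulty, c is the lower asymptote, and D = 1.7 is the conventional scaling constant. A minimal Python sketch with hypothetical item parameters; Yen's unidimensional-to-multidimensional approximation is not reproduced here.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function.

    theta: examinee trait level; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing); D: scaling constant, where
    1.7 makes the logistic curve approximate the normal ogive.
    """
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# Hypothetical item: a = 1.2, b = 0.5, c = 0.2
for theta in (-2, -1, 0, 0.5, 1, 2):
    print(theta, round(p_correct_3pl(theta, a=1.2, b=0.5, c=0.2), 3))
```

At θ = b the probability is halfway between c and 1 (here 0.6), and a larger a makes the curve steeper around b.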

11.
Recent studies on animal mathematical abilities suggest that all vertebrates show comparable abilities when they are given spontaneous preference tests, such as selecting the larger number of food items, but that mammals and birds generally achieve much better performance than fish when tested with training procedures. At least part of these differences might be due to the fact that fish are usually trained with only one or two dozen trials while extensive training, sometimes with thousands of trials, is normally performed in studies of mammals and birds. To test this hypothesis, female guppies were trained on four consecutive numerical discriminations of increasing difficulty (from 2 vs. 3 to 5 vs. 6 items), with up to 120 trials with each discrimination. Five out of eight subjects discriminated all contrasts up to 4 versus 5 objects at levels significantly better than chance, a much higher limit than the 2 versus 3 limit previously reported in studies that provided fish with only short training sequences. Our findings indicate that the difference in numerical cognition between teleosts and warm-blooded vertebrates might be smaller than previously supposed.

12.
Two chimpanzees (Pan troglodytes) made numerousness judgments of nonvisible sets of items. In Experiment 1, 1-10 items were dropped 1 at a time into an opaque cup, and then an additional 1-10 items were dropped 1 at a time into another opaque cup. The chimpanzees' performance levels were high and were more dependent on factors indicative of an analogue-magnitude mechanism for representation of set size than on an object file mechanism. In Experiment 2, a 3rd visible set was made available after the sequential presentation of the first 2 sets. The chimpanzees again performed at high levels in selecting the largest of the 3 sets. In Experiment 3, 1 of the 2 initially presented sets was reduced in number by the sequential removal of 1, 2, or 3 items. Both chimpanzees performed above chance levels for the removal of 1, but not more than 1, item.

13.
The stimulus order effect refers to the finding that recall in complex span tasks is better when span lists begin with a longer processing task and end with a shorter task than when these processing tasks are presented in the reverse order. This study independently manipulated processing time and processing difficulty between Long-final and Short-final lists. The processing task required participants to solve arithmetic problems with either verbal (Experiment 1) or visuospatial (Experiment 2) materials. The memory items used in the storage task were either digits (verbal material) or dots-in-matrices (visuospatial materials). Storage of both verbal and visuospatial materials was sensitive to the change in processing difficulty, but not processing time. Furthermore, this study provides further evidence for the asymmetry of domain interference in working memory. The similarities and differences between verbal and visuospatial storage in working memory are discussed.

14.
Sampling fluctuations resulting from the sampling of test items rather than of examinees are discussed. It is shown that the Kuder-Richardson reliability coefficients actually are measures of this type of sampling fluctuation. Formulas for certain standard errors are derived; in particular, a simple formula is given for the standard error of measurement of an individual examinee's score. A common misapplication of the Wilks-Votaw criterion for parallel tests is pointed out. It is shown that the Kuder-Richardson formula-21 reliability coefficient should be used instead of the formula-20 coefficient in certain common practical situations.
Most of the work reported here was carried out under contract with the Office of Naval Research. The writer is indebted to Professor S. S. Wilks, who has checked over certain critical portions of a draft of this paper.
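
The two Kuder-Richardson coefficients differ in what they need: KR-20 uses each item's proportion correct, while KR-21 uses only the mean and variance of total scores and equals KR-20 only when all items are equally difficult. The abstract does not print its standard-error formula; the binomial-error form √(x(n − x)/(n − 1)) for a raw score x on n items is assumed here and may not match the paper exactly. A minimal Python sketch with a hypothetical 5 × 4 response matrix and population variances throughout:

```python
import math
from statistics import mean, pvariance

def kr20(responses):
    """KR-20 from 0/1 response vectors (one list per examinee)."""
    n, k = len(responses), len(responses[0])
    var_total = pvariance([sum(row) for row in responses])
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    return k / (k - 1) * (1 - sum(pi * (1 - pi) for pi in p) / var_total)

def kr21(responses):
    """KR-21: needs only the mean and variance of the total scores."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    m, var_total = mean(totals), pvariance(totals)
    return k / (k - 1) * (1 - m * (k - m) / (k * var_total))

def binomial_sem(x, n):
    """Assumed binomial-error SEM for raw score x on an n-item test."""
    return math.sqrt(x * (n - x) / (n - 1))

# Hypothetical 5 examinees x 4 items (1 = correct)
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(round(kr20(data), 3), round(kr21(data), 3))  # 0.8 0.667
print(round(binomial_sem(2, 4), 3))                # 1.155
```

Because these items vary in difficulty, KR-21 comes out lower than KR-20, which is the usual direction of the difference.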

15.
Approximate methods of solving for discriminant functions have been tried on three sets of data. The principal illustration is the problem of finding a weighted sum of scores on four psychological tests so that men and women may be distinguished most clearly. The work starts from the complete solution, due to R. A. Fisher, in which it is necessary to solve as many simultaneous equations, dependent on the standard deviations of the tests and their mutual correlations, as there are tests. It is proposed, by way of numerical simplification, that a set of equations be substituted in which a single quantity replaces all the correlations. A solution is obtained in which the weights to be assigned to the tests are very simply expressed in terms of the differences between the mean values of the tests, the standard deviations of the tests, and this quantity. The difficulty remains of finding an estimate of the arbitrary constant that will give good discrimination. If an optimal solution is made, the result, in the three sets of data considered, is almost indistinguishable from that yielded by the complete solution. The calculation of this optimal common quantity is, however, itself so laborious that another estimate, previously suggested by R. W. B. Jackson, appears more profitable. This estimate is derived simply from the variability between the total scores for each subject and the variability of each test. Using this estimate, the discriminant functions can be rapidly calculated; the results compare very favorably, in the case of the data considered, with those from the complete solution.
The present work was done while the writer was employed by the Ontario Department of Health.
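
Replacing every correlation with one common value ρ is what makes the weights simple. Fisher's weights solve Σw = d, where d holds the mean differences. If Σ = DRD, with D the diagonal matrix of standard deviations and R = (1 − ρ)I + ρJ, then R has a closed-form inverse and no simultaneous equations remain. A minimal Python sketch of that algebra with hypothetical inputs; neither the optimal estimate of ρ nor Jackson's estimate is implemented.

```python
def common_r_discriminant(mean_diff, sds, rho):
    """Discriminant weights when every inter-test correlation is set to rho.

    mean_diff[i]: group-1 mean minus group-2 mean on test i
    sds[i]      : within-group standard deviation of test i
    rho         : the common correlation, with -1/(k-1) < rho < 1
    Uses R^-1 = [I - rho/(1 + (k-1) rho) J] / (1 - rho) for the
    equicorrelation matrix, then rescales by the standard deviations.
    """
    k = len(mean_diff)
    z = [d / s for d, s in zip(mean_diff, sds)]  # standardized mean differences
    shrink = rho / (1 + (k - 1) * rho)
    z_total = sum(z)
    return [(zi - shrink * z_total) / ((1 - rho) * s) for zi, s in zip(z, sds)]

# Hypothetical four-test example
weights = common_r_discriminant([2.0, -1.0, 0.5, 1.5], [4.0, 3.0, 2.0, 5.0], rho=0.3)
print([round(w, 4) for w in weights])
```

Each weight depends only on its own test's standardized mean difference, the sum of all standardized differences, and ρ, which matches the simplicity the abstract describes.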

16.
Hooded crows were trained in two-alternative simultaneous matching and oddity tasks with stimulus sets of three different categories: color (black and white), shape (Arabic Numerals 1 and 2, which were used as visual shapes only), and number of elements (arrays of one and two items). These three sets were used for training successively and repeatedly; the stimulus set was changed to the next one after the criterion (80% correct or better over 30 consecutive trials) was reached with the previous one. Training was continued until the criterion could be reached within the first 30 to 50 trials for each of the three training sets. During partial transfer tests, familiar stimuli (numerals and arrays in the range from 1 to 2) were paired with novel ones (numerals and arrays in the range from 3 to 4). At the final stage of testing only novel stimuli were presented (numerals and arrays in the range from 5 to 8). Four of 6 birds were able to transfer in these tests, and their performance was significantly above chance. Moreover, performance of the birds on the array stimuli did not differ from their performance on the color or shape stimuli. They were capable of recognizing the number of elements in arrays and comparing the stimuli by this attribute. It was concluded that crows were able to apply the matching (or oddity) concept to stimuli of numerical category.
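
To see how stringent the training criterion is, one can ask how often a bird choosing at random in a two-alternative task (chance = 0.5) would score at least 80% (24 of 30) correct. The abstract does not say which statistical test was used. This minimal Python sketch computes an exact one-sided binomial tail and assumes independent trials; `p_at_least` is a hypothetical helper name.

```python
from math import comb

def p_at_least(successes, trials, p_chance=0.5):
    """Exact one-sided binomial probability of >= successes under chance."""
    return sum(comb(trials, k) * p_chance ** k * (1 - p_chance) ** (trials - k)
               for k in range(successes, trials + 1))

# Criterion from the abstract: 80% correct over 30 trials = 24 of 30
print(f"{p_at_least(24, 30):.5f}")  # about 0.0007
```

A random chooser would meet the criterion in fewer than one in a thousand 30-trial blocks. Real training sessions are not strictly independent trials, so this is only a rough benchmark.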

17.
Stimulus sets defined in terms of artificial polymorphous concepts have frequently been used in experiments to investigate the mechanisms of discrimination of natural concepts, both in humans and in other animals. However, such stimulus sets are frequently difficult for either animals or humans to discriminate. Properties of artificial polymorphous stimulus sets that might explain this difficulty include the complexity of the individual stimuli, the unreliable reinforcement of individual positive features, attentional load, difficulties in discriminating some stimulus dimensions, memory load, and a lack of the correlation between features that characterizes natural concepts. An experiment using chickens as subjects and complex artificial visual stimulus sets investigated these hypotheses by training the birds in discriminations that were not polymorphous but did have some of the properties listed above. Discriminations that involved unreliable reinforcement or high attentional load were found to approach the difficulty of polymorphous concept discriminations, and these two factors together were sufficient to account for the entire difficulty. The usual kind of artificial polymorphous concept may not be a good model for natural concepts as they are perceived and discriminated by birds. A RULEX account of natural concept learning may be preferable.

19.
The evaluation of the level of difficulty of a test item is ordinarily derived from the proportion of a specified population passing or failing the item. With items that have a limited number of alternative responses there must be a correction in this proportion to make allowance for chance success. A table of corrected proportions is given for different numbers of alternatives varying from two to eight.
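
One common correction assumes that examinees who do not know the answer guess at random among the k alternatives. Under that assumption the observed pass rate is p = p_c + (1 − p_c)/k, which gives p_c = (kp − 1)/(k − 1). The abstract does not state which correction its table uses, so this minimal Python sketch illustrates the standard formula rather than reproducing the published table.

```python
def corrected_p(p, k):
    """Proportion passing corrected for chance success with k alternatives.

    Assumes non-knowers guess uniformly among all k alternatives.
    Observed p below 1/k yields a negative value (worse than chance).
    """
    if k < 2:
        raise ValueError("need at least two alternatives")
    return (k * p - 1) / (k - 1)

# A small table over the range in the abstract (two to eight alternatives)
for k in range(2, 9):
    print(k, [round(corrected_p(p, k), 2) for p in (0.5, 0.7, 0.9)])
```

The correction shrinks as k grows, because chance success is less likely when there are more alternatives.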

20.
The most common measure of agreement for categorical data is the coefficient kappa. However, kappa performs poorly when the marginal distributions are very asymmetric, it is not easy to interpret, and its definition is based on the hypothesis of independence of the responses (which is more restrictive than the hypothesis that kappa has a value of zero). This paper defines a new measure of agreement, delta, ‘the proportion of agreements that are not due to chance’, which comes from a model of multiple-choice tests and does not have these limitations. The paper shows that kappa and delta generally take very similar values, except when the marginal distributions are strongly unbalanced. The case of 2 × 2 tables (which admits very simple solutions) is considered in detail.
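
The weakness of kappa under asymmetric marginals is easy to reproduce. Kappa is (p_o − p_e)/(1 − p_e), where p_o is observed agreement and p_e is the agreement expected from the marginals under independence. In this minimal Python sketch, two hypothetical 2 × 2 tables both show 90% raw agreement but receive very different kappas. Delta is not implemented because the abstract does not give its formula.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table given as a list of rows."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_m = [sum(row) / n for row in table]
    col_m = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_m, col_m))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

balanced = [[45, 5], [5, 45]]  # 90% agreement, balanced marginals
skewed = [[90, 5], [5, 0]]     # 90% agreement, marginals .95 / .05
print(round(cohens_kappa(balanced), 3), round(cohens_kappa(skewed), 3))
```

The balanced table gives kappa = 0.8, while the skewed one gives a slightly negative kappa because chance agreement (0.905) already exceeds the observed 0.90.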
