Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
The nature of liminal measurement is discussed, and the standard deviation is proposed as a suitable alternative measure to the limen.

2.
For an amount-limit test homogeneous as to content and varied as to difficulty it is established that an individual's number-right score and his limen score as estimated by the constant process are mathematically related. The experimental and the theoretic relationship between normal deviate and limen score are shown to be in good agreement. It is also found that the two methods of evaluating individual test performance yield equally reliable sets of scores for the procedures used. Accordingly where the assumptions basic to the relationship obtain, the more conveniently computed raw score may be considered to be as valid and reliable an index of individual test performance as the limen score. The concept of the dispersion parameter of the individual as a measure of change or error in test score found no experimental verification. Estimates of individual variability are unrelated to differences in score on equivalent forms. The writer gratefully acknowledges Lt. Colonel M. W. Richardson's invaluable counsel, Dr. H. Gulliksen's helpful suggestions, and Dr. H. H. Long's aid in administering the tests.

3.
Maximum likelihood estimates of item parameters of a scholastic aptitude test were computed using the normal and logistic models. The goodness of fit of ogives specified by the pairs of item parameters to the observed data was determined for all items. While negligible differences in the limen values were found, differences in item discrimination indices indicated that interpretation of these indices requires separate frames of reference. The empirical results showed the logistic model to be a useful alternative to the normal model in item analysis.

4.
While most validity indices are based on total test scores, this paper describes a method for quantifying the construct validity of items. The approach is based on the item selection technique originally described by Piazza in 1980. Unfortunately, Piazza's P2 index suffers from some substantial limitations. The Dm coefficient provides an alternative which can be used for item selection and provides a validity index for a set of items. The index is similar to that of traditional criterion-related validity indices. Criterion-related validity is used to demonstrate the accuracy of hypothesized relations of the measure with outcome variables of interest in research and practice. This method may be useful when the sample of items or persons is small, rendering more traditional approaches such as factor analysis or item response theory inappropriate. An example of how to use the technique is provided.

5.
Babitz, Milton; Keys, Noel. Psychometrika, 1940, 5(4): 283-288
It is noted that the average inter-item correlation, which represents the internal consistency of a test, yields a unique estimate of test reliability. A close approximation to this average is given by a formula which requires the correlation of each item with the total score and the standard deviation of each item. The formula is especially useful in those instances where the number of items is small and where the variation in item sigmas should not be neglected.
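The link between the average inter-item correlation and test reliability can be illustrated with the generalized Spearman-Brown relation. The following is a minimal sketch, not the approximation formula of the paper (which the abstract does not give). The function names and the toy data are hypothetical, and the items are assumed to be scored on a common scale.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def mean_inter_item_r(items):
    """Average correlation over all distinct item pairs.

    `items` holds one list of person scores per item (hypothetical layout).
    """
    rs = [pearson(a, b) for a, b in combinations(items, 2)]
    return sum(rs) / len(rs)


def spearman_brown(mean_r, k):
    """Reliability of a k-item test implied by the average inter-item r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Toy data: 3 items answered by 4 persons (illustrative only).
items = [[0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
r_bar = mean_inter_item_r(items)
print(r_bar, spearman_brown(r_bar, len(items)))
```

Computing every pairwise correlation grows quadratically with the number of items, which is one reason a shortcut based on item-total correlations, as the abstract describes, is attractive.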

6.
7.
In complex three-dimensional mental rotation tasks males have been reported to score up to one standard deviation higher than females. However, this effect size estimate could be compromised by the presence of gender bias at the item level, which calls the validity of purely quantitative performance comparisons into question. We hypothesized that the effect of gender bias at the level of distinct item design features could lead to either an over- or underestimation of reported effect sizes of the gender difference in three-dimensional mental rotation. Using automatic item generation we conducted a series of psychometric experiments in which we independently manipulated one out of four different item design features that have exhibited a gender bias in previous studies (study 1). This was done in a between-subjects design. The results indicated that gender bias caused by item design features linked to the perceptual stage of mental rotation led to an overestimation of the effect size of the gender difference while item design features associated with the encoding and transformational stage resulted in an underestimation of the effect size of the gender difference. In study 2 we tested the hypothesis that the gender difference still remains while controlling for the item design features causing gender bias. The results suggest that a significant portion of the gender difference may be attributable to perceptual and encoding processes involved in mental rotation.

8.
Indexes of skewness and kurtosis for a test-score distribution are expressed in terms of item parameters. Both are shown to depend, in part, on item means, variances, and covariances. The index of skewness depends also on trivariances. A trivariance is a product moment involving first powers of deviation scores for three items. The index of kurtosis depends on quadrivariances, as well as trivariances. A quadrivariance is a product moment involving first powers of deviation scores for four items. Empirical data are presented for responses of groups of subjects to 25 triads and 25 tetrads of items from five tests. Certain parts of this article represent the results of doctoral research conducted by Hundleby and Goldstein under the direction of Ray in the Department of Psychology at Pennsylvania State University. The authors are indebted to Professor Lester Guest and Professor William Lepley for their supervisory assistance in the final stages of the two dissertations during the absence of the senior author.

9.
10.
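Dynamic and comprehensive item selection strategies for computerized adaptive testing with polytomous scoring   Cited by: 1 (self-citations: 0; citations by others: 1)
罗芬, 丁树良, 王晓庆. 《心理学报》 (Acta Psychologica Sinica), 2012, 44(3): 400-412
Polytomous scoring can provide more information about examinees and is one direction in which computerized adaptive testing is developing; item selection strategy is a research focus in computerized adaptive testing. For the graded response model under polytomous scoring, this paper uses the idea of interval estimation to improve several recently proposed item selection strategies, and extends the dichotomous b-STR and a-STR to polytomous scoring in order to improve the maximum information item selection strategy. Monte Carlo simulation experiments show that, while reaching or approaching the measurement precision of the original item selection strategies, some of the new strategies proposed here can effectively reduce test length, and some can greatly reduce item exposure rates.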

11.
Guttman's principal components for the weighting system are the item scoring weights that maximize the generalized Kuder-Richardson reliability coefficient. The principal component for any item is effectively the same as the factor loading of the item divided by the item standard deviation, the factor loadings being obtained from an ordinary factor analysis of the item intercorrelation matrix.

12.
Gulliksen, H. Psychometrika, 1950, 15(3): 259-269
Some methods are presented for estimating the reliability of a partially speeded test without the use of a parallel form. The effect of these formulas on some test data is illustrated. Whenever an odd-even reliability is computed it is probably desirable to use one of the formulas noted in Section 2 of this paper in addition to the usual Spearman-Brown correction. Since the formulas given here involve the mean and the standard deviation of the “number unattempted score,” a method is given in Section 4 for computing this mean and standard deviation from item analysis data. If the item analysis data are available, this method will save considerable time as compared with rescoring answer sheets.

13.
In this article, four item selection methods in computerized adaptive testing are examined in terms of classification accuracy and consistency, including two popular heuristics for constraint management, the maximum priority index (MPI) method and the weighted deviation modeling method, as well as the widely known maximum Fisher information method and randomized item selection as baselines. Results suggest that the MPI method is able to meet constraints and keep test overlap rate low. Among the four methods, it is the only one that manages to produce parallel forms in terms of content coverage and, consequently, the only method to which the idea of classification consistency applies. With tests as short as 12 items, the MPI method does fairly well in classifying examinees accurately and consistently. Its performance improves with longer tests. The effects of number of decision categories and cut score locations are also examined. Recommendations are made in the Discussion section.

14.
A procedure for developing alternate test forms that are parallel in the sense that scores on the different forms have similar means, standard deviations, and factor structures is described and applied to a bio-data inventory and a situational judgment test. Careful consideration of item-by-item parallelism during development resulted in alternate forms that were parallel at the item level. Further, comparison with a biodata test form comprised of items randomly selected from a pool of biodata items revealed that for the types of measures described here it may be necessary to produce parallel forms of each item to create alternate forms that are parallel in the way in which Cronbach (1947) originally defined parallelism.

15.
The Spearman-Kärber method can be used to estimate the threshold value or difference limen in two-alternative forced-choice tasks. This method yields a simple estimator for the difference limen and its standard error, so that both can be calculated with a pocket calculator. In contrast to previous estimators, the present approach does not require any assumptions about the shape of the true underlying psychometric function. The performance of this new nonparametric estimator is compared with the standard technique of probit analysis. The Spearman-Kärber method appears to be a valuable addition to the toolbox of psychophysical methods, because it is most accurate for estimating the mean (i.e., absolute and difference thresholds) and dispersion of the psychometric function, although it is not optimal for estimating percentile-based parameters of this function.

16.
The Spearman-Kärber method can be used to estimate the threshold value or difference limen in two-alternative forced-choice tasks. This method yields a simple estimator for the difference limen and its standard error, so that both can be calculated with a pocket calculator. In contrast to previous estimators, the present approach does not require any assumptions about the shape of the true underlying psychometric function. The performance of this new nonparametric estimator is compared with the standard technique of probit analysis. The Spearman-Kärber method appears to be a valuable addition to the toolbox of psychophysical methods, because it is most accurate for estimating the mean (i.e., absolute and difference thresholds) and dispersion of the psychometric function, although it is not optimal for estimating percentile-based parameters of this function.
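A rough illustration of why this estimator needs no more than a pocket calculator is sketched below. It implements the classic Spearman-Kärber mean as a sum of interval midpoints weighted by the increase in response proportion. The function name and data are hypothetical. The sketch assumes the proportions already rise monotonically from 0 to 1; how raw forced-choice data are brought into that form is not covered here. The spread value is a plain midpoint approximation chosen for this sketch, not necessarily the dispersion estimator of the paper.

```python
def spearman_karber(levels, p):
    """Mean and spread of the distribution whose CDF is sampled at `levels`.

    Assumptions of this sketch: `levels` is increasing and `p` is
    non-decreasing, starting at 0 and ending at 1.
    """
    if len(levels) != len(p) or len(levels) < 2:
        raise ValueError("need matching sequences with at least two points")
    if p[0] != 0.0 or p[-1] != 1.0:
        raise ValueError("proportions must start at 0 and end at 1")
    if any(b < a for a, b in zip(p, p[1:])):
        raise ValueError("proportions must be non-decreasing")

    first = 0.0   # running first moment
    second = 0.0  # running second moment (midpoint approximation)
    for i in range(len(levels) - 1):
        mass = p[i + 1] - p[i]                    # probability in interval
        mid = (levels[i] + levels[i + 1]) / 2.0   # interval midpoint
        first += mass * mid
        second += mass * mid ** 2
    variance = max(second - first ** 2, 0.0)
    return first, variance ** 0.5


# Illustrative data: five stimulus levels and response proportions.
mean, spread = spearman_karber([1, 2, 3, 4, 5], [0.0, 0.1, 0.5, 0.9, 1.0])
print(mean, spread)
```

No functional form (normal, logistic) is fitted at any point, which is the sense in which the abstract calls the approach nonparametric.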

17.
Computerized adaptive testing under nonparametric IRT models   Cited by: 1 (self-citations: 0; citations by others: 1)
Nonparametric item response models have been developed as alternatives to the relatively inflexible parametric item response models. An open question is whether it is possible and practical to administer computerized adaptive testing with nonparametric models. This paper explores the possibility of computerized adaptive testing when using nonparametric item response models. A central issue is that the derivatives of item characteristic curves may not be estimated well, which eliminates the availability of the standard maximum Fisher information criterion. As alternatives, procedures based on Shannon entropy and Kullback–Leibler information are proposed. For a long test, these procedures, which do not require the derivatives of the item characteristic curves, become equivalent to the maximum Fisher information criterion. A simulation study is conducted to study the behavior of these two procedures, compared with random item selection. The study shows that the procedures based on Shannon entropy and Kullback–Leibler information perform similarly in terms of root mean square error, and perform much better than random item selection. The study also shows that item exposure rates need to be addressed for these methods to be practical. The authors would like to thank Hua Chang for his help in conducting this research.
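To make the derivative-free idea concrete, the sketch below selects the next item by minimizing the expected Shannon entropy of the ability posterior on a discrete grid. It is a generic expected-entropy criterion under stated assumptions, not necessarily the exact procedure of the paper: items are dichotomous, each item characteristic curve is given as a table of correct-response probabilities on the same grid as the posterior, and all names are hypothetical.

```python
from math import log


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * log(q) for q in p if q > 0)


def expected_posterior_entropy(posterior, icc):
    """Expected entropy of the posterior after administering one item.

    `icc[j]` is P(correct | grid point j); only curve values are used,
    never their derivatives.
    """
    total = 0.0
    for correct in (True, False):
        like = [c if correct else 1.0 - c for c in icc]
        joint = [w * l for w, l in zip(posterior, like)]
        marginal = sum(joint)  # predictive probability of this response
        if marginal > 0:
            updated = [j / marginal for j in joint]
            total += marginal * entropy(updated)
    return total


def select_item(posterior, iccs, administered=()):
    """Index of the unused item with the smallest expected entropy."""
    candidates = [i for i in range(len(iccs)) if i not in administered]
    return min(candidates,
               key=lambda i: expected_posterior_entropy(posterior, iccs[i]))


# Two-point ability grid; item 0 is uninformative, item 1 separates the points.
posterior = [0.5, 0.5]
iccs = [[0.5, 0.5], [0.0, 1.0]]
print(select_item(posterior, iccs))
```

Because the criterion always prefers the most informative remaining item, a sketch like this would concentrate administrations on a few items, which is consistent with the abstract's remark that exposure rates need separate handling.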

18.
With the advent of web-based technology, online testing is becoming a mainstream mode in large-scale educational assessments. Most online tests are administered continuously in a testing window, which may pose test security problems because examinees who take the test earlier may share information with those who take the test later. Researchers have proposed various statistical indices to assess test security, and one of the most often used indices is the average test-overlap rate, which was further generalized to the item pooling index (Chang & Zhang, 2002, 2003). These indices, however, are all defined as means (that is, the expected proportion of common items among examinees) and they were originally proposed for computerized adaptive testing (CAT). Recently, multistage testing (MST) has become a popular alternative to CAT. The unique features of MST make it important to report not only the mean, but also the standard deviation (SD) of test overlap rate, as we advocate in this paper. The standard deviation of test overlap rate adds important information to the test security profile, because for the same mean, a large SD reflects that certain groups of examinees share more common items than other groups. In this study, we analytically derived the lower bounds of the SD under MST, with the results under CAT as a benchmark. It is shown that when the mean overlap rate is the same between MST and CAT, the SD of test overlap tends to be larger in MST. A simulation study was conducted to provide empirical evidence. We also compared the security of MST under the single-pool versus the multiple-pool designs; both analytical and simulation studies show that the non-overlapping multiple-pool design will slightly increase the security risk.
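The difference between reporting only the mean and reporting the mean together with the SD can be seen in a small computation. The sketch below takes the overlap rate for a pair of examinees to be the proportion of shared items, then summarizes all pairs. It assumes equal test lengths and uses the population SD; the paper's exact definitions may differ, and the function name and data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def overlap_rates(tests):
    """Proportion of shared items for every pair of examinees.

    `tests` holds one list of administered item IDs per examinee;
    all tests are assumed to have the same length.
    """
    rates = []
    for a, b in combinations(tests, 2):
        shared = len(set(a) & set(b))
        rates.append(shared / len(a))
    return rates


# Three examinees, four items each (illustrative item IDs).
tests = [[1, 2, 3, 4], [1, 2, 5, 6], [7, 8, 9, 10]]
rates = overlap_rates(tests)
print(mean(rates), pstdev(rates))
```

In this toy case one pair shares half of its items while the other pairs share none, so a modest mean hides a concentrated risk that only the SD reveals.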

19.
Reported estimates of the frequency difference limen (DL) for tones show considerable variability. To determine the extent to which the differences depend on psychophysical method, three estimates of the DL at 1,000 Hz were obtained from the same subjects for each of three psychophysical procedures. The three estimates were: (1) the standard deviation of final settings in a method of adjustment, (2) the average of several reversals in an adaptive two-interval forced-choice procedure, and (3) the 76%-correct point in a two-interval forced-choice procedure using constant stimuli. The two forced-choice procedures yielded very similar DLs. The adjustment procedure yielded significantly smaller estimates. Possible reasons for the different values produced by adjustment procedures and the nature of the underlying decision process are discussed.

20.
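丁树良, 毛萌萌, 汪文义, 罗芬, CUI Ying. 《心理学报》 (Acta Psychologica Sinica), 2012, 44(11): 1535-1546
Constructing a correct cognitive model is one of the keys to successful cognitive diagnosis. If a cognitive diagnostic test cannot represent this cognitive model completely and accurately, the validity of the test is in question. Attributes and their hierarchy can represent a cognitive model. Assuming the cognitive model is correct, a quantitative formula is given to measure the extent to which a cognitive diagnostic test represents the cognitive model; equivalence classes that contain more than one knowledge state, and the reasons they form, are analyzed; and modifications are suggested to the hierarchy consistency index (HCI) of Cui et al., in order to better examine the consistency between the data and the cognitive model specified by experts.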
