1.
Michael H. Siegel 《Behavior research methods》1969,1(8):289-290
The nature of liminal measurement is discussed, and the standard deviation is proposed as a suitable alternative measure to the limen.
2.
Maurice Lorr 《Psychometrika》1944,9(1):17-30
For an amount-limit test homogeneous as to content and varied as to difficulty it is established that an individual's number-right score and his limen score as estimated by the constant process are mathematically related. The experimental and the theoretic relationship between normal deviate and limen score are shown to be in good agreement. It is also found that the two methods of evaluating individual test performance yield equally reliable sets of scores for the procedures used. Accordingly where the assumptions basic to the relationship obtain, the more conveniently computed raw score may be considered to be as valid and reliable an index of individual test performance as the limen score. The concept of the dispersion parameter of the individual as a measure of change or error in test score found no experimental verification. Estimates of individual variability are unrelated to differences in score on equivalent forms. The writer gratefully acknowledges Lt. Colonel M. W. Richardson's invaluable counsel, Dr. H. Gulliksen's helpful suggestions, and Dr. H. H. Long's aid in administering the tests.
3.
Frank B. Baker 《Psychometrika》1961,26(2):239-246
Maximum likelihood estimates of item parameters of a scholastic aptitude test were computed using the normal and logistic models. The goodness of fit of ogives specified by the pairs of item parameters to the observed data was determined for all items. While negligible differences in the limen values were found, differences in item discrimination indices indicated that interpretation of these indices requires separate frames of reference. The empirical results showed the logistic model to be a useful alternative to the normal model in item analysis.
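As a rough illustration of how the two ogives relate, the sketch below compares a normal ogive with a logistic curve for one item. The item parameters are hypothetical, and the scaling constant D = 1.702 is the conventional value used to bring the logistic curve close to the normal ogive; it is not taken from the abstract. Without such a constant, discrimination values from the two models sit on different scales, which is one plausible reading of the "separate frames of reference" remark.

```python
import math

def normal_ogive(theta, a, b):
    """P(correct) under the normal ogive model: Phi(a * (theta - b))."""
    z = a * (theta - b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_ogive(theta, a, b, D=1.702):
    """P(correct) under the logistic model with scaling constant D."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: discrimination a = 1.2, limen (difficulty) b = 0.5.
a, b = 1.2, 0.5

# Largest gap between the two curves over an ability grid from -4 to 4.
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(normal_ogive(t, a, b) - logistic_ogive(t, a, b)) for t in grid)

# Both curves pass through 0.5 at the limen, so limen values agree by construction.
at_limen = (normal_ogive(b, a, b), logistic_ogive(b, a, b))
print(max_gap, at_limen)
```

With the scaling constant in place the two curves stay within about 0.01 of each other, so any difference in fit between the models is expected to be small.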
4.
While most validity indices are based on total test scores, this paper describes a method for quantifying the construct validity of items. The approach is based on the item selection technique originally described by Piazza in 1980. Unfortunately, Piazza's P2 index suffers from some substantial limitations. The Dm coefficient provides an alternative which can be used for item selection and provides a validity index for a set of items. The index is similar to that of traditional criterion-related validity indices. Criterion-related validity is used to demonstrate the accuracy of hypothesized relations of the measure with outcome variables of interest in research and practice. This method may be useful when the sample of items or persons is small, rendering more traditional approaches such as factor analysis or item response theory inappropriate. An example of how to use the technique is provided.
5.
It is noted that the average inter-item correlation, which represents the internal consistency of a test, yields a unique estimate of test reliability. A close approximation to this average is given by a formula which requires the correlation of each item with the total score and the standard deviation of each item. The formula is especially useful in those instances where the number of items is small and where the variation in item sigmas should not be neglected.
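The link between the average inter-item correlation and test reliability can be sketched with the generalized Spearman-Brown step-up. This is the standard textbook relationship, not the approximation formula the abstract refers to (which is not given here); the correlation matrix is made up for illustration.

```python
def average_inter_item_correlation(corr):
    """Mean of the off-diagonal entries of a square item correlation matrix."""
    k = len(corr)
    off_diag = [corr[i][j] for i in range(k) for j in range(k) if i != j]
    return sum(off_diag) / len(off_diag)

def stepped_up_reliability(mean_r, k):
    """Generalized Spearman-Brown: reliability of a k-item test
    whose items intercorrelate mean_r on average."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)

# Hypothetical 3-item correlation matrix.
corr = [
    [1.0, 0.3, 0.4],
    [0.3, 1.0, 0.2],
    [0.4, 0.2, 1.0],
]
mean_r = average_inter_item_correlation(corr)
reliability = stepped_up_reliability(mean_r, len(corr))
print(mean_r, reliability)
```

Because the step-up depends only on the mean correlation and the number of items, a single average yields a single reliability estimate.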
6.
7.
In complex three-dimensional mental rotation tasks males have been reported to score up to one standard deviation higher than females. However, this effect size estimate could be compromised by the presence of gender bias at the item level, which calls the validity of purely quantitative performance comparisons into question. We hypothesized that the effect of gender bias at the level of distinct item design features could lead to either an over- or underestimation of reported effect sizes of the gender difference in three-dimensional mental rotation. Using automatic item generation we conducted a series of psychometric experiments in which we independently manipulated one out of four different item design features that have exhibited a gender bias in previous studies (study 1). This was done in a between-subjects design. The results indicated that gender bias caused by item design features linked to the perceptual stage of mental rotation led to an overestimation of the effect size of the gender difference, while item design features associated with the encoding and transformational stage resulted in an underestimation of the effect size of the gender difference. In study 2 we tested the hypothesis that the gender difference still remains while controlling for the item design features causing gender bias. The results suggest that a significant portion of the gender difference may be attributable to perceptual and encoding processes involved in mental rotation.
8.
Indexes of skewness and kurtosis for a test-score distribution are expressed in terms of item parameters. Both are shown to depend, in part, on item means, variances, and covariances. The index of skewness depends also on trivariances. A trivariance is a product moment involving first powers of deviation scores for three items. The index of kurtosis depends on quadrivariances, as well as trivariances. A quadrivariance is a product moment involving first powers of deviation scores for four items. Empirical data are presented for responses of groups of subjects to 25 triads and 25 tetrads of items from five tests. Certain parts of this article represent the results of doctoral research conducted by Hundleby and Goldstein under the direction of Ray in the Department of Psychology at Pennsylvania State University. The authors are indebted to Professor Lester Guest and Professor William Lepley for their supervisory assistance in the final stages of the two dissertations during the absence of the senior author.
9.
10.
11.
Frederic M. Lord 《Psychometrika》1958,23(4):291-296
Guttman's principal components for the weighting system are the item scoring weights that maximize the generalized Kuder-Richardson reliability coefficient. The principal component for any item is effectively the same as the factor loading of the item divided by the item standard deviation, the factor loadings being obtained from an ordinary factor analysis of the item intercorrelation matrix.
12.
GULLIKSEN H 《Psychometrika》1950,15(3):259-269
Some methods are presented for estimating the reliability of a partially speeded test without the use of a parallel form. The effect of these formulas on some test data is illustrated. Whenever an odd-even reliability is computed it is probably desirable to use one of the formulas noted in Section 2 of this paper in addition to the usual Spearman-Brown correction. Since the formulas given here involve the mean and the standard deviation of the “number unattempted score,” a method is given in Section 4 for computing this mean and standard deviation from item analysis data. If the item analysis data are available, this method will save considerable time as compared with rescoring answer sheets.
13.
In this article, four item selection methods in computerized adaptive testing are examined in terms of classification accuracy and consistency, including two popular heuristics for constraint management, the maximum priority index (MPI) method and the weighted deviation modeling method, as well as the widely known maximum Fisher information method and randomized item selection as baselines. Results suggest that the MPI method is able to meet constraints and keep test overlap rate low. Among the four methods, it is the only one that manages to produce parallel forms in terms of content coverage and, consequently, the only method to which the idea of classification consistency applies. With tests as short as 12 items, the MPI method does fairly well in classifying examinees accurately and consistently. Its performance improves with longer tests. The effects of number of decision categories and cut score locations are also examined. Recommendations are made in the Discussion section.
14.
Catherine S. Clause, Morell E. Mullins, Marguerite T. Nee, Elaine Pulakos, Neal Schmitt 《Personnel Psychology》1998,51(1):193-208
A procedure for developing alternate test forms that are parallel in the sense that scores on the different forms have similar means, standard deviations, and factor structures is described and applied to a bio-data inventory and a situational judgment test. Careful consideration of item-by-item parallelism during development resulted in alternate forms that were parallel at the item level. Further, comparison with a biodata test form comprised of items randomly selected from a pool of biodata items revealed that for the types of measures described here it may be necessary to produce parallel forms of each item to create alternate forms that are parallel in the way in which Cronbach (1947) originally defined parallelism.
15.
The Spearman-Kärber method can be used to estimate the threshold value or difference limen in two-alternative forced-choice tasks. This method yields a simple estimator for the difference limen and its standard error, so that both can be calculated with a pocket calculator. In contrast to previous estimators, the present approach does not require any assumptions about the shape of the true underlying psychometric function. The performance of this new nonparametric estimator is compared with the standard technique of probit analysis. The Spearman-Kärber method appears to be a valuable addition to the toolbox of psychophysical methods, because it is most accurate for estimating the mean (i.e., absolute and difference thresholds) and dispersion of the psychometric function, although it is not optimal for estimating percentile-based parameters of this function.
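A minimal sketch of a Spearman-Kärber style calculation follows, under stated assumptions: the response proportions are already monotone and rise from 0 at the lowest stimulus level to 1 at the highest, and the response distribution is treated as uniform within each interval between adjacent levels. The function name and the data are hypothetical; the abstract does not give the authors' exact estimator or its standard error, so neither is reproduced here.

```python
import math

def spearman_karber(levels, proportions):
    """Distribution-free mean and SD of a psychometric function.

    levels: strictly increasing stimulus values.
    proportions: observed response proportions, monotone from 0.0 to 1.0.
    """
    if len(levels) != len(proportions) or len(levels) < 2:
        raise ValueError("need matching sequences with at least two levels")
    if any(b <= a for a, b in zip(levels, levels[1:])):
        raise ValueError("levels must be strictly increasing")
    if proportions[0] != 0.0 or proportions[-1] != 1.0:
        raise ValueError("proportions must run from 0.0 to 1.0")
    if any(b < a for a, b in zip(proportions, proportions[1:])):
        raise ValueError("proportions must be non-decreasing")

    mean = 0.0
    second_moment = 0.0
    for i in range(1, len(levels)):
        mass = proportions[i] - proportions[i - 1]   # probability in this interval
        lo, hi = levels[i - 1], levels[i]
        mid = (lo + hi) / 2.0
        mean += mass * mid
        # Second moment of a uniform density on [lo, hi].
        second_moment += mass * (mid * mid + (hi - lo) ** 2 / 12.0)
    variance = second_moment - mean * mean
    return mean, math.sqrt(variance)

# Hypothetical data: five stimulus levels and the proportion of "greater" responses.
levels = [1.0, 2.0, 3.0, 4.0, 5.0]
proportions = [0.0, 0.1, 0.5, 0.9, 1.0]
threshold, spread = spearman_karber(levels, proportions)
print(threshold, spread)
```

Each interval contributes its midpoint weighted by the increase in response proportion across it, which is why the whole calculation needs only sums and products.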
17.
Computerized adaptive testing under nonparametric IRT models
Nonparametric item response models have been developed as alternatives to the relatively inflexible parametric item response models. An open question is whether it is possible and practical to administer computerized adaptive testing with nonparametric models. This paper explores the possibility of computerized adaptive testing when using nonparametric item response models. A central issue is that the derivatives of item characteristic curves may not be estimated well, which eliminates the availability of the standard maximum Fisher information criterion. As alternatives, procedures based on Shannon entropy and Kullback–Leibler information are proposed. For a long test, these procedures, which do not require the derivatives of the item characteristic curves, become equivalent to the maximum Fisher information criterion. A simulation study is conducted to study the behavior of these two procedures, compared with random item selection. The study shows that the procedures based on Shannon entropy and Kullback–Leibler information perform similarly in terms of root mean square error, and perform much better than random item selection. The study also shows that item exposure rates need to be addressed for these methods to be practical.
The authors would like to thank Hua Chang for his help in conducting this research.
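One way an entropy-based selection rule can avoid derivatives of the item characteristic curves is sketched below. It assumes ability is discretized onto a small grid and that each item's curve is available only as a table of success probabilities on that grid; the function names, the grid, and the item tables are hypothetical, and this is not claimed to be the procedure used in the paper.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def posterior(prior, icc, response):
    """Update the ability distribution after a response (1 = correct, 0 = incorrect)."""
    like = [p if response == 1 else 1.0 - p for p in icc]
    unnorm = [w * l for w, l in zip(prior, like)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_posterior_entropy(prior, icc):
    """Average entropy of the updated distribution over the two possible responses."""
    p_correct = sum(w * p for w, p in zip(prior, icc))
    expected = 0.0
    if p_correct > 0.0:
        expected += p_correct * entropy(posterior(prior, icc, 1))
    if p_correct < 1.0:
        expected += (1.0 - p_correct) * entropy(posterior(prior, icc, 0))
    return expected

def select_item(prior, iccs, administered):
    """Pick the unused item whose response is expected to reduce uncertainty most."""
    candidates = [j for j in range(len(iccs)) if j not in administered]
    return min(candidates, key=lambda j: expected_posterior_entropy(prior, iccs[j]))

# Hypothetical three-point ability grid with a uniform prior.
prior = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
# Tabulated success probabilities per item at each grid point (no functional form).
iccs = [
    [0.50, 0.50, 0.50],  # carries no information about ability
    [0.10, 0.50, 0.90],  # separates the grid points well
    [0.45, 0.50, 0.55],  # separates them only weakly
]
first_pick = select_item(prior, iccs, set())
second_pick = select_item(prior, iccs, {first_pick})
print(first_pick, second_pick)
```

Only the tabulated probabilities enter the calculation, so the rule works with any estimated curve, smooth or not.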
18.
With the advent of web-based technology, online testing is becoming a mainstream mode in large-scale educational assessments. Most online tests are administered continuously in a testing window, which may pose test security problems because examinees who take the test earlier may share information with those who take the test later. Researchers have proposed various statistical indices to assess test security, and one of the most often used is the average test-overlap rate, which was further generalized to the item pooling index (Chang & Zhang, 2002, 2003). These indices, however, are all defined as means (that is, the expected proportion of common items among examinees), and they were originally proposed for computerized adaptive testing (CAT). Recently, multistage testing (MST) has become a popular alternative to CAT. The unique features of MST make it important to report not only the mean, but also the standard deviation (SD) of test overlap rate, as we advocate in this paper. The standard deviation of test overlap rate adds important information to the test security profile, because for the same mean, a large SD reflects that certain groups of examinees share more common items than other groups. In this study, we analytically derived the lower bounds of the SD under MST, with the results under CAT as a benchmark. It is shown that when the mean overlap rate is the same between MST and CAT, the SD of test overlap tends to be larger in MST. A simulation study was conducted to provide empirical evidence. We also compared the security of MST under the single-pool versus the multiple-pool designs; both analytical and simulation studies show that the non-overlapping multiple-pool design will slightly increase the security risk.
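The mean and SD of the test overlap rate can be illustrated empirically. The sketch assumes fixed-length tests and defines the overlap rate for a pair of examinees as the number of shared items divided by test length, with the SD taken over all examinee pairs; that pairwise definition and the item assignments are assumptions for illustration, not the paper's analytical derivation.

```python
from itertools import combinations
from statistics import mean, pstdev

def pairwise_overlap_rates(tests):
    """Overlap rate for every pair of examinees, assuming equal test lengths."""
    length = len(tests[0])
    if any(len(t) != length for t in tests):
        raise ValueError("all tests must have the same length")
    return [len(set(a) & set(b)) / length for a, b in combinations(tests, 2)]

# Hypothetical item IDs administered to four examinees (test length 4).
tests = [
    [1, 2, 3, 4],
    [1, 2, 5, 6],
    [7, 8, 9, 10],
    [1, 2, 3, 4],
]
rates = pairwise_overlap_rates(tests)
mean_overlap = mean(rates)
sd_overlap = pstdev(rates)
print(rates, mean_overlap, sd_overlap)
```

In this made-up example the mean overlap is moderate, but the large SD reveals that one pair of examinees saw identical tests while other pairs shared nothing, which is the kind of pattern a mean alone would hide.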
19.
Reported estimates of the frequency difference limen (DL) for tones show considerable variability. To determine the extent to which the differences depend on psychophysical method, three estimates of the DL at 1,000 Hz were obtained from the same subjects for each of three psychophysical procedures. The three estimates were: (1) the standard deviation of final settings in a method of adjustment, (2) the average of several reversals in an adaptive two-interval forced-choice procedure, and (3) the 76%-correct point in a two-interval forced-choice procedure using constant stimuli. The two forced-choice procedures yielded very similar DLs. The adjustment procedure yielded significantly smaller estimates. Possible reasons for the different values produced by adjustment procedures and the nature of the underlying decision process are discussed.