Found 20 similar documents; search took 15 ms.
1.
Analytic smoothing for equipercentile equating under the common item nonequivalent populations design (total citations: 1; self-citations: 0; citations by others: 1)
A cubic spline method for smoothing equipercentile equating relationships under the common item nonequivalent populations design is described. Statistical techniques based on bootstrap estimation are presented that are designed to aid in choosing an equating method/degree of smoothing. These include: (a) asymptotic significance tests that compare no equating and linear equating to equipercentile equating; (b) a scheme for estimating total equating error and for dividing total estimated error into systematic and random components. The smoothing technique and statistical procedures are explored and illustrated using data from forms of a professional certification test.
2.
In the design of common-item equating, two groups of examinees are administered separate test forms, and each test form contains a common subset of items. We consider test equating under this situation as an incomplete data problem; that is, examinees have observed scores on one test form and missing scores on the other. Through the use of statistical data-imputation techniques, the missing scores can be replaced by reasonable estimates, and consequently the forms may be directly equated as if both forms were administered to both groups. In this paper we discuss different data-imputation techniques that are useful for equipercentile equating; we also use empirical data to evaluate the accuracy of these techniques as compared with chained equipercentile equating. A paper presented at the European Meeting of the Psychometric Society, Barcelona, Spain, July, 1993.
3.
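This research explores how to equate test scores when no anchor items are available, using the constructed-anchor-test method. Study 1 and Study 2 examine, respectively, how errors in item difficulty ranking and differences in anchor item difficulty affect the constructed-anchor-test method. The results show: (1) under equivalent-groups conditions, as the proportion of erroneous anchor items, the degree of difficulty-ranking error, and the difference in anchor item difficulty increase, the equating error of the constructed-anchor-test method gradually increases, while the equating error of the random-equivalent-groups method remains fairly stable; under nonequivalent-groups conditions, the equating error of the constructed-anchor-test method is always smaller than that of the random-equivalent-groups method; (2) for the constructed-anchor-test method under nonequivalent-groups conditions, the shorter the anchor test, the larger the equating error.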
4.
5.
The Non-Equivalent groups with Anchor Test (NEAT) design involves missing data that are missing by design. Three nonlinear observed score equating methods used with a NEAT design are the frequency estimation equipercentile equating (FEEE), the chain equipercentile equating (CEE), and the item-response-theory observed-score-equating (IRT OSE). These three methods each make different assumptions about the missing data in the NEAT design. The FEEE method assumes that the conditional distribution of the test score given the anchor test score is the same in the two examinee groups. The CEE method assumes that the equipercentile functions equating the test score to the anchor test score are the same in the two examinee groups. The IRT OSE method assumes that the IRT model employed fits the data adequately, and the items in the tests and the anchor test do not exhibit differential item functioning across the two examinee groups. This paper first describes the missing data assumptions of the three equating methods. Then it describes how the missing data in the NEAT design can be filled in a manner that is coherent with the assumptions made by each of these equating methods. Implications for equating are also discussed.
6.
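Equating methods based on classical test theory (CTT) are mainly of two kinds: linear equating and equipercentile equating. In different settings, different equating methods yield different equating results. Taking true-score equating as the criterion, this study uses Monte Carlo simulation to compare the results of the two CTT equating methods under various item difficulty distributions and various sample sizes. The results show: (1) the error of linear equating is strongly affected by the item difficulty distribution, whereas the error of equipercentile equating is almost unaffected by it. (2) The error of linear equating is almost unaffected by sample size, whereas the error of equipercentile equating is strongly affected by it. (3) Regardless of the item difficulty distribution, as long as the sample size is large enough, equipercentile equating performs better than linear equating.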
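The two CTT methods compared in this abstract can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the mid-point percentile-rank convention, and the use of linearly interpolated sample quantiles are assumptions made for illustration, and no smoothing or continuization is applied.

```python
import numpy as np

def linear_equate(x, scores_x, scores_y):
    """Map a form-X score to the form-Y scale by matching means and SDs."""
    mu_x, sd_x = np.mean(scores_x), np.std(scores_x)
    mu_y, sd_y = np.mean(scores_y), np.std(scores_y)
    return float(mu_y + (sd_y / sd_x) * (x - mu_x))

def equipercentile_equate(x, scores_x, scores_y):
    """Map a form-X score to the form-Y score with the same percentile rank."""
    scores_x = np.asarray(scores_x)
    # Mid-point percentile rank (an assumed convention): P(X < x) + 0.5 * P(X = x)
    p = np.mean(scores_x < x) + 0.5 * np.mean(scores_x == x)
    # Form-Y sample quantile at that rank (NumPy's default linear interpolation)
    return float(np.quantile(scores_y, p))
```

Linear equating uses only two moments per form, which is consistent with the abstract's finding that it is stable in small samples but sensitive to the shape of the score distribution. Equipercentile equating uses the whole empirical distribution, so it needs more data to estimate well.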
7.
8.
9.
In standardized testing, equating is used to ensure comparability of test scores across multiple test administrations. One equipercentile observed-score equating method is kernel equating, where an essential step is to obtain continuous approximations to the discrete score distributions by applying a kernel with a smoothing bandwidth parameter. When estimating the bandwidth, additional variability is introduced which is currently not accounted for when calculating the standard errors of equating. This poses a threat to the accuracy of the standard errors of equating. In this study, the asymptotic variance of the bandwidth parameter estimator is derived and a modified method for calculating the standard error of equating that accounts for the bandwidth estimation variability is introduced for the equivalent groups design. A simulation study is used to verify the derivations and confirm the accuracy of the modified method across several sample sizes and test lengths as compared to the existing method and the Monte Carlo standard error of equating estimates. The results show that the modified standard errors of equating are accurate under the considered conditions. Furthermore, the modified and the existing methods produce similar results, which suggests that the bandwidth variability impact on the standard error of equating is minimal.
10.
11.
Russell G. Almond 《International Journal of Testing》2014,14(1):73-91
Assessments consisting of only a few extended constructed response items (essays) are not typically equated using anchor test designs as there are typically too few essay prompts in each form to allow for meaningful equating. This article explores the idea that output from an automated scoring program designed to measure writing fluency (a common objective of many writing prompts) can be used in place of a more traditional anchor. The linear-logistic equating method used in this article is a variant of the Tucker linear equating method appropriate for the limited score range typical of essays. The procedure is applied to historical data. Although the procedure only results in small improvements over identity equating (not equating prompts), it does produce a viable alternative, and a mechanism for checking that the identity equating is appropriate. This may be particularly useful for measuring rater drift or equating mixed format tests.
12.
David Andrich 《Psychometrika》2010,75(2):292-308
Rasch models are characterised by sufficient statistics for all parameters. In the Rasch unidimensional model for two ordered categories, the parameterisation of the person and item is symmetrical and it is readily established that the total scores of a person and item are sufficient statistics for their respective parameters. In contrast, in the unidimensional polytomous Rasch model for more than two ordered categories, the parameterisation is not symmetrical. Specifically, each item has a vector of item parameters, one for each category, and each person only one person parameter. In addition, different items can have different numbers of categories and, therefore, different numbers of parameters. The sufficient statistic for the parameters of an item is itself a vector. In estimating the person parameters in presently available software, these sufficient statistics are not used to condition out the item parameters. This paper derives a conditional, pairwise, pseudo-likelihood and constructs estimates of the parameters of any number of persons which are independent of all item parameters and of the maximum scores of all items. It also shows that these estimates are consistent. Although Rasch’s original work began with equating tests using test scores, and not with items of a test, the polytomous Rasch model has not been applied in this way. Operationally, this is because the current approaches, in which item parameters are estimated first, cannot handle test data where there may be many scores with zero frequencies. A small simulation study shows that, when using the estimation equations derived in this paper, such a property of the data is no impediment to the application of the model at the level of tests. This opens up the possibility of using the polytomous Rasch model directly in equating test scores.
13.
A Bayesian nonparametric model is introduced for score equating. It is applicable to all major equating designs, and has advantages over previous equating models. Unlike the previous models, the Bayesian model accounts for positive dependence between distributions of scores from two tests. The Bayesian model and the previous equating models are compared through the analysis of data sets famous in the equating literature. Also, the classical percentile-rank, linear, and mean equating models are each proven to be a special case of a Bayesian model under a highly-informative choice of prior distribution.
14.
Wim J. van der Linden 《Psychometrika》2000,65(4):437-456
Observed-score equating using the marginal distributions of two tests is not necessarily the universally best approach it has been claimed to be. On the other hand, equating using the conditional distributions given the ability level of the examinee is theoretically ideal. Possible ways of dealing with the requirement of known ability are discussed, including such methods as conditional observed-score equating at point estimates or posterior expected conditional equating. The methods are generalized to the problem of observed-score equating with a multivariate ability structure underlying the scores.
This article is based on the author's Presidential Address given on July 7, 2000 at the 65th Annual Meeting of the Psychometric Society held at the University of British Columbia, Vancouver, Canada. The author is most indebted to Wim M.M. Tielen for his computational assistance and Cees A.W. Glas for his comments on a draft of this paper.
15.
In the discussion of mean square difference (MSD) and standard error of measurement (SEM), Barchard (2012) concluded that the MSD between 2 sets of test scores is greater than $2(\mathrm{SEM})^2$ and SEM underestimates the score difference between 2 tests when the 2 tests are not parallel. This conclusion has limitations for 2 reasons. First, strictly speaking, MSD should not be compared to SEM because they measure different things, have different assumptions, and capture different sources of errors. Second, the related proof and conclusions in Barchard hold only under the assumptions of equal reliabilities, homogeneous variances, and independent measurement errors. To address the limitations, we propose that MSD should be compared to the standard error of measurement of difference scores ($\mathrm{SEM}_{x-y}$) so that the comparison can be extended to the conditions when 2 tests have unequal reliabilities and score variances.
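For orientation, the standard classical-test-theory expressions behind this comparison can be written out as follows. This is a sketch under the usual assumptions rather than the paper's own derivation; the symbols $\sigma_x$, $\sigma_y$ (observed-score standard deviations) and $\rho_{xx'}$, $\rho_{yy'}$ (reliabilities) are notation introduced here for illustration.

```latex
\begin{align}
\mathrm{SEM}_x &= \sigma_x \sqrt{1 - \rho_{xx'}}, \qquad
\mathrm{SEM}_y = \sigma_y \sqrt{1 - \rho_{yy'}}, \\
\mathrm{SEM}_{x-y} &= \sqrt{\mathrm{SEM}_x^2 + \mathrm{SEM}_y^2}
  \quad \text{(independent measurement errors)}, \\
\mathrm{SEM}_{x-y}^2 &= 2\,\mathrm{SEM}^2
  \quad \text{when } \sigma_x = \sigma_y \text{ and } \rho_{xx'} = \rho_{yy'} .
\end{align}
```

The last line shows why the quantity $2(\mathrm{SEM})^2$ arises only in the special case of equal variances and equal reliabilities, while $\mathrm{SEM}_{x-y}$ remains defined when the two tests differ on both.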
16.
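Testlets appear increasingly often in all kinds of examinations. Equating tests that contain testlets with a standard IRT model may distort the equating results because the local dependence within testlets is ignored. To address this problem, we used the testlet-based 2PTM model and IRT characteristic curve equating, took the size of the error in the estimated equating coefficients as the criterion, used the Wilcoxon signed-rank test as the basis for comparison, and ran a large number of Monte Carlo simulation experiments under several different conditions. The results show that the testlet model 2PTM, which accounts for local dependence, yields smaller equating error than the 2PLM in the great majority of cases, and the difference is significant. In addition, 2PTM equating was carried out with six different equating criteria, and the relative merits of the criteria under different conditions were evaluated.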
17.
Recent research on curriculum-based measurement of oral reading fluency has revealed important issues in current passage development procedures, highlighting how dissimilar passages are problematic for monitoring student progress. The purpose of this paper is to describe statistical equating as an option for achieving equivalent scores across non-parallel reading passages. The psychometric and design properties of words-correct scores are examined, and the requirements of traditional equating methods are discussed. Simulated and empirical words-correct scores are used to demonstrate the steps in the equating process and the situations in which each method is most appropriate.
18.
A method of the IRT observed-score equating using chain equating through a third test without equating coefficients is presented with the assumption of the three-parameter logistic model. The asymptotic standard errors of the equated scores by this method are obtained using the results given by M. Liou and P.E. Cheng. The asymptotic standard errors of the IRT observed-score equating method using a synthetic examinee group with equating coefficients, which is a currently used method, are also provided. Numerical examples show that the standard errors by these observed-score equating methods are similar to those by the corresponding true score equating methods except in the range of low scores. The author is indebted to Michael J. Kolen for access to the real data used in this article and anonymous reviewers for their corrections and suggestions on this work.
19.
In the theory of test validity it is assumed that error scores on two distinct tests, a predictor and a criterion, are uncorrelated. The expected-value concept of true score in the classical test-theory model as formulated by Lord and Novick, Guttman, and others, implies mathematically, without further assumptions, that true scores and error scores are uncorrelated. This concept does not imply, however, that error scores on two arbitrary tests are uncorrelated, and an additional axiom of “experimental independence” is needed in order to obtain familiar results in the theory of test validity. The formulas derived in the present paper do not depend on this assumption and can be applied to all test scores. These more general formulas reveal some unexpected and anomalous properties of test validity and have implications for the interpretation of validity coefficients in practice. Under some conditions there is no attenuation produced by error of measurement, and the correlation between observed scores sometimes can exceed the correlation between true scores, so that the usual correction for attenuation may be inappropriate and misleading. Observed scores on two tests can be positively correlated even when true scores are negatively correlated, and the validity coefficient can exceed the index of reliability. In some cases of practical interest, the validity coefficient will decrease with increase in test length. These anomalies sometimes occur even when the correlation between error scores is quite small, and their magnitude is inversely related to test reliability. The elimination of correlated errors in practice will not enhance a test's predictive value, but will restore the properties of the validity coefficient that are familiar in the classical theory.
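As a point of reference for the "usual correction for attenuation" mentioned in this abstract, the classical formula can be sketched as below. It is the textbook correction, which assumes uncorrelated error scores, and not one of the paper's more general formulas. The function name and arguments are illustrative assumptions.

```python
import math

def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Classical correction for attenuation.

    Estimates the true-score correlation from the observed correlation r_xy
    and the two reliabilities, ASSUMING error scores on the two tests are
    uncorrelated. The abstract argues this can mislead when they are not.
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)
```

With an observed correlation of 0.42 and reliabilities of 0.7 on both tests, the corrected value is 0.42 / 0.7 = 0.6. If the errors are in fact positively correlated, part of the observed 0.42 already reflects shared error, which is why the abstract cautions against applying the correction mechanically.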
20.
Claims have been made that grade appropriate curriculum-based measurement of reading (CBM-R) passages are of comparable difficulty and can be used interchangeably to monitor student progress. Empirical evidence to support claims of equivalence has been lacking. This research investigated the basis for making claims of equivalence. The use of readability statistics to justify passage equivalence was found to be lacking. Using a general measurement model for congeneric tests, CBM-R passages were found to measure a single latent factor with a high degree of reliability. However, evidence indicated that the raw scores, words read correctly per minute, across passages did not provide equivalent measurements. Statistical equating was investigated as an approach to overcome the lack of equivalence with promising results.