Similar Articles
20 similar articles found (search time: 31 ms)
1.
The test-taking behaviour of some examinees may be so unusual that their test scores cannot be regarded as appropriate measures of their ability. Appropriateness measurement is a model-based approach to the problem of identifying these test scores. The intuitions and basic theory supporting appropriateness measurement are presented together with a critical review of earlier work and a series of interrelated experiments. We conclude that appropriateness measurement techniques are robust to errors in parameter estimation and robust to the presence of unidentified aberrant examinees in the test norming sample. In addition, the frequently criticized ‘three-parameter logistic’ latent trait model was found to be adequate for the detection of spuriously low scores in actual test data.
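The ‘three-parameter logistic’ model the abstract refers to gives the probability of a correct response as a function of latent ability; a minimal sketch in Python, with illustrative parameter values that are not taken from the paper:

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) probability of a correct response.
    theta: ability; a: discrimination; b: difficulty; c: lower asymptote (guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# With c = 0.2, even a very low-ability examinee answers correctly
# with probability at least 0.2 (e.g. by guessing on a 5-option item).
p_low = p_3pl(theta=-3.0, a=1.2, b=0.0, c=0.2)
p_high = p_3pl(theta=3.0, a=1.2, b=0.0, c=0.2)
```

The nonzero lower asymptote is what lets the model accommodate correct answers from guessing when screening for spuriously low scores.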

2.
Compared with traditional paper-and-pencil tests (Paper And Pencil Based Test, P&P), computerized adaptive testing (Computerized Adaptive Testing, CAT) selects items adaptively according to examinees' responses, which not only shortens the test but also greatly improves measurement precision. However, the vast majority of current CATs do not allow examinees to revise their answers, mainly because researchers worry that answer revision would reduce the validity of the CAT. Allowing answer revision is consistent with examinees' habitual test-taking behaviour, and revised scores better reflect examinees' true level of ability, which could further promote the practical application of CAT. Existing research has proposed control methods for answer-revisable CAT from three directions: test design, improved item selection strategies, and model construction. Future research should further explore comparisons and combinations of these methods, as well as answer revision in cognitive diagnostic CAT (Cognitive Diagnostic CAT, CD-CAT).
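The adaptive item selection the abstract describes can be sketched as follows. Under the Rasch model, item information P(1 − P) peaks where item difficulty equals ability, so a maximum-information selection step reduces to choosing the unused item whose difficulty is nearest the provisional ability estimate; the item bank below is hypothetical:

```python
def next_item(theta_hat, bank, administered):
    """One maximum-information CAT selection step under the Rasch model.
    bank: dict of item id -> difficulty (values here are invented for illustration).
    Picks the not-yet-administered item with difficulty nearest theta_hat."""
    candidates = {i: b for i, b in bank.items() if i not in administered}
    return min(candidates, key=lambda i: abs(candidates[i] - theta_hat))

bank = {"q1": -1.2, "q2": -0.4, "q3": 0.1, "q4": 0.8, "q5": 1.5}
first = next_item(0.0, bank, set())     # difficulty nearest 0.0
second = next_item(0.6, bank, {"q3"})   # nearest 0.6 among remaining items
```

This greedy step is what makes revising an earlier answer delicate: changing a response shifts the ability estimate on which all subsequent selections were based.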

3.
If a loss function is available specifying the social cost of an error of measurement in the score on a unidimensional test, an asymptotic method, based on item response theory, is developed for optimal test design for a specified target population of examinees. Since in the real world such loss functions are not available, it is more useful to reverse this process; thus a method is developed for finding the loss function for which a given test is an optimally designed test for the target population. An illustrative application is presented for one operational test. This work was supported in part by contract N00014-80-C-0402, project designation NR 150-453 between the Office of Naval Research and Educational Testing Service. Reproduction in whole or in part is permitted for any purpose of the United States Government.

4.
Computer-based assessments make it possible to collect detailed data that capture examinees’ progress through a test and the time spent on individual actions. This article presents a study using process and timing data to aid understanding of an international language assessment and the examinees. Issues regarding test-taking strategies, test speededness, test design, and their relationship to examinees’ demographic backgrounds and performance are also discussed.

5.
Standard procedures for estimating item parameters in item response theory (IRT) ignore collateral information that may be available about examinees, such as their standing on demographic and educational variables. This paper describes circumstances under which collateral information about examinees may be used to make inferences about item parameters more precise, and circumstances under which it must be used to obtain correct inferences. This work was supported by Contract No. N00014-85-K-0683, project designation NR 150-539, from the Cognitive Science Program, Cognitive and Neural Sciences Division, Office of Naval Research. Reproduction in whole or in part is permitted for any purpose of the United States Government. We are indebted to Tim Davey, Eugene Johnson, and three anonymous referees for their comments on earlier versions of the paper.

6.
The problem of predicting universe scores for samples of examinees based on their responses to samples of items is treated. A general measurement procedure is described in which multiple test forms are developed from a table of specifications and each form is administered to a different sample of examinees. The measurement model categorizes items according to the cells of such a table, and the linear function derived for minimizing error variance in prediction uses responses to these categories. In addition, some distinctions are drawn between aspects of the approach taken here and the familiar regressed score estimates. The author thanks Robert L. Brennan, Michael J. Kolen, and Richard Sawyer for helpful comments and corrections, and anonymous reviewers for suggested improvements.

7.
A numerical procedure is outlined for obtaining an interval estimate of the regression of true score on observed score. Only the frequency distribution of observed scores is needed for this. The procedure assumes that the conditional distribution of observed scores for fixed true score is binomial. The procedure is applied to several sets of test data. This research was sponsored in part by the Personnel and Training Research Programs, Psychological Sciences Division, Office of Naval Research, under Contract No. N00014-69-C-0017, Contract Authority Identification Number, NR No. 150-303, and Educational Testing Service. Reproduction in whole or in part is permitted for any purpose of the United States Government.
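The binomial assumption in this abstract means that, for an examinee with true proportion-correct π on an n-item test, the observed number-correct score follows Binomial(n, π); a small illustration with hypothetical values:

```python
from math import comb

def p_observed_given_true(x, n, pi):
    """P(observed score = x | true proportion-correct = pi) under the
    binomial error model assumed in the abstract."""
    return comb(n, x) * pi**x * (1.0 - pi)**(n - x)

# Observed-score distribution on a 10-item test for a true score of 0.7.
n = 10
dist = [p_observed_given_true(x, n, 0.7) for x in range(n + 1)]
```

Summing this conditional distribution against a distribution of true scores yields the observed-score frequency distribution, which is the only input the paper's procedure requires.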

8.
Rationale and the actual procedures of two nonparametric approaches, called the Bivariate P.D.F. Approach and the Conditional P.D.F. Approach, are introduced and discussed for estimating the operating characteristic of a discrete item response, that is, the conditional probability, given the latent trait, that the examinee's response will be that specific response. These methods are distinguished by two features: (a) estimation is made without assuming any mathematical forms, and (b) it is based upon a relatively small sample of several hundred to a few thousand examinees. Some examples of the results obtained by the Simple Sum Procedure and the Differential Weight Procedure of the Conditional P.D.F. Approach are given, using simulated data. The usefulness of these nonparametric methods is also discussed. This research was mostly supported by the Office of Naval Research (N00014-77-C-0360, N00014-81-C-0569, N00014-87-K-0320, N00014-90-J-1456).

9.
Information functions are used to find the optimum ability levels and maximum contributions to information for estimating item parameters in three commonly used logistic item response models. For the three- and two-parameter logistic models, examinees who contribute maximally to the estimation of item difficulty contribute little to the estimation of item discrimination. This suggests that in applications that depend heavily upon the veracity of individual item parameter estimates (e.g. adaptive testing or test construction), better item calibration results may be obtained (for fixed sample sizes) from examinee calibration samples in which ability is widely dispersed. This work was supported by Contract No. N00014-83-C-0457, project designation NR 150-520, from the Cognitive Science Program, Cognitive and Neural Sciences Division, Office of Naval Research and Educational Testing Service through the Program Research Planning Council. Reproduction in whole or in part is permitted for any purpose of the United States Government. The author wishes to acknowledge the invaluable assistance of Maxine B. Kingston in carrying out this study, and to thank Charles Lewis for his many insightful comments on earlier drafts of this paper.
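For the two-parameter logistic model, the item information function is I(θ) = a²P(θ)(1 − P(θ)), which peaks where P(θ) = 0.5, i.e. at θ = b; a brief sketch with illustrative parameters:

```python
import math

def info_2pl(theta, a, b):
    """Item information for a 2PL item: a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Information is largest at theta = b, where P = 0.5 and I = a^2 / 4.
a, b = 1.5, 0.5
thetas = [b + 0.1 * k for k in range(-30, 31)]
peak = max(thetas, key=lambda t: info_2pl(t, a, b))
```

Because information for difficulty concentrates near θ = b while discrimination is best estimated from examinees further away, a widely dispersed calibration sample serves both parameters, which is the abstract's practical recommendation.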

10.
This study proposes a new item parameter linking method for the common-item nonequivalent groups design in item response theory (IRT). Previous studies assumed that examinees are randomly assigned to either test form. However, examinees can frequently select their own test forms and tests often differ according to examinees’ abilities. In such cases, concurrent calibration or multiple group IRT modeling without modeling test form selection behavior can yield severely biased results. We proposed a model wherein test form selection behavior depends on test scores and used a Monte Carlo expectation maximization (MCEM) algorithm. This method provided adequate estimates of testing parameters.

11.
In item response theory (IRT), the invariance property states that item parameter estimates are independent of the examinee sample, and examinee ability estimates are independent of the test items. While this property has long been established and understood by the measurement community for IRT models, the same cannot be said for diagnostic classification models (DCMs). DCMs are a newer class of psychometric models that are designed to classify examinees according to levels of categorical latent traits. We examined the invariance property for general DCMs using the log-linear cognitive diagnosis model (LCDM) framework. We conducted a simulation study to examine the degree to which theoretical invariance of LCDM classifications and item parameter estimates can be observed under various sample and test characteristics. Results illustrated that LCDM classifications and item parameter estimates show clear invariance when adequate model data fit is present. To demonstrate the implications of this important property, we conducted additional analyses to show that using pre-calibrated tests to classify examinees provided consistent classifications across calibration samples with varying mastery profile distributions and across tests with varying difficulties.

12.
Assuming a nonparametric family of item response theory models, a theory-based procedure for testing the hypothesis of unidimensionality of the latent space is proposed. The asymptotic distribution of the test statistic is derived assuming unidimensionality, thereby establishing an asymptotically valid statistical test of the unidimensionality of the latent trait. Based upon a new notion of dimensionality, the test is shown to have asymptotic power 1. A 6300-trial Monte Carlo study using published item parameter estimates of widely used standardized tests indicates conservative adherence to the nominal level of significance and statistical power averaging 81 out of 100 rejections for examinee sample sizes and psychological test lengths often incurred in practice. The referees' comments were remarkably detailed and greatly enhanced the writeup and sensitized the author to certain pertinent issues. Discussions with Fritz Drasgow, Lloyd Humphreys, Dennis Jennings, Brian Junker, Robert Linn, Ratna Nandakumar, and Robin Shealy were also very useful. This research was supported by the Office of Naval Research under grant N00014-84-K-0186; NR 150-533, and by the National Science Foundation under grant DMS 85-03321.

13.
Cognitive diagnosis models of educational test performance rely on a binary Q‐matrix that specifies the associations between individual test items and the cognitive attributes (skills) required to answer those items correctly. Current methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes are based on parametric estimation methods such as expectation maximization (EM) and Markov chain Monte Carlo (MCMC) that frequently encounter difficulties in practical applications. In response to these difficulties, non‐parametric classification techniques (cluster analysis) have been proposed as heuristic alternatives to parametric procedures. These non‐parametric classification techniques first aggregate each examinee's test item scores into a profile of attribute sum scores, which then serve as the basis for clustering examinees into proficiency classes. Like the parametric procedures, the non‐parametric classification techniques require that the Q‐matrix underlying a given test be known. Unfortunately, in practice, the Q‐matrix for most tests is not known and must be estimated to specify the associations between items and attributes, risking a misspecified Q‐matrix that may then result in the incorrect classification of examinees. This paper demonstrates that clustering examinees into proficiency classes based on their item scores rather than on their attribute sum‐score profiles does not require knowledge of the Q‐matrix, and results in a more accurate classification of examinees.

14.
In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e‐rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability.

15.
To explore whether test-taking styles (performance factors) can contribute to explaining gender-related differences on tests of spatial ability, 15 spatial tests were administered to three samples of subjects. On each test, number-correct scores and ratio scores (number of items solved divided by the number of items attempted) were computed. In accordance with previous research findings, the use of ratio scores significantly reduced the magnitude of the gender-related differences on the Mental Rotations Test. For most of the remaining tests, however, the reduction of the gender-related score difference was small. It was concluded that the difference reduction for the Mental Rotations Test was specific to the format of this test. In common spatial tests, performance factors may account for a small portion of gender-related variance, but the bulk of this variance must be attributed to other factors.
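The ratio score defined in the abstract is simply the number of items solved divided by the number attempted; a minimal sketch in which the response pattern is invented for illustration:

```python
def scores(responses):
    """Number-correct and ratio score for one examinee.
    responses: 1 = solved, 0 = attempted but wrong, None = not attempted."""
    attempted = [r for r in responses if r is not None]
    number_correct = sum(attempted)
    ratio = number_correct / len(attempted) if attempted else 0.0
    return number_correct, ratio

# 10-item test: 8 items attempted, 6 of those solved.
resp = [1, 1, 0, 1, 1, 0, 1, 1, None, None]
number_correct, ratio = scores(resp)
```

Unlike the number-correct score, the ratio score does not penalize unattempted items, which is why it can separate speed-related test-taking style from spatial ability itself.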

16.
In low-stakes assessments, test performance has few or no consequences for examinees themselves, so that examinees may not be fully engaged when answering the items. Instead of engaging in solution behaviour, disengaged examinees might randomly guess or generate no response at all. When ignored, examinee disengagement poses a severe threat to the validity of results obtained from low-stakes assessments. Statistical modelling approaches in educational measurement have been proposed that account for non-response or for guessing, but do not consider both types of disengaged behaviour simultaneously. We bring together research on modelling examinee engagement and research on missing values and present a hierarchical latent response model for identifying and modelling the processes associated with examinee disengagement jointly with the processes associated with engaged responses. To that end, we employ a mixture model that identifies disengagement at the item-by-examinee level by assuming different data-generating processes underlying item responses and omissions, respectively, as well as response times associated with engaged and disengaged behaviour. By modelling examinee engagement with a latent response framework, the model allows assessing how examinee engagement relates to ability and speed as well as to identify items that are likely to evoke disengaged test-taking behaviour. An illustration of the model by means of an application to real data is presented.

17.
The many null distributions of person fit indices
This paper deals with the situation of an investigator who has collected the scores of n persons on a set of k dichotomous items, and wants to investigate whether the answers of all respondents are compatible with the one-parameter logistic test model of Rasch. Contrary to the standard analysis of the Rasch model, where all persons are kept in the analysis and badly fitting items may be removed, this paper studies the alternative model in which a small minority of persons has an answer strategy not described by the Rasch model. Such persons are called anomalous or aberrant. From the response vectors consisting of k symbols each equal to 0 or 1, it is desired to classify each respondent as either anomalous or as conforming to the model. As this model is probabilistic, such a classification will possibly involve false positives and false negatives. Both for the Rasch model and for other item response models, the literature contains several proposals for a person fit index, which expresses for each individual the plausibility that his/her behavior follows the model. The present paper argues that such indices can only provide a satisfactory solution to the classification problem if their statistical distribution is known under the null hypothesis that all persons answer according to the model. This distribution, however, turns out to be rather different for different values of the person's latent trait. This value will be called the ability parameter, although our results are equally valid for Rasch scales measuring other attributes. As the true ability parameter is unknown, one can only use its estimate in order to obtain an estimated person fit value and an estimated null hypothesis distribution. The paper describes three specifications for the latter: assuming that the true ability equals its estimate, integrating across the ability distribution assumed for the population, and conditioning on the total score, which is in the Rasch model the sufficient statistic for the ability parameter. Classification rules for aberrance are worked out for each of the three specifications. Depending on test length, item parameters and desired accuracy, they are based on the exact distribution, its Monte Carlo estimate, or a new and promising approximation based on the moments of the person fit statistic. Results for the likelihood person fit statistic are given in detail; the methods could also be applied to other fit statistics. A comparison of the three specifications results in the recommendation to condition on the total score, as this avoids some problems of interpretation that affect the other two specifications. The authors express their gratitude to the reviewers and to many colleagues for comments on an earlier version.
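The Monte Carlo route mentioned above can be sketched as follows: fix an ability value, simulate response vectors under the Rasch model, and collect the log-likelihood person fit statistic to estimate its null distribution and a classification cutoff. Item difficulties, the number of replications, and the 5% level below are all illustrative:

```python
import math
import random

def rasch_p(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_lik(responses, theta, difficulties):
    """Log-likelihood person fit statistic of a 0/1 response vector."""
    ll = 0.0
    for u, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        ll += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return ll

def mc_null(theta, difficulties, n_sim=2000, seed=7):
    """Monte Carlo estimate of the statistic's null distribution at a fixed theta."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        u = [1 if rng.random() < rasch_p(theta, b) else 0 for b in difficulties]
        sims.append(log_lik(u, theta, difficulties))
    return sorted(sims)

difficulties = [-1.5, -1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 1.5]
null = mc_null(0.0, difficulties)
cutoff = null[int(0.05 * len(null))]  # flag as aberrant if statistic falls below this
```

This sketch fixes θ at a known value, i.e. the first of the three specifications; the paper's recommended variant conditions on the total score instead.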

18.
Computerized classification testing (CCT) commonly chooses items maximizing information at the cut score, which yields the most information for decision-making. However, a corollary problem is that all examinees will be given the same set of items, resulting in a high test overlap rate and unbalanced item bank usage, which threatens test security. Moreover, another pivotal issue for CCT is time control. Since both extremely long response time (RT) and large RT variability across examinees intensify time-induced anxiety, it is crucial to reduce the number of examinees exceeding the time limit and the differences between examinees' test-taking times. To satisfy these practical needs, this paper proposes the novel idea of stage adaptiveness to tailor the item selection process to the decision-making requirement in each step and generate fresh insight into the existing response time selection method. Results indicate that balanced item usage as well as short and stable test times across examinees can be achieved via the new methods.

19.
Score equity assessment (SEA) refers to an examination of population invariance of equating across two or more subpopulations of test examinees. Previous SEA studies have shown that score equity may be present for examinees scoring at particular test score ranges but absent for examinees scoring at other score ranges. No studies to date have performed research for the purpose of understanding why score equity can be inconsistent across the score range of some tests. The purpose of this study is to explore a source of uneven subpopulation score equity across the score range of a test. It is hypothesized that the difficulty of anchor items displaying differential item functioning (DIF) is directly related to the score location at which issues of score inequity are observed. The simulation study supports the hypothesis that the difficulty of DIF items has a systematic impact on the uneven nature of conditional score equity.

20.
Taking a life satisfaction scale as an example, this study used confirmatory factor analysis to examine measurement invariance between web-based tests and traditional paper-and-pencil tests in the Chinese cultural context. Results showed weak invariance between the two formats, meaning that web-based and paper-and-pencil tests share the same unit of measurement; however, only partial strong invariance and partial strict invariance held, so the influence of the administration environment on the results cannot be ignored. The study indicates that a properly designed web-based test is reliable, and also suggests that whenever a test is used in different settings, examining its measurement invariance is essential.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号