Similar literature
20 similar documents found; search time: 31 ms
1.
Optimal appropriateness measurement   Total citations: 2, self-citations: 0, citations by others: 2
The test-taking behavior of some examinees may be so idiosyncratic that their test scores may not be comparable to the scores of more typical examinees. Appropriateness measurement attempts to use answer patterns to recognize atypical examinees. In this report appropriateness measurement procedures are viewed as statistical tests for choosing between a null hypothesis of normal test-taking behavior and an alternative hypothesis of atypical test-taking behavior. Most powerful tests for inappropriateness are described together with methods for computing their power. A recursion greatly simplifying the calculation of optimal test statistics is described and illustrated. The work reported in this article was supported by United States Office of Naval Research contracts N00014-79C-0752, NR 154-445 and N00014-83K-0397, NR 150-518, Michael V. Levine, Principal Investigator.

2.
The problem of predicting universe scores for samples of examinees based on their responses to samples of items is treated. A general measurement procedure is described in which multiple test forms are developed from a table of specifications and each form is administered to a different sample of examinees. The measurement model categorizes items according to the cells of such a table, and the linear function derived for minimizing error variance in prediction uses responses to these categories. In addition, some distinctions are drawn between aspects of the approach taken here and the familiar regressed score estimates. The author thanks Robert L. Brennan, Michael J. Kolen, and Richard Sawyer for helpful comments and corrections, and anonymous reviewers for suggested improvements.

3.
Cognitive diagnosis models of educational test performance rely on a binary Q‐matrix that specifies the associations between individual test items and the cognitive attributes (skills) required to answer those items correctly. Current methods for fitting cognitive diagnosis models to educational test data and assigning examinees to proficiency classes are based on parametric estimation methods such as expectation maximization (EM) and Markov chain Monte Carlo (MCMC) that frequently encounter difficulties in practical applications. In response to these difficulties, non‐parametric classification techniques (cluster analysis) have been proposed as heuristic alternatives to parametric procedures. These non‐parametric classification techniques first aggregate each examinee's test item scores into a profile of attribute sum scores, which then serve as the basis for clustering examinees into proficiency classes. Like the parametric procedures, the non‐parametric classification techniques require that the Q‐matrix underlying a given test be known. Unfortunately, in practice, the Q‐matrix for most tests is not known and must be estimated to specify the associations between items and attributes, risking a misspecified Q‐matrix that may then result in the incorrect classification of examinees. This paper demonstrates that clustering examinees into proficiency classes based on their item scores rather than on their attribute sum‐score profiles does not require knowledge of the Q‐matrix, and results in a more accurate classification of examinees.

4.
The regression framework is often the method of choice used by psychologists for predicting organizationally relevant outcomes from test scores. However, alternatives to regression exist, and these techniques may provide better prediction of outcomes and a more effective means of classifying examinees for selection and placement. This research describes two of these alternatives, decision tree methodology and optimal appropriateness measurement (OAM), and how they were used to optimize the prediction of attrition among a sample of first-term enlisted soldiers (N = 22,537) using a temperament inventory called the Assessment of Individual Motivation (AIM). Results demonstrated that the OAM approach provided better differentiation between “stayers” and “leavers” after 12 months than either the traditional logistic regression or the decision tree methods.

5.
THE EFFECTS OF REQUIRED ELABORATION OF ANSWERS TO BIODATA QUESTIONS   Total citations: 2, self-citations: 0, citations by others: 2
The impact of a request that examinees elaborate on their answers to a subset of items in a biodata instrument was evaluated. Four forms of a test in which different subsets of items are elaborated were randomly administered to 4 groups of examinees taking a pilot form of a selection instrument for a civil service position. Results indicated significantly lower scores on items for which elaborations were requested than on items for which no elaborations were requested. Lower scores were also observed for nonelaborated items when these items were embedded among those that were elaborated, and lower scores were found when the elaborated items were presented only in the first half of the test. Although the results suggest that requiring elaborated answers may reduce scores on biodata items, several practical and theoretical questions should be investigated to determine the utility of this approach as a method of reducing socially desirable responding.

6.
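朱玮 (Zhu Wei), 丁树良 (Ding Shuliang), 陈小攀 (Chen Xiaopan). 《心理学报》 (Acta Psychologica Sinica), 2006, 38(3): 453-460
For the problem of estimating the unknown parameters of the two-parameter logistic model (2PLM) in IRT, a new estimation method is proposed: minimum χ²/EM estimation. Taking full account of the differences between item response theory (IRT) and classical test theory (CTT), the new method improves Berkson's minimum χ² estimation from the perspective of statistical computing. It removes the unrealistic requirement that ability parameters be known when applying Berkson's minimum χ² estimation, which broadens its range of application. Experimental results show that the new method estimates ability parameters more accurately than BILOG, and that when the sample size exceeds 2000 its item parameter estimates are also better than BILOG's. The experiments also show that the new method is robust.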
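For orientation, the two-parameter logistic model named in this abstract can be sketched as below. This is only the standard textbook item response function, not the paper's minimum χ²/EM estimator. The function name `p_correct_2pl` is hypothetical, and the optional scaling constant (often written D = 1.7) is assumed to be absorbed into the discrimination parameter.

```python
import math


def p_correct_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model.

    theta: examinee ability
    a:     item discrimination (assumed to absorb any scaling constant D)
    b:     item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for theta in (-2.0, 0.0, 0.5, 2.0):
        print(theta, round(p_correct_2pl(theta, a=1.2, b=0.5), 4))
```

When ability equals difficulty the probability is exactly 0.5, and a larger `a` makes the curve steeper around that point.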

7.
In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE® General Analytical Writing and until 2009 in the case of TOEFL® iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e‐rater®. In addition, available information relevant to examinee performance may include scores on related tests. We suggest application of standard results from classical test theory to the available data to obtain best linear predictors of true traditional test scores. In performing such analysis, we require estimation of variances and covariances of measurement errors, a task which can be quite difficult in the case of tests with limited numbers of items and with multiple measurements per item. As a consequence, a new estimation method is suggested based on samples of examinees who have taken an assessment more than once. Such samples are typically not random samples of the general population of examinees, so that we apply statistical adjustment methods to obtain the needed estimated variances and covariances of measurement errors. To examine practical implications of the suggested methods of analysis, applications are made to GRE General Analytical Writing and TOEFL iBT Writing. Results obtained indicate that substantial improvements are possible both in terms of reliability of scoring and in terms of assessment reliability.

8.
Recently, a question was raised as to whether the multidimensionality of some professional licensing exams is due to the administration of subtests measuring conceptually distinct skills or, alternatively, strategic preparation on the part of groups of examinees attempting to cope with the demands of multiple hurdle certification systems. This article illustrates a way to investigate this issue with optimal appropriateness measurement (OAM) methods and confirmatory factor analysis (CFA). Specifically, using the former paper-and-pencil American Institute of Certified Public Accountants (AICPA) Uniform Examination as an example, OAM methods were used to identify examinees who appeared unmotivated on 2 of the 4 AICPA exam subtests. Dimensionality was studied by using CFA to compare the fit of single- and 4-factor models before and after removing flagged examinees. The results indicated that the 4-factor model provided better fit than a unidimensional model even after removing nearly 30% of respondents, thus weakening the claim that multidimensionality could be attributed solely to strategic preparation.

9.
In the design of common-item equating, two groups of examinees are administered separate test forms, and each test form contains a common subset of items. We consider test equating under this situation as an incomplete data problem: examinees have observed scores on one test form and missing scores on the other. Through the use of statistical data-imputation techniques, the missing scores can be replaced by reasonable estimates, and consequently the forms may be directly equated as if both forms were administered to both groups. In this paper we discuss different data-imputation techniques that are useful for equipercentile equating; we also use empirical data to evaluate the accuracy of these techniques as compared with chained equipercentile equating. A paper presented at the European Meeting of the Psychometric Society, Barcelona, Spain, July, 1993.

10.
This study proposes a new item parameter linking method for the common-item nonequivalent groups design in item response theory (IRT). Previous studies assumed that examinees are randomly assigned to either test form. However, examinees can frequently select their own test forms and tests often differ according to examinees’ abilities. In such cases, concurrent calibration or multiple group IRT modeling without modeling test form selection behavior can yield severely biased results. We propose a model wherein test form selection behavior depends on test scores, and estimate it with a Monte Carlo expectation maximization (MCEM) algorithm. This method provided adequate estimates of testing parameters.

11.
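In psychological and educational measurement, dichotomous items and testlet items are two common item types. Tests that mix these two kinds of items have important practical applications. When answering, examinees often produce aberrant responses because their latent ability does not match the item difficulty, and these aberrant responses reduce the accuracy of latent trait estimation in IRT. Simulation experiments show that, when outliers are present, a robust estimation method for the mixed dichotomous-item and testlet IRT model estimates examinees' latent traits more accurately than maximum likelihood estimation and can meet the needs of practical testing.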

12.
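A testing situation with known item parameters and examinee scores was designed, and ability was estimated under two-, three-, and four-parameter logistic weighted models. The ability steps between examinee score levels were found to be evenly spaced, and examinee scores reflected well the weighting effect of scores under polytomous scoring. Under the two-parameter logistic weighted model, examinee parameter estimates are perturbed: guessing leads to overestimation of ability, and slipping (careless errors) leads to underestimation. Under the c-type three-parameter logistic weighted model, overestimation of ability does not occur or is not apparent; under the γ-type three-parameter logistic weighted model, underestimation of ability does not occur or is not apparent; under the four-parameter logistic weighted model, neither overestimation nor underestimation occurs or is apparent. The four-parameter logistic weighted model is therefore a comparatively robust method for estimating examinee ability.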
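To make the roles of the c and γ parameters concrete, here is a minimal sketch of the standard (unweighted, dichotomous) four-parameter logistic response function. It is not the weighted polytomous model studied in this abstract, whose exact form is not given here. The function name `p_correct_4pl` and all numeric values are hypothetical.

```python
import math


def p_correct_4pl(theta, a, b, c=0.0, gamma=1.0):
    """Probability of a correct response under a standard 4PL model.

    c:     lower asymptote (guessing); keeps low-ability probabilities above 0
    gamma: upper asymptote (slipping); keeps high-ability probabilities below 1
    With c = 0 and gamma = 1 this reduces to the 2PL model.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (gamma - c) * logistic


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not from the paper).
    for theta in (-4.0, 0.0, 4.0):
        print(theta, round(p_correct_4pl(theta, a=1.0, b=0.0, c=0.2, gamma=0.95), 4))
```

Setting only `c` gives a three-parameter model with a guessing floor, and setting only `gamma` gives one with a slipping ceiling, which loosely mirrors the c-type and γ-type distinction drawn in the abstract.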

13.
Throughout the world, tests are administered to some examinees who are not fully proficient in the language in which they are being tested. It has long been acknowledged that proficiency in the language in which a test is administered often affects examinees’ performance on a test. Depending on the context and intended uses for a particular assessment, linguistic proficiency may be relevant to the tested construct and subsequent interpretations, or may be a source of construct-irrelevant variance that undermines accurate interpretation of the test performance of linguistic minorities who are not proficient in the language of the assessment. In this article, we highlight key validity issues to be considered when testing linguistic minorities, regardless of whether language is central or construct-irrelevant. We discuss examples of the different types of studies test users and developers could conduct to evaluate the validity of scores of linguistic minorities. These issues span test development and validation activities. We conclude with a list of critical factors to consider in test development and evaluation whenever linguistic minorities are tested.

14.
The following problem is considered: Given that the frequency distribution of the errors of measurement is known, determine or estimate the distribution of true scores from the distribution of observed scores for a group of examinees. Typically this problem does not have a unique solution. However, if the true-score distribution is smooth, then any two smooth solutions to the problem will differ little from each other. Methods for finding smooth solutions are developed a) for a population and b) for a sample of examinees. The results of a number of tryouts on actual test data are summarized. The writer wishes to thank Diana Lees and Virginia Lennon, who wrote the computer programs, carried out some of the mathematical derivations, and helped with other important aspects of the work. This work was supported in part by contract Nonr-2752(00) between the Office of Naval Research and Educational Testing Service. Reproduction, translation, use and disposal in whole or in part by or for the United States Government is permitted.

15.
Computer-based assessments make it possible to collect detailed data that capture examinees’ progress through a test and the time they spend on individual actions. This article presents a study using process and timing data to aid understanding of an international language assessment and the examinees. Issues regarding test-taking strategies, test speededness, test design, and their relationship to examinees’ demographic backgrounds and performance are also discussed.

16.
There is growing interest in organizationally provided or organizationally endorsed coaching. However, little is known about the effects of such coaching on test scores in operational settings. This study reports on an examination of such a program in the context of the use of a situational judgment test (SJT) for medical school admissions. We examine the effects of multiple types of coaching methods on SJT scores and on their construct‐related and predictive validities. Results suggest that (1) commercial coaching techniques may not be as effective as previously thought, whereas organizationally provided methods may be more effective, and that (2) the criterion‐related validity of the SJT scores is not degraded by the availability of coaching. Generally, this study illustrates that concerns about potential unfairness of coaching can be countered by making effective coaching available to all examinees, in the form of organizationally endorsed coaching.

17.
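毛秀珍 (Mao Xiuzhen), 辛涛 (Xin Tao). 《心理学报》 (Acta Psychologica Sinica), 2014, 46(12): 1910-1922
Item exposure control and content constraints bear on test security, reliability, and validity, and are two important kinds of non-statistical constraints in computerized adaptive testing (CAT). This paper uses five methods to select test items in cognitive diagnostic CAT under content constraints and item exposure control requirements: (1) the Monte Carlo method combined with the item eligibility method, denoted MC-IE; (2) the Monte Carlo method combined with the maximum priority index method, denoted MC-MPI; (3) the Monte Carlo method combined with the restrictive threshold method, denoted MC-RT; (4) the Monte Carlo method combined with the restrictive progressive method, denoted MC-RPG; and (5) the Monte Carlo method combined with the maximum posterior probability method, denoted MC-PP. Item banks were then constructed under five attribute structures (linear, convergent, divergent, unstructured, and independent), and examinee responses were simulated with the reparameterized unified model to compare the methods' item selection performance. The study found that (1) for the same item selection method, the distribution of item exposure rates was similar across attribute structures, and measurement precision decreased in the order linear, convergent, divergent, unstructured, independent; (2) within the same attribute structure, measurement precision ranked from highest to lowest as MC-PP, MC-IE, MC-RT, MC-MPI, MC-RPG, while uniformity of item exposure ranked from best to worst as MC-RPG, MC-MPI, MC-RT, MC-IE, MC-PP. A composite value on a common scale showed that MC-RPG performed best overall, followed by MC-MPI.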

18.
To date, exposure control procedures that are designed to control item exposure and test overlap simultaneously are based on the assumption of item sharing between pairs of examinees. However, examinees may obtain test information from more than one examinee in practice. This larger scope of information sharing needs to be taken into account in refining exposure control procedures. To control item exposure and test overlap among a group of examinees larger than two, the relationship between the two indices needs to be identified first. The purpose of this paper is to analytically derive the relationships between item exposure rate and each of the two forms of test overlap, item sharing and item pooling, for fixed‐length computerized adaptive tests. Item sharing is defined as the number of common items shared by all examinees in a group, while item pooling is the number of overlapping items that an examinee has with a group of examinees. The accuracy of the derived relationships was verified using numerical examples. The relationships derived will lay the foundation for future development of procedures to simultaneously control item exposure and item sharing or item pooling among a group of examinees larger than two.

19.
20.
We examined 633 procedures that can be used to compare the variability of scores across independent groups. The procedures, except for one, were modifications of the procedures suggested by Levene (1960) and O'Brien (1981). We modified their procedures by substituting robust measures of the typical score and variability, rather than relying on classical estimators. The robust measures that we utilized were either based on a priori or empirically determined symmetric or asymmetric trimming strategies. The Levene‐type and O'Brien‐type transformed scores were used with either the ANOVA F test, a robust test due to Lee and Fung (1985), or the Welch (1951) test. Based on four measures of robustness, we recommend a Levene‐type transformation based upon empirically determined 20% asymmetric trimmed means, involving a particular adaptive estimator, where the transformed scores are then used with the ANOVA F test.
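The general shape of a Levene-type procedure can be sketched as follows: replace each score with its absolute deviation from a robust group center, then run a one-way ANOVA F test on the transformed scores. This sketch assumes a fixed 20% symmetric trimmed mean for simplicity; the abstract recommends an empirically determined asymmetric trimming rule with an adaptive estimator, which is not reproduced here. All function names are hypothetical.

```python
from statistics import mean


def trimmed_mean(xs, prop=0.2):
    """Symmetric trimmed mean: drop int(prop * n) values from each tail."""
    ordered = sorted(xs)
    k = int(prop * len(ordered))
    return mean(ordered[k:len(ordered) - k])


def levene_type_scores(groups, prop=0.2):
    """Absolute deviations of each score from its group's trimmed mean."""
    return [[abs(x - trimmed_mean(g, prop)) for x in g] for g in groups]


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


if __name__ == "__main__":
    # Made-up data: the second group is visibly more spread out.
    narrow = [1, 2, 3, 4, 5]
    wide = [-10, -5, 3, 11, 16]
    print(anova_f(levene_type_scores([narrow, wide])))
```

The resulting F statistic would be compared against an F distribution with (k - 1, N - k) degrees of freedom; a shift in location alone leaves the transformed scores, and hence F, unchanged.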

