Similar references
20 similar references found.
1.
This study examined the item selection performance of four selection indices in MCAT with and without exposure control. The indices were: (1) the Bayesian D-optimality method; (2) the posterior expected Kullback-Leibler method (KLP); (3) the minimized error variance of the linear combination score with equal weights (V1); and (4) the minimized error variance of the composite score with optimized weights (V2). The Restrictive Threshold (RT) and Restrictive Progressive (RPG) methods for item exposure control in cognitive diagnostic CAT, together with the Maximum Priority Index (MPI) method from unidimensional CAT, were extended to MCAT. Simulation results showed that: (1) KLP, D-optimality, and V1 estimated domain scores accurately and recovered abilities better than V2; (2) although V1 and V2 improved item pool utilization relative to KLP and D-optimality, all four selection indices produced uneven item exposure distributions; (3) all three exposure control strategies greatly improved the uniformity of item exposure without noticeably reducing measurement precision; and (4) MPI and RPG performed similarly in exposure control and both outperformed RT.
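As a rough illustration of the Bayesian D-optimality criterion mentioned above, the sketch below (Python, assuming a multidimensional 2PL model and a multivariate normal prior; the function names and toy parameters are hypothetical, not from the study) picks the unadministered item that maximizes the determinant of the prior precision plus the accumulated Fisher information.

```python
import numpy as np

def m2pl_prob(theta, a, d):
    """Probability of a correct response under the M2PL model."""
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

def item_information(theta, a, d):
    """Fisher information matrix contributed by one M2PL item at theta."""
    p = m2pl_prob(theta, a, d)
    return p * (1.0 - p) * np.outer(a, a)

def select_item_d_optimal(theta_hat, a_mat, d_vec, administered, prior_prec):
    """Pick the unadministered item maximizing det(prior + accumulated + candidate info)."""
    base = prior_prec.copy()
    for j in administered:
        base += item_information(theta_hat, a_mat[j], d_vec[j])
    best_j, best_det = None, -np.inf
    for j in range(len(d_vec)):
        if j in administered:
            continue
        det = np.linalg.det(base + item_information(theta_hat, a_mat[j], d_vec[j]))
        if det > best_det:
            best_j, best_det = j, det
    return best_j

# Toy example: 2 dimensions, 5 items
rng = np.random.default_rng(0)
a_mat = rng.uniform(0.5, 2.0, size=(5, 2))   # discrimination vectors
d_vec = rng.normal(0, 1, size=5)             # intercepts
theta_hat = np.zeros(2)                      # provisional ability estimate
print(select_item_d_optimal(theta_hat, a_mat, d_vec, administered={0}, prior_prec=np.eye(2)))
```

The V1/V2 criteria described in the abstract replace the determinant with the error variance of a weighted composite score; a sketch of that criterion appears under item 9 below.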

2.
This research examines the processes respondents use to answer personality test items. A total of 158 true/false items from four scales of the Personality Research Form and the California Psychological Inventory were used as stimuli. University students (N = 120) responded to each item and indicated one of nine strategies used in deciding on a response. Obtained response strategy ratings for items were reliable and their frequencies corresponded closely to previous findings with other items. Subsequently, the associations between item response strategy frequencies and item-total correlations were computed. Congruent with previous research, better items avoided behaviours or experiences and evoked responding based on traits and on referring to the statements of others. The associations between item response strategies and other indices of item quality are discussed and implications regarding scale development are offered.

3.
Simulations were conducted to examine the effect of differential item functioning (DIF) on measurement consequences such as total scores, item response theory (IRT) ability estimates, and test reliability in terms of the ratio of true-score variance to observed-score variance and the standard error of estimation for the IRT ability parameter. The objective was to provide bounds of the likely DIF effects on these measurement consequences. Five factors were manipulated: test length, percentage of DIF items per form, item type, sample size, and level of group ability difference. Results indicate that the greatest DIF effect was less than 2 points on the 0 to 60 total score scale and about 0.15 on the IRT ability scale. DIF had a limited effect on the ratio of true-score variance to observed-score variance, but its influence on the standard error of estimation for the IRT ability parameter was evident for certain ability values.
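To make the reported effect sizes concrete, the following sketch (Python; the 2PL parameters, the 0.5 difficulty shift, and the 10% DIF rate are illustrative assumptions, not the article's design) computes the expected total-score gap between reference and focal examinees of equal ability on a 60-item test with uniform DIF on a handful of items.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

rng = np.random.default_rng(1)
n_items = 60
a = rng.uniform(0.8, 2.0, n_items)
b = rng.normal(0.0, 1.0, n_items)

# Introduce uniform DIF on 10% of the items: harder by 0.5 for the focal group.
b_focal = b.copy()
dif_items = rng.choice(n_items, size=n_items // 10, replace=False)
b_focal[dif_items] += 0.5

theta_grid = np.linspace(-3, 3, 13)
expected_ref = np.array([p_2pl(t, a, b).sum() for t in theta_grid])
expected_focal = np.array([p_2pl(t, a, b_focal).sum() for t in theta_grid])

# Expected total-score impact of DIF at matched ability levels
for t, d in zip(theta_grid, expected_ref - expected_focal):
    print(f"theta = {t:+.1f}: expected score difference = {d:.2f}")
```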

4.
The authors modeled sources of error variance in job specification ratings collected from 3 levels of raters across 5 organizations (N=381). Variance components models were used to estimate the variance in ratings attributable to true score (variance between knowledge, skills, abilities, and other characteristics [KSAOs]) and error (KSAO-by-rater and residual variance). Subsequent models partitioned error variance into components related to the organization, position level, and demographic characteristics of the raters. Analyses revealed that the differential ordering of KSAOs by raters was not a function of these characteristics but rather was due to unexplained rating differences among the raters. The implications of these results for job specification and validity transportability are discussed.

5.
Patterns of ratings using the Q-Sort method and the Likert-type method are compared. Ordering effects are found in Q-Sort ratings that are not present in Likert-type ratings. Specifically, item order is related to both item variance and item placement, such that items appearing near the end of the Q-Sort have less variance and more central placement. This finding is verified across three measures in several datasets spanning nearly 20 years of research. Such item order effects appear to attenuate average absolute relationships (covariances and correlations) between items appearing near the end of the Q-Sort and other measures. Randomization of items may be (in some situations) a viable course of action to minimize these effects at a sample level.

6.
This study used the Chinese revised version of the Rosenberg Self-Esteem Scale (RSES-R) to examine how well the random intercept factor analysis model controls for item wording effects. A questionnaire consisting of the RSES-R and an overclaiming questionnaire was administered to 621 middle school students. Results showed that the random intercept model yielded good fit indices and reasonable factor variances and loadings, and the self-esteem factor scores correlated very highly with RSES-R total scores, indicating that the model can effectively separate trait and wording effects in RSES-R scores. The separated wording-effect factor scores had a significant but weak correlation with respondents' self-enhancement, suggesting that wording effects share a common component with respondents' social desirability.

7.
Criterion measures are frequently obtained by averaging ratings, but the number and kind of ratings available may differ from individual to individual. This raises issues as to the appropriateness of any single regression equation, about the relation of variance about regression to number and kind of criterion observations, and about the preferred estimate of regression parameters. It is shown that if criterion ratings all have the same true score the regression equation for predicting the average is independent of the number and kind of criterion scores averaged. Two cases are distinguished, one where criterion measures are assumed to have the same true score, and the other where criterion measures have the same magnitude of error of measurement as well. It is further shown that the variance about regression is a function of the number and kind of criterion ratings averaged, generally decreasing as the number of measures averaged increases. Maximum likelihood estimates for the regression parameters are derived for the two cases, assuming a joint normal distribution for predictors and criterion average within each subpopulation of persons for whom the same type of criterion average is available.
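A small Monte Carlo sketch (Python) can illustrate the first case, where all ratings share the same true score: the regression slope for predicting the criterion average is unchanged by the number of ratings averaged, while the variance about regression shrinks toward the true-score residual variance as more ratings are averaged. All numeric values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
sigma_e = 1.0                                  # measurement error SD of a single rating

x = rng.normal(size=n)                         # predictor
t = 0.6 * x + rng.normal(scale=0.8, size=n)    # criterion true score

for k in (1, 2, 4, 8):
    # Average of k ratings, each sharing the same true score
    ratings = t[:, None] + rng.normal(scale=sigma_e, size=(n, k))
    y_bar = ratings.mean(axis=1)
    slope, intercept = np.polyfit(x, y_bar, 1)
    resid_var = np.var(y_bar - (slope * x + intercept))
    print(f"k={k}: slope={slope:.3f}, residual variance={resid_var:.3f} "
          f"(theory: {0.8**2 + sigma_e**2 / k:.3f})")
```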

8.
Using a field sample of peers and subordinates, the current study employed generalizability theory to estimate sources of systematic variability associated with both developmental and administrative ratings (variance due to items, raters, etc.) and then used these values to estimate the dependability (i.e., reliability) of the performance ratings under various conditions. Results indicated that the combined rater and rater-by-ratee interaction effect and the residual effect were substantially larger than the person effect (i.e., object of measurement) for both rater sources across both purpose conditions. For subordinates, the person effect accounted for a significantly greater percentage of total variance in developmental ratings than in administrative ratings; however, no differences were observed for peer ratings as a function of rating purpose. These results suggest that subordinate ratings are of significantly better quality when made for developmental than for administrative purposes, but the same is not true for peer ratings.
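For readers unfamiliar with how such dependability estimates are assembled, the sketch below (Python) computes a Phi (absolute-error dependability) coefficient from variance components for a fully crossed person x rater x item design. The component values are invented for illustration and are not those reported in the study.

```python
# Dependability (Phi) coefficient for a p x r x i random-effects G study.
# Variance component values below are illustrative, not from the article.
var_components = {
    "person": 0.20,          # object of measurement (universe score variance)
    "rater": 0.05,
    "item": 0.10,
    "person_x_rater": 0.30,  # combined rater and rater-by-ratee effect
    "person_x_item": 0.08,
    "rater_x_item": 0.02,
    "residual": 0.40,
}

def phi_coefficient(vc, n_raters, n_items):
    """Absolute-error dependability for the mean over n_raters and n_items."""
    universe = vc["person"]
    abs_error = (vc["rater"] / n_raters
                 + vc["item"] / n_items
                 + vc["person_x_rater"] / n_raters
                 + vc["person_x_item"] / n_items
                 + vc["rater_x_item"] / (n_raters * n_items)
                 + vc["residual"] / (n_raters * n_items))
    return universe / (universe + abs_error)

for n_r in (1, 3, 5):
    print(f"{n_r} raters, 10 items: Phi = {phi_coefficient(var_components, n_r, 10):.3f}")
```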

9.
Lihua Yao, Psychometrika, 2012, 77(3): 495-523
Multidimensional computer adaptive testing (MCAT) can provide higher precision and reliability or reduce test length when compared with unidimensional CAT or with the paper-and-pencil test. This study compared five item selection procedures in the MCAT framework for both domain scores and overall scores through simulation by varying the structure of item pools, the population distribution of the simulees, the number of items selected, and the content area. The existing procedures such as Volume (Segall in Psychometrika, 61:331-354, 1996), Kullback-Leibler information (Veldkamp & van der Linden in Psychometrika 67:575-588, 2002), Minimize the error variance of the linear combination (van der Linden in J. Educ. Behav. Stat. 24:398-412, 1999), and Minimum Angle (Reckase in Multidimensional item response theory, Springer, New York, 2009) are compared to a new procedure, Minimize the error variance of the composite score with the optimized weight, proposed for the first time in this study. The intent is to find an item selection procedure that yields higher precisions for both the domain and composite abilities and a higher percentage of selected items from the item pool. The comparison is performed by examining the absolute bias, correlation, test reliability, time used, and item usage. Three sets of item pools are used with the item parameters estimated from real live CAT data. Results show that Volume and Minimum Angle performed similarly, balancing information for all content areas, while the other three procedures performed similarly, with a high precision for both domain and overall scores when selecting items with the required number of items for each domain. The new item selection procedure has the highest percentage of item usage. Moreover, for the overall score, it produces similar or even better results compared to those from the method that selects items favoring the general dimension using the general model (Segall in Psychometrika 66:79-97, 2001); the general dimension method has low precision for the domain scores. In addition to the simulation study, the mathematical theories for certain procedures are derived. The theories are confirmed by the simulation applications.
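The "minimize the error variance of the composite score" idea can be sketched as follows (Python, assuming an M2PL model and a fixed weight vector; all names and values are illustrative): the candidate item chosen is the one that minimizes w'I(theta)^(-1)w once its information is added. This fixed-weight form corresponds to the linear-combination criterion; the article's new procedure additionally optimizes the weights.

```python
import numpy as np

def item_info(theta, a, d):
    """Fisher information matrix of one M2PL item at theta."""
    p = 1.0 / (1.0 + np.exp(-(a @ theta + d)))
    return p * (1.0 - p) * np.outer(a, a)

def select_min_composite_error(theta_hat, a_mat, d_vec, administered, weights, prior_prec):
    """Pick the unadministered item minimizing the approximate error variance
    of the composite score w'theta, i.e. w' [information]^{-1} w."""
    base = prior_prec + sum(item_info(theta_hat, a_mat[j], d_vec[j]) for j in administered)
    best_j, best_var = None, np.inf
    for j in range(len(d_vec)):
        if j in administered:
            continue
        info = base + item_info(theta_hat, a_mat[j], d_vec[j])
        err_var = weights @ np.linalg.inv(info) @ weights
        if err_var < best_var:
            best_j, best_var = j, err_var
    return best_j

rng = np.random.default_rng(3)
a_mat = rng.uniform(0.5, 2.0, size=(6, 2))   # toy discrimination vectors
d_vec = rng.normal(size=6)                   # toy intercepts
w = np.array([0.5, 0.5])                     # equal weights over two domains
print(select_min_composite_error(np.zeros(2), a_mat, d_vec, {0, 1}, w, np.eye(2)))
```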

10.
This study found mixed support for the hypothesis that the difference in criterion-related validity between unstructured and structured employment interviews is due solely to the greater reliability of structured interviews. Using data from prior meta-analyses, this hypothesis was tested in 4 data sets by using standard psychometric procedures to remove the effects of measurement error in interview scores from correlations with rated job performance and training performance. In the 1st data set, support was found for this hypothesis. However, in a 2nd data set structured interviews had higher true score correlations with performance ratings, and in 2 other data sets unstructured interviews had higher true score correlations. We also found that averaging across 3 to 4 independent unstructured interviews provides the same level of validity for predicting job performance as a structured interview administered by a single interviewer. Practical and theoretical implications are discussed.
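The kind of corrections described can be illustrated with the classical correction for attenuation and the Spearman-Brown formula. The Python sketch below uses illustrative numbers, not the article's data.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for measurement error in both variables."""
    return r_xy / math.sqrt(rel_x * rel_y)

def spearman_brown(rel_single, k):
    """Reliability of the average of k parallel measurements (e.g., k interviews)."""
    return k * rel_single / (1.0 + (k - 1) * rel_single)

# Illustrative values only:
r_obs = 0.20            # observed interview-performance correlation
rel_interview = 0.45    # reliability of a single unstructured interview
rel_criterion = 0.52    # reliability of the performance-rating criterion

print("true-score correlation:", round(disattenuate(r_obs, rel_interview, rel_criterion), 3))
for k in (1, 3, 4):
    print(f"reliability of the average of {k} interviews:",
          round(spearman_brown(rel_interview, k), 3))
```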

11.
A database integrating 90 years of empirical studies reporting intercorrelations among rated job performance dimensions was used to test the hypothesis of a general factor in job performance. After controlling for halo error and 3 other sources of measurement error, there remained a general factor in job performance ratings at the construct level accounting for 60% of total variance. Construct-level correlations among rated dimensions of job performance were substantially inflated by halo for both supervisory (33%) and peer (63%) intrarater correlations. These findings have important implications for the measurement of job performance and for theories of job performance.

12.
Standard factorial designs in psycholinguistics have been complemented recently by large-scale databases providing empirical constraints at the level of item performance. At the same time, the development of precise computational architectures has led modelers to compare item-level performance with item-level predictions. It has been suggested, however, that item performance includes a large amount of undesirable error variance that should be quantified to determine the amount of reproducible variance that models should account for. In the present study, we provide a simple and tractable statistical analysis of this issue. We also report practical solutions for estimating the amount of reproducible variance for any database that conforms to the additive decomposition of the variance. A new empirical database consisting of the word identification times of 140 participants on 120 words is then used to test these practical solutions. Finally, we show that increases in the amount of reproducible variance are accompanied by the detection of new sources of variance.
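One practical way to estimate the reproducible portion of item-level variance, consistent with the additive decomposition described above, is split-half resampling over participants with a Spearman-Brown correction. The Python sketch below uses simulated data with made-up variance values; it illustrates the general idea rather than the authors' exact estimator.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated item-by-participant RT matrix (items are the objects of interest).
n_items, n_subjects = 120, 140
item_effect = rng.normal(600, 60, size=n_items)                               # reproducible item variance
data = item_effect[:, None] + rng.normal(0, 120, size=(n_items, n_subjects))  # error variance

def reproducible_variance_estimate(data, n_splits=200):
    """Split participants at random, correlate the two sets of item means,
    and apply the Spearman-Brown correction back to the full sample size."""
    n_items, n_subjects = data.shape
    estimates = []
    for _ in range(n_splits):
        perm = rng.permutation(n_subjects)
        half = n_subjects // 2
        m1 = data[:, perm[:half]].mean(axis=1)
        m2 = data[:, perm[half:2 * half]].mean(axis=1)
        r = np.corrcoef(m1, m2)[0, 1]
        estimates.append(2 * r / (1 + r))   # Spearman-Brown, doubling the sample
    return float(np.mean(estimates))

print("estimated proportion of reproducible variance in item means:",
      round(reproducible_variance_estimate(data), 3))
```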

13.
This study investigated the psychometric properties of three methods of scoring a Mixed Standard Scale (MSS) performance evaluation: the patterned procedure as corrected by Saal (1979); a simple nonpatterned scoring procedure suggested by Prien, Jones, and Miller (1977), which gives equal weights to the performance statements; and a procedure that assigned differential weights to each statement on the basis of scale values provided by a panel of subject matter experts. Interrater reliabilities, scale variances for averaged ratings, and a convergent/discriminant validity analysis, which included an alternate method of job skill ratings, indicated no difference in the score distribution variance, interrater reliability, or validity of different method scores.

14.
Relations between constructs are estimated based on correlations between measures of constructs corrected for measurement error. This process assumes that the true scores on the measure are linearly related to construct scores, an assumption that may not hold. We examined the extent to which differences in distribution shape reduce the correlation between true scores on a measure and scores on the underlying construct they are intended to measure. We found, via a series of Monte Carlo simulations, that when the actual construct distribution is normal, nonnormal distributions of true scores caused this correlation to drop by an average of only .02 across 15 conditions. When both construct and true score distributions assumed different combinations of nonnormal distributions, the average correlation was reduced by .05 across 375 conditions. We conclude that theory-based scales intended to measure constructs usually correlate highly with the constructs they are constructed to measure. We show that, as a result, in most cases true score correlations only modestly underestimate correlations between different constructs. However, in cases in which the two constructs are redundant, this underestimation can lead to the false conclusion that the constructs are 'correlated but distinct constructs,' resulting in construct proliferation.
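A compact version of such a Monte Carlo check can be written in a few lines of Python: draw normal construct scores, pass them through monotone but nonlinear transforms to create nonnormal true scores, and observe how little the construct-true-score correlation drops. The transforms and sample size below are arbitrary illustrations, not the authors' conditions.

```python
import numpy as np

rng = np.random.default_rng(5)
construct = rng.normal(size=1_000_000)   # construct scores: standard normal

# True scores as monotone but nonlinear transforms of the construct,
# producing skewed (nonnormal) true-score distributions.
transforms = {
    "mild skew": lambda z: np.exp(0.3 * z),
    "strong skew": lambda z: np.exp(0.8 * z),
    "floor effect": lambda z: np.maximum(z, -1.0),
}

for name, f in transforms.items():
    true_score = f(construct)
    skew = ((true_score - true_score.mean()) ** 3).mean() / true_score.std() ** 3
    r = np.corrcoef(construct, true_score)[0, 1]
    print(f"{name:>12}: skew = {skew:+.2f}, construct-true score correlation = {r:.3f}")
```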

15.
Human Performance, 2013, 26(1): 19-35
Investigations of the construct-related evidence of the validity of performance ratings have been rare, perhaps because researchers are dissuaded by the considerable amount of evidence needed to show construct validity (Landy, 1986). It is argued that generalizability (G) theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) is well-suited to investigations of construct-related evidence of validity because a single generalizability investigation may provide multiple inferences of validity. G theory permits the researcher to partition observed score variance into universe (true) score variance and multiple, distinct estimates of error variance. G theory was illustrated through the analysis of proficiency ratings of 256 Air Force jet engine mechanics. Mechanics were rated on three different rating forms by themselves, peers, and supervisors. Interpretation of G study variance components revealed suitable evidence of construct validity. Ratings within sources were reliable. Proficiency ratings showed strong convergence over rating forms, though not over rating sources. Raters showed adequate discriminant validity across rating dimensions. The expectation of convergence over sources was further questioned.

16.
Spieler and Balota (1997) showed that connectionist models of reading account for relatively little item-specific variance. In assessing this finding, it is important to recognize two factors that limit how much variance such models can possibly explain. First, item means are affected by several factors that are not addressed in existing models, including processes involved in recognizing letters and producing articulatory output. These limitations point to important areas for future research but have little bearing on existing theoretical claims. Second, the item data include a substantial amount of error variance that would be inappropriate to model. Issues concerning comparisons between simulation data and human performance are discussed with an emphasis on the importance of evaluating models at a level of specificity ("grain") appropriate to the theoretical issues being addressed.

17.
This paper describes a study examining the impact of item order in personality measurement on reliability, measurement equivalence and scale-level correlations. A large sample of university students completed one of three forms of the International Personality Item Pool version of the Big Five personality inventory: items sorted at random, items sorted by factor, and items cycled through factors. Results showed that the underlying measurement model and the internal consistency of the IPIP-Big Five scale were unaffected by differences in item order. Also, most of the scale-level correlations among factors were not significantly different across forms. Implications for the administration of tests and interpretation of test scores are discussed, and future research directions are offered.

18.
The Deterministic, Gated Item Response Theory Model (DGM; Shu, unpublished dissertation, The University of North Carolina at Greensboro, 2010) is proposed to identify cheaters who obtain significant score gain on tests due to item exposure/compromise by conditioning on the item status (exposed or unexposed items). A “gated” function is introduced to decompose the observed examinees’ performance into two distributions (the true ability distribution determined by examinees’ true ability and the cheating distribution determined by examinees’ cheating ability). Test cheaters who have score gain due to item exposure are identified through the comparison of the two distributions. Hierarchical Markov Chain Monte Carlo is used as the model’s estimation framework. Finally, the model is applied to a real data set to illustrate how the model can be used to identify examinees having pre-knowledge of the exposed items.

19.

Recent trends indicate that organizations will continue their strategic pursuit of teamwork for the foreseeable future, which will create a need for accurate assessments of individuals’ performance in teams. Although individual behaviors can be perceived and assessed by fellow team members (i.e., peers), the extent to which the team shapes perceivers’ judgments versus the target’s behavior is unclear. We conducted two studies to understand how and why team context influences peer ratings of individual performance. In study 1, we conducted cross-classified modeling on a sample of 7160 performance observations of 568 targets made by 567 perceivers, who were each members of four separate teams. Results indicated that team membership accounted for a substantially higher proportion of perceiver, relative to target, variance. In study 2, we conducted social relations modeling with a sample of 679 performance observations collected from 217 individuals nested in 46 teams to test the effects of psychological safety on perceiver, target, and team variance components. Perceptions of psychological safety accounted for proportionally larger perceiver, relative to target, variance in OCB and task performance ratings. Altogether, team context appears to affect perceivers’ judgments of behavior more than the target’s behavior itself, implying that peer ratings sourced from different teams may not be comparable. We consider the implications for the collection and interpretation of peer performance ratings in teams and the potential implications for social cognitive theory, such that certain aspects of the team context, including psychological safety, may act as a cognitive heuristic by molding perceiver judgments of targets.


20.
Psychometric characteristics of the Adaptive Behavior Inventory for Children (ABIC) are analyzed through five statistical procedures (internal consistency, item difficulty, correlations of item-total correlations, concurrent validity, and construct validity) using data on 436 elementary age children from three racial-ethnic and two social class groups. Data from these five statistical procedures are reported for nine demographic characteristics: children's race, social class, sex, age, birth order, health, family size, family structure, and urban acculturation. Few systematic differences are apparent on internal consistency, correlation of item-total correlations, and construct validity. Some differences are apparent on item difficulty and concurrent validity. On item difficulty the ABIC scores are higher for middle-SES, older, first- or second-born children, and for children from families whose structures are more typical. Regarding concurrent validity, lower correlations are noted for Mexican American and Black, for more healthy, and for less acculturated children. ABIC-achievement correlations generally are too low to be of practical value. The results are interpreted in terms of possible test bias on the ABIC.
