Similar literature
 20 similar documents found (search time: 46 ms)
1.
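Ye Baojuan, Wen Zhonglin. Acta Psychologica Sinica, 2012, 44(12): 1687-1694
When deciding whether to combine the scores of a multidimensional test into a total score, test homogeneity should be considered. If homogeneity is too low, a composite total score has little meaning. The degree of homogeneity can be measured by a homogeneity coefficient. The model used to compute the homogeneity coefficient is the bifactor model, which contains both a general factor and group (local) factors and has drawn attention in recent years; the homogeneity coefficient of a test is defined as the proportion of test score variance that is accounted for by the variance of the general factor score. This article uses the Delta method to derive a formula for the standard error of the homogeneity coefficient and then computes its confidence interval. A simple program for computing the homogeneity coefficient and its confidence interval is provided. An example illustrates how to estimate the homogeneity coefficient and its confidence interval, and a simulation comparing the confidence intervals obtained with the Delta method and with the Bootstrap method finds that the two differ very little.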
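The definition in this abstract (general-factor variance as a share of total-score variance) can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes a bifactor model with orthogonal, unit-variance factors and a unit-weighted total score, and the function name, argument names, and loadings are hypothetical. It does not reproduce the Delta-method standard error.

```python
def homogeneity_coefficient(general_loadings, group_loadings, unique_variances):
    """Proportion of total-score variance due to the general factor.

    Assumes orthogonal factors with unit variance (an assumption of this sketch).
    group_loadings is a list of lists, one inner list per group factor.
    """
    general_var = sum(general_loadings) ** 2
    group_var = sum(sum(block) ** 2 for block in group_loadings)
    error_var = sum(unique_variances)
    return general_var / (general_var + group_var + error_var)


# Hypothetical 6-item test: two group factors with three items each.
general = [0.6] * 6
groups = [[0.4] * 3, [0.4] * 3]
unique = [1 - 0.6 ** 2 - 0.4 ** 2] * 6  # standardized items

print(round(homogeneity_coefficient(general, groups, unique), 3))
```

With these made-up loadings the general factor accounts for roughly 69% of the total-score variance.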

2.
In long‐term studies of psychological development, the initial assessment of etiologically significant child behaviours is often carried out at a single point in time only. However, one‐time assessments of behaviour are likely to possess limited reliability, leading to attenuated longitudinal correlation coefficient magnitudes. How much this bias might have affected behavioural continuity estimates in longitudinal research is presently unknown. Using a data set from the Mauritius Child Health Project, we particularize the attenuating effects of single‐occasion behavioural assessments on consistency estimates of impulsive–aggressive behaviour over time. Specifically, two nursery teachers provided 15 consecutive weekly ratings of the aggressive behaviour of 99 four‐year‐old children. The same children were reassessed for the presence of externalizing behaviour problems at the ages of 8 and 10. There were substantial increases in both reliability and predictive correlation coefficient magnitudes when the preschool scores were aggregated across several weekly ratings. A further increase resulted after the two outcome assessments were combined into a composite score of school‐age externalizing symptoms. A generalized procedure, developed from the correction for attenuation formula, is introduced to describe the relation of aggregation to predictive validity in longitudinal research.
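The link between aggregation and predictive correlations that this abstract describes can be illustrated with two textbook formulas. This is only a sketch under stated assumptions: it uses the Spearman-Brown formula (which assumes parallel ratings) together with the classical attenuation relation, and it is not confirmed to be the generalized procedure of the paper. All numbers are invented for illustration.

```python
import math


def spearman_brown(reliability, k):
    """Reliability of the mean of k parallel ratings."""
    return k * reliability / (1 + (k - 1) * reliability)


def expected_observed_r(true_r, rel_x, rel_y):
    """Classical attenuation: observed r = true r * sqrt(rel_x * rel_y)."""
    return true_r * math.sqrt(rel_x * rel_y)


single_rel = 0.40                       # hypothetical reliability of one weekly rating
aggregated_rel = spearman_brown(single_rel, 15)

# Hypothetical true correlation 0.50 and outcome reliability 0.80.
print(round(expected_observed_r(0.50, single_rel, 0.80), 3))
print(round(expected_observed_r(0.50, aggregated_rel, 0.80), 3))
```

Under these assumptions the expected observed correlation rises from about 0.28 for a single rating to about 0.43 for the 15-week aggregate.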

3.
Suppose one has a battery of K subtests and a composite for the battery is defined as the mean of the K standardized subtest scores. An individual's single-subtest deviation score is the difference between the individual's score on any single subtest and his composite score. A cluster deviation score is the difference between an examinee's average for a small set (cluster) of subtests and his composite. Formulas are given for the test of statistical significance of the individual's subtest or cluster deviation score and the internal consistency reliability of such deviation scores.

4.
This article introduces new statistics for evaluating score consistency. Psychologists usually use correlations to measure the degree of linear relationship between 2 sets of scores, ignoring differences in means and standard deviations. In medicine, biology, chemistry, and physics, a more stringent criterion is often used: the extent to which scores are identically equal. For each test taker (or other unit of measurement), the difference between the 2 scores is calculated. The root mean square difference (RMSD) represents the average change from 1 set of scores to the other, and the concordance correlation coefficient (CCC) rescales this coefficient to have a maximum value of 1. This article shows the relationship of the RMSD and CCC to the intraclass correlation coefficients, product-moment correlation, and standard error of measurement. Finally, this article adapts the RMSD and the CCC for linear, consistency, and absolute definitions of agreement.
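The two statistics named in this abstract can be sketched as below. The code uses the standard (absolute-agreement) concordance correlation coefficient with population moments; the adapted linear and consistency variants mentioned in the abstract are not reproduced, and the function names and data are hypothetical.

```python
import math


def rmsd(x, y):
    """Root mean square difference between paired scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def concordance_correlation(x, y):
    """Standard concordance correlation coefficient (1/n moments)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


first = [1, 2, 3, 4]
second = [2, 3, 4, 5]   # same rank order, every score shifted up by 1
print(rmsd(first, second))
print(round(concordance_correlation(first, second), 3))
```

In this made-up example the product-moment correlation is 1, yet the RMSD is 1 and the CCC is about 0.714, because the constant shift in means counts against agreement.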

5.
The true intra‐individual change model is generalized by defining individual method effects. This allows the analysis of non‐congeneric test–retest variables assumed to measure a common, possibly (temporally) transient, attribute. Temporal change in the attribute between different times of measurement is modelled by the true‐change variable. Individual causal method effects, due to heterogeneity of the measurement methods, account for the imperfect correlation of the true‐score variables at each time of measurement. The reliability of the composite scores, at each time of measurement, and the reliability of the difference composite score may be estimated with appropriate coefficients derived from the model. Measurements of daily life tension in adult females serve to illustrate how the model can be used empirically.

6.
This paper studies the asymptotic distributions of three reliability coefficient estimates: Sample coefficient alpha, the reliability estimate of a composite score following a factor analysis, and the estimate of the maximal reliability of a linear combination of item scores following a factor analysis. Results indicate that the asymptotic distribution for each of the coefficient estimates, obtained based on a normal sampling distribution, is still valid within a large class of nonnormal distributions. Therefore, a formula for calculating the standard error of the sample coefficient alpha, recently obtained by van Zyl, Neudecker and Nel, applies to other reliability coefficients and can still be used even with skewed and kurtotic data such as are typical in the social and behavioral sciences. This research was supported by grants DA01070 and DA00017 from the National Institute on Drug Abuse and a University of North Texas faculty research grant. We would like to thank the Associate Editor and two reviewers for suggestions that helped to improve the paper.

7.
The ordinary gain score, g, is defined as g = x2 - x1, where x1 is the pretest score and x2 is the posttest score. The present study extends and refines previous research on the reliability and validity of gain scores. Using particular values as stated in the tables and graphs, the pre- and posttest reliabilities, pre- and posttest validities, ratios of pretest to posttest standard deviations, and correlations between the pretest and posttest were varied systematically to examine the effects of these parameter configurations on gain scores' reliability and validity. Results plotted graphically provide insight via visual interpretation not easily inferred using only values from a table. One interesting finding was that the reliability of a gain score can be at a maximum when the validity is at a minimum. Another is that a high correlation between pre- and posttest was beneficial to the validity of the gain score but detrimental to its reliability. By identifying the situations in which gain scores can be reliable and valid, findings inform researchers when gain scores should or should not be used.
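The dependence of gain-score reliability on the pretest-posttest correlation can be illustrated with the classical formula for the reliability of a difference score. This is a sketch that assumes uncorrelated measurement errors; the parameter values are invented and are not those tabulated in the study.

```python
def gain_score_reliability(sd_pre, sd_post, rel_pre, rel_post, r_pre_post):
    """Classical reliability of g = x2 - x1, assuming uncorrelated errors."""
    covariance_term = 2 * r_pre_post * sd_pre * sd_post
    true_var = sd_pre ** 2 * rel_pre + sd_post ** 2 * rel_post - covariance_term
    observed_var = sd_pre ** 2 + sd_post ** 2 - covariance_term
    return true_var / observed_var


# Hypothetical configuration: equal SDs, both reliabilities 0.80.
for r in (0.3, 0.5, 0.7):
    print(r, round(gain_score_reliability(1.0, 1.0, 0.80, 0.80, r), 3))
```

With these values the reliability of the gain score falls from about 0.71 to 0.60 to 0.33 as the pretest-posttest correlation rises from 0.3 to 0.5 to 0.7.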

8.
In the theory of test validity it is assumed that error scores on two distinct tests, a predictor and a criterion, are uncorrelated. The expected-value concept of true score in the classical test-theory model as formulated by Lord and Novick, Guttman, and others, implies mathematically, without further assumptions, that true scores and error scores are uncorrelated. This concept does not imply, however, that error scores on two arbitrary tests are uncorrelated, and an additional axiom of “experimental independence” is needed in order to obtain familiar results in the theory of test validity. The formulas derived in the present paper do not depend on this assumption and can be applied to all test scores. These more general formulas reveal some unexpected and anomalous properties of test validity and have implications for the interpretation of validity coefficients in practice. Under some conditions there is no attenuation produced by error of measurement, and the correlation between observed scores sometimes can exceed the correlation between true scores, so that the usual correction for attenuation may be inappropriate and misleading. Observed scores on two tests can be positively correlated even when true scores are negatively correlated, and the validity coefficient can exceed the index of reliability. In some cases of practical interest, the validity coefficient will decrease with increase in test length. These anomalies sometimes occur even when the correlation between error scores is quite small, and their magnitude is inversely related to test reliability. The elimination of correlated errors in practice will not enhance a test's predictive value, but will restore the properties of the validity coefficient that are familiar in the classical theory.

9.
Meta-interpretive reliability is a new method to evaluate the accuracy with which personality trait scores are communicated via interpretive statements in a computer-based test interpretation (CBTI). The prototypic experimental design is based on a two-way repeated measures analysis of variance (ANOVA); the two effects are personality traits and randomly chosen CBTI protocols. In this application, 101 psychologists read four examples of the Karson Clinical Report (KCR, Karson & O'Dell, 1975) and estimated the original trait scores from the Sixteen Personality Factor Questionnaire (16PF; Cattell, Eber, & Tatsuoka, 1970) on which the KCR is based. Estimated trait score variance was significantly related to the Trait × Protocol interaction and the main effects for personality trait and differences among protocols (ω² = .55). The total effect size corresponded to a multiple correlation of .74, suggesting that the KCR had acceptable meta-interpretive reliability. The protocol effect denoted a context effect created by the juxtaposition of several interpretive statements. Additional analyses showed that individual differences among raters contributed to less than 1% of the estimated standard ten (sten) score variance. Meta-interpretive reliability is proposed as an index of the upper limit of validity for CBTIs.

10.
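This article raises the question of how the reliability of difference scores varies and uses simulated data to analyze how that reliability changes under different conditions. The results show: (1) When the reliability coefficients of the two test scores are equal or close, the more the standard deviations of the two tests differ, the higher the reliability of the difference score. (2) When the reliability coefficients of the two test scores are unequal, the reliability of the difference score is also relatively high as long as one of the two administrations has both a higher reliability and a larger standard deviation than the other. (3) Whatever the relation between the reliabilities of the two tests, the lower the correlation between the two tests, the higher the reliability of the difference score.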

11.
Posner’s attention network model and Bundesen’s theory of visual attention (TVA) are two influential accounts of attention. Each model has led to the development of a test method: the attention network test (ANT) and TVA-based assessment, respectively. Both tests have been widely used to investigate attentional function in normal and clinical populations. Here we report on the first direct comparison of the ANT to TVA-based assessment. A group of 68 young healthy participants were tested in three consecutive sessions that each contained standard versions of the two tests. The parameters derived from TVA-based assessment had better internal reliability and retest reliability than did those of the standard version of the ANT, where only the executive network score reached comparable levels. However, when corrected for differences in test length, the retest reliability of the orienting network score equaled the least reliable TVA parameters. Both tests were susceptible to practice effects, which improved performance for some parameters while leaving others constant. All pairwise correlations between the eight attention parameters measured by the two tests were small and nonsignificant, with one exception: A strong correlation (r = 0.72) was found between two parameters of TVA-based assessment, visual processing speed and the capacity of visual short-term memory. We conclude that TVA-based assessment and the ANT measure complementary aspects of attention, but the scores derived from TVA-based assessment are more reliable.

12.
Although difference scores are widely used in classifying children as learning-disabled, their psychometric properties are often not well understood. Such scores generally contain more error than single test scores. Reliability and standard error of measurement figures for several combinations of ability and achievement measures are presented. The rates and types of errors that occur when such scores are used to classify children as learning-disabled are discussed. Three recommendations for using difference scores are given: (a) combinations of ability and achievement tests that yield difference score reliabilities higher than .80 should be used when classifying children; (b) scores should be reported as a band of scores (± one standard error of measurement) to inform decision-makers regarding the amount of error estimated to be in the score, and (c) the criterion score for classifying the learning disabled should be set after consideration of the rate and types of errors likely to occur.
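Recommendation (b), reporting a score as a band of plus or minus one standard error of measurement, can be sketched with the classical formula SEM = SD × sqrt(1 − reliability). The score, standard deviation, and reliability below are hypothetical and are not taken from the article.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def score_band(score, sd, reliability):
    """Band of one SEM on either side of the observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return score - sem, score + sem


# Hypothetical difference score of 20 with SD 15 and reliability .80.
low, high = score_band(20, 15, 0.80)
print(round(low, 1), round(high, 1))
```

Even at a reliability of .80, the band in this example runs from about 13.3 to 26.7, which shows how much error a single difference score can carry.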

13.
It is shown that the population-covariance matrix of a heterogeneous factor model may be indistinguishable from that of a standard factor model and that the standard likelihood-ratio goodness-of-fit statistic has but little power in detecting loading heterogeneity. The relation between loading heterogeneity and factor score reliability is studied and it is recommended that non-normality of the test-score distributions be tested to use factor scores with more confidence. Substantive justifications for the model assumptions and model-based methods to test specific hypotheses about the loading distribution, are discussed.

14.
Critics of Kinesthetic Aftereffect (KAE) recommend abandoning it as a personality measure largely because of poor test-retest reliability. Although no test can be valid if lacking true reliability, to discard a measure because of poor retest reliability is an oversimplification of validation procedures. This pitfall is exemplified here by a reexamination of KAE. KAE scores involve measures before (pretest) and after (test) aftereffect induction. Internal analysis of a KAE study showed: Differential bias is present; its locus is the second session pretest; its form makes second-session pretest scores functionally more similar to first- and second-session test scores and functionally more dissimilar to first-session pretest scores. Given this second session bias, the retest correlation tells us nothing about the true reliability of a one-session KAE score. However, if a measure possesses external validity, it must to some degree show true reliability. Based upon a literature review of one-session KAE validity studies, we conclude that one-session KAE scores are valid and hence show true reliability. KAE remains a promising personality measure.

15.
Horst, P. Psychometrika, 1948, 13(3): 125-134
A battery of pencil-and-paper tests is commonly used for predicting a single criterion. If the score on each test is the number of correct answers, the composite battery score would normally be the sum of the weighted test scores, where the weights are the raw score regression weights. Knowing the reliability of each test, it is possible to alter the lengths of the tests in a manner such that the weights will all be equal. The composite battery score would then simply be the total number of items answered correctly and scoring would be greatly simplified. Such simplification is particularly desirable where the volume of testing is large. Section I of the article outlines the procedure for altering the lengths of the tests, and Section II gives a proof of the method.

16.
This study compares two methods commonly used (concordance and prediction) to establish linkages between scores from tests of similar content given in different languages. Score linkages between the Verbal and Math sections of the SAT I and the corresponding sections of the Spanish-language admissions test, the Prueba de Aptitud Academica (PAA), are used to illustrate the issues. The results indicate that for both single and composite score linkage, prediction is preferable to concordance. A comparison of prediction and concordance results for composite scores versus single scores indicates that when verbal scores were added to math scores, the prediction for the resultant composite score is better than that obtained for verbal scores alone but worse than that obtained for math scores alone.

17.
Perfectionism has been identified as a common concern among clients who seek counseling services. For more than 20 years, the Frost Multidimensional Perfectionism Scale (F-MPS) has been used extensively to measure the construct of individuals' perfectionism. The current study used reliability generalization to identify the average score reliability as well as variables explaining the variability of score reliability. Typical reliability across subscale scores ranged from .71 to .86 with the Doubt about Action subscale showing the least variability and the Organization subscale showing the most. In addition, sex, language, and standard deviation of the scale had statistically significant relations to reliability estimates.

18.
Derivations are presented relating the length of a test to its weight in a composite. Tests of varying length are constructed so that their weights will be of predetermined magnitudes, and the results compared with expectations. Weighting schemes involving standard deviations of raw scores and of true scores are compared. An important secondary derivation is presented from which it is possible to estimate test reliability knowing only the relative length of a test, its shortened form, and the standard deviation of each.

19.
For item response theory (IRT) models, which belong to the class of generalized linear or non‐linear mixed models, reliability at the scale of observed scores (i.e., manifest correlation) is more difficult to calculate than latent correlation based reliability, but usually of greater scientific interest. This is not least because it cannot be calculated explicitly when the logit link is used in conjunction with normal random effects. As such, approximations such as Fisher's information coefficient, Cronbach's α, or the latent correlation are calculated, allegedly because it is easy to do so. Cronbach's α has well‐known and serious drawbacks, Fisher's information is not meaningful under certain circumstances, and there is an important but often overlooked difference between latent and manifest correlations. Here, manifest correlation refers to correlation between observed scores, while latent correlation refers to correlation between scores at the latent (e.g., logit or probit) scale. Thus, using one in place of the other can lead to erroneous conclusions. Taylor series based reliability measures, which are based on manifest correlation functions, are derived and a careful comparison of reliability measures based on latent correlations, Fisher's information, and exact reliability is carried out. The latent correlations are virtually always considerably higher than their manifest counterparts, Fisher's information measure shows no coherent behaviour (it is even negative in some cases), while the newly introduced Taylor series based approximations reflect the exact reliability very closely. Comparisons among the various types of correlations, for various IRT models, are made using algebraic expressions, Monte Carlo simulations, and data analysis. Given the light computational burden and the performance of Taylor series based reliability measures, their use is recommended.

20.
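Liu Yue, Liu Hongyun. Acta Psychologica Sinica, 2017, (9): 1234-1246
The bifactor model can contain one general factor and several group (local) factors at the same time, which gives it a distinctive advantage in describing the structure of multidimensional tests, and it has been applied increasingly widely in recent years. Based on the bifactor model, this article proposes four methods for composing total scores and dimension scores: the raw score method, the summation method, the general-item weighted summation method, and the local-item weighted summation method. Using simulation, it examines how these methods perform against the traditional multidimensional IRT method under varying sample size, test length, and correlation between dimensions. Finally, the results are verified in an empirical study. The results show: (1) The general weighted summation method and the local weighted summation method, especially the latter, yield total scores and dimension scores that are closest to the true values and have the highest reliability. (2) When the correlation between dimensions is high and the test is long, the local weighted summation method performs well and under some conditions even outperforms the multidimensional IRT method. (3) Only the dimension scores composed with the local weighted summation method reflect the true correlations between dimensions.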

