Similar articles
1.
Social scientists are frequently interested in assessing the qualities of social settings such as classrooms, schools, neighborhoods, or day care centers. The most common procedure requires observers to rate social interactions within these settings on multiple items and then to combine the item responses to obtain a summary measure of setting quality. A key aspect of the quality of such a summary measure is its reliability. In this paper we derive a confidence interval for reliability, a test for the hypothesis that the reliability meets a minimum standard, and the power of this test against alternative hypotheses. Next, we consider the problem of using data from a preliminary field study of the measurement procedure to inform the design of a later study that will test substantive hypotheses about the correlates of setting quality. The preliminary study is typically called the "generalizability study" or "G study" while the later, substantive study is called the "decision study" or "D study." We show how to use data from the G study to estimate reliability, a confidence interval for the reliability, and the power of tests for the reliability of measurement produced under alternative designs for the D study. We conclude with a discussion of sample size requirements for G studies.
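As a rough illustration of how G-study results can inform a D-study design, the sketch below projects the reliability of a setting-level mean as the number of observations per setting changes. It assumes a simplified one-facet design with only two variance components (between-setting variance and a single pooled error variance). The function name `projected_reliability` and the example values are hypothetical and are not taken from the paper, whose designs, confidence intervals, and power calculations are more general.

```python
def projected_reliability(var_setting, var_error, n_obs):
    """Reliability of a setting mean based on n_obs observations per setting.

    Assumes a one-facet design: true between-setting variance var_setting
    and a single pooled error variance var_error (both estimated in a G study).
    """
    if n_obs < 1:
        raise ValueError("n_obs must be at least 1")
    return var_setting / (var_setting + var_error / n_obs)


# Hypothetical G-study variance components (illustrative values only).
var_setting, var_error = 1.0, 3.0
for n_obs in (1, 3, 9):
    print(n_obs, round(projected_reliability(var_setting, var_error, n_obs), 3))
```

Under these assumptions, averaging over more observations shrinks the error term, so projected reliability rises with `n_obs` but with diminishing returns.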

2.
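Estimating test reliability: from coefficient α to internal consistency reliability   Total citations: 5 (self-citations: 0, citations by others: 5)
Wen Zhonglin (温忠麟), Ye Baojuan (叶宝娟), Acta Psychologica Sinica (《心理学报》), 2011, 43(7): 821-829
Following the classical definition of test reliability, this article briefly reviews the relationship between reliability and coefficient α and the limitations of coefficient α. To recommend reliability estimators that can replace coefficient α, it discusses in depth two quantities closely related to coefficient α: homogeneity reliability and internal consistency reliability. Under very general conditions, it is proved that neither coefficient α nor homogeneity reliability exceeds internal consistency reliability, and that the latter does not exceed test reliability, which shows that internal consistency reliability is relatively close to test reliability. A workflow for test reliability analysis is summarized, indicating when coefficient α still has reference value and when it no longer applies and internal consistency reliability (often called composite reliability in the literature) should be used instead. Programs for computing homogeneity reliability and internal consistency reliability are provided, which practitioners can apply directly.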
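The ordering between coefficient α and composite reliability mentioned in this abstract can be checked numerically in a simple special case. The sketch below is not the authors' program. It assumes a single-factor (congeneric) model with factor variance 1 and uncorrelated errors, and the function names and loading values are hypothetical.

```python
def implied_covariance(loadings, error_vars):
    """Model-implied item covariance matrix: lambda * lambda' + diag(theta)."""
    k = len(loadings)
    return [
        [loadings[i] * loadings[j] + (error_vars[i] if i == j else 0.0)
         for j in range(k)]
        for i in range(k)
    ]


def cronbach_alpha(cov):
    """Coefficient alpha computed from an item covariance matrix."""
    k = len(cov)
    total = sum(sum(row) for row in cov)      # variance of the sum score
    trace = sum(cov[i][i] for i in range(k))  # sum of item variances
    return k / (k - 1) * (1 - trace / total)


def composite_reliability(loadings, error_vars):
    """Composite reliability of the sum score under the one-factor model."""
    true_var = sum(loadings) ** 2
    return true_var / (true_var + sum(error_vars))


# Hypothetical standardized items with unequal loadings.
loadings = [0.9, 0.7, 0.5, 0.3]
error_vars = [1 - l ** 2 for l in loadings]
alpha = cronbach_alpha(implied_covariance(loadings, error_vars))
omega = composite_reliability(loadings, error_vars)
print(round(alpha, 3), round(omega, 3))
```

With unequal loadings, α falls below composite reliability in this model. With equal loadings the two coincide, which is one case where α remains informative.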

3.
Multiple‐choice tests are frequently used in personnel selection contexts to measure knowledge and abilities. Option weighting is an alternative multiple‐choice scoring procedure that awards partial credit for incomplete knowledge reflected in applicants’ distractor choices. We investigated whether option weights should be based on expert judgment or on empirical data when trying to outperform conventional number‐right scoring in terms of reliability and validity. To obtain generalizable results, we used repeated random sub‐sampling validation and found that empirical option weighting, but not expert option weighting, increased the reliability of a knowledge test. Neither option weighting procedure improved test validity. We recommend improving the reliability of existing ability and knowledge tests used for personnel selection by computing and publishing empirical option weights.

4.
5.
Green BF, Psychometrika, 1950, 15(3): 251-257
A procedure is proposed for testing the significance of group differences in the standard error of measurement of a psychological test. Wilks' criterion is used to assure that the tests used in ascertaining reliability and hence variance of errors of measurement may be assumed parallel for each group. Votaw's criterion may be used to check whether the test scores of all the groups have the same mean, variance, and covariance. It is possible, however, for the variance and reliability of the test to differ widely from group to group, so that Votaw's criterion is not satisfied even though the variance of errors of measurement stays relatively constant. For this case a modification of Neyman and Pearson's criterion is developed to test agreement among standard errors of measurement despite group differences in mean, variance, and reliability of the test. The author wishes to acknowledge the helpful criticisms of Dr. Harold Gulliksen, who suggested the problem.

6.
Reliability generalization (RG) is a meta-analytic technique that allows for the systematic examination of variation in score reliability for different samples of test takers; this procedure is based on the recognition that reliability is not a stable property of a test but is sample dependent. As a demonstration of an RG analysis, I obtained 63 reliability coefficients for each of the MMPI-2 (Butcher et al., 2001) Personality Psychopathology 5 (Harkness, McNulty, & Ben-Porath, 1995) scales. The overall variability of alpha coefficients supports the argument that reliability is sample dependent and underscores the need for researchers to calculate reliability estimates based on their research samples rather than simply citing published alpha coefficients as evidence of score reliability. I observed statistically significant mean reliability differences for scores across the 5 scales, with the highest level of reliability observed for scores on the measure of Negative Emotionality and the lowest levels of reliability observed for scores on the measures of Aggression and Disconstraint. There was no evidence that the sex-composition of a sample was systematically related to score reliability, and there were no statistically significant differences in reliability between scores obtained with the English version of the test and those obtained with translated forms. However, reliability was consistently lower for scores on some scales when the data were obtained in nonclinical settings as opposed to clinical ones. Sample size was not significantly correlated with reliability estimates. RG methods have the potential for deepening the level of understanding about the role of reliability in the evaluation and use of personality tests.

7.
In two experiments, we investigated the creation of conceptual analogies to a contrast between vowels. An ordering procedure was used to determine the reliability of simple sensory and abstract analogies to vowel contrasts composed by naive volunteers. The results indicate that test subjects compose stable and consistent analogies to a meaningless segmental linguistic contrast, some invoking simple and complex relational properties. Although in the literature of psychophysics such facility has been explained as an effect of sensory analysis, the present studies indicate the action of a far subtler and more versatile cognitive function akin to the creation of meaning in figurative language.

8.
Olfactometers have been gaining popularity as research tools, but they have yet to replace established testing procedures in a variety of laboratory and clinical settings, including absolute threshold tests. In this research, we designed and operated a simple olfactometer with which to assess threshold. To do this, we used a method-of-adjustment test that was compared to the three-alternative forced choice ascending sniff bottle staircase method, which is currently a standard threshold test procedure. We found that the olfactometer threshold test correlated highly with the staircase method, and that it possessed suitable test–retest reliability. The advantages of the olfactometer threshold test include faster test time and reduced cleaning and reassembly demands. Future use of the olfactometer in olfactory identification and/or detection thresholds amongst odors is also outlined.

9.
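Ye Baojuan (叶宝娟), Wen Zhonglin (温忠麟), Journal of Psychological Science (《心理科学》), 2013, 36(3): 728-733
In psychology, education, management, and related fields, researchers often encounter two-level (hierarchical) data structures, such as students nested within classes or employees nested within companies. In two-level studies, participants are usually not independent, and estimating reliability directly with a single-level formula overestimates test reliability. Earlier studies have discussed how to estimate the reliability of unidimensional tests more accurately in two-level research. This study points out the shortcomings of the existing estimation formulas, derives a new reliability formula using two-level confirmatory factor analysis, demonstrates the computation with an example, and provides a simple program.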

10.
Becker G, Psychological Methods, 2000, 5(3): 370-379
This article introduces a procedure for estimating reliability in which equivalent halves of a given test are systematically created and then administered a few days apart so that transient error can be included in the error calculus. The procedure not only estimates complete reliability (taking into account both specific-factor error and transient error) but also can estimate partial reliability (taking into account only specific-factor error). Scores from 6 different measuring instruments were analyzed with the procedure. The results indicate that the magnitude of transient error in real data can range from nonexistent to very large. It follows that traditional reliability estimates, using nonstaggered procedures, are inflated to the extent that transient error is present.

11.
We present a Finland-Swedish adaptation of the Sweden-Swedish group screening test for dyslexia for adults and young adults DUVAN (Lundberg & Wolff, 2003) together with normative data from 143 Finland-Swedish university students. The test is based on the widely held phonological deficit hypothesis of dyslexia and consists of a self-report and five subtests tapping phonological working memory, phonological representation, phonological awareness, and orthographic skill. We describe the test adaptation procedure and show that the internal reliability of the new test version is comparable to the original one. Our results indicate that the language background (Swedish, Finnish, early simultaneous Swedish-Finnish bilingualism) should be taken into account when interpreting the results on the Finland-Swedish DUVAN test. We show that the FS-DUVAN differentiates a group of students with dyslexia diagnosis from normals, and that a low performance on the FS-DUVAN correlates with a positive self-report on familial dyslexia and with a history of special education in school. Finally, we analyze the sensitivity and specificity of the FS-DUVAN for dyslexia among university students.

12.
We developed a paper test utilizing a mechanism for measuring implicit association similar to that used in the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998). The target concepts were buried among positive and negative words on a piece of paper. Examinees marked the targets as "bad" or "good" in one task and conversely in the other, along with the evaluative words. Instead of reaction times, we counted the number of words marked in 20 sec for each task. This procedure allowed group administration. We calculated the implicit measure using the difference in the average number of words marked in the task pairs. The results of a test administered to 82 undergraduates with three different targets showed significant correlations (rs = .26-.35) with the results of IAT administered to the same participants. It also showed significant reliability (rs = .56-.71). We discuss the practical usability of the test with application studies conducted in various areas.

13.
Peterson, Deary, and Austin (2003) considered the reliability of the Cognitive Styles Analysis (CSA) (Riding, 1991). The CSA seeks to assess an individual’s position on each of two fundamental style dimensions – the Wholist-Analytic and the Verbal-Imagery dimensions. It presents a series of simple cognitive tasks, which the subjects may choose to process according to their preferred style. Performance on these test items is in terms of response times. The CSA comprises 40 items to assess the Wholist-Analytic and 48 for the Verbal-Imagery and typically takes 15–20 min to complete. It is intended to be suitable for a wide age and ability range, and applicable to a variety of contexts and cultures. The most important characteristic of any test of cognitive style is its temporal stability. Studies which attempt to establish test validity without definitive evidence of test reliability are lacking a basic foundation. Riding has not published any statistical data on the test–retest reliability of the CSA. Peterson et al. (2003) and Peterson (2003) claim to have carried out the primary evaluation of the CSA’s reliability. However, we were the first to publish accurate test–retest reliability data on Riding’s CSA (Redmond, Mullally, & Parkinson, 2002). This brief report addresses the question of who first established the unreliability of the CSA and why Peterson, Deary and Austin’s claims are misleading and unsubstantiated.

14.
Classical test theory reliability coefficients are said to be population specific. Reliability generalization, a meta-analysis method, is the main procedure for evaluating the stability of reliability coefficients across populations. A new approach is developed to evaluate the degree of invariance of reliability coefficients to population characteristics. Factor or common variance of a reliability measure is partitioned into parts that are, and are not, influenced by control variables, resulting in a partition of reliability into a covariate-dependent and a covariate-free part. The approach can be implemented in a single sample and can be applied to a variety of reliability coefficients.

15.
The present research was conducted to establish the validity of a novel procedure for measuring human contingency judgements aimed at shortening the length of conventional procedures. Cues and outcomes were simple geometric shapes that were presented in a rapid streaming fashion, reducing the length of a block of trials from several minutes to a few seconds. We establish the reliability of the procedure by replicating two central findings in the contingency judgement literature, and we elaborate on the importance of this method for future research.

17.
Estimation of the reliability of ratings   Total citations: 9 (self-citations: 0, citations by others: 9)
A procedure for estimating the reliability of sets of ratings, test scores, or other measures is described and illustrated. This procedure, based upon analysis of variance, may be applied both in the special case where a complete set of ratings from each of k sources is available for each of n subjects, and in the general case where k_1, k_2, ..., k_n ratings are available for each of the n subjects. It may be used to obtain either a unique estimate or a confidence interval for the reliability of either the component ratings or their averages. The relations of this procedure to others intended to serve the same purpose are considered algebraically and illustrated numerically. The writer wishes to acknowledge the helpful comments and suggestions of Professors E. E. Cureton, Harold Gulliksen, and E. F. Lindquist.
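For the complete-data case described in this abstract (k ratings for each of n subjects), an analysis-of-variance reliability estimate can be sketched as follows. This is a minimal illustration using the standard one-way intraclass correlation formulas, which may differ in detail from the article's own procedure. The function name and the ratings matrix are hypothetical.

```python
def oneway_rating_reliability(ratings):
    """Return (single-rating reliability, reliability of the k-rating average).

    ratings: list of n rows (subjects), each holding k ratings.
    Uses one-way ANOVA mean squares (between subjects vs. within subjects).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical ratings: 4 subjects, each rated by 3 sources.
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2]]
single, average = oneway_rating_reliability(ratings)
print(round(single, 3), round(average, 3))
```

The two estimates are linked by the Spearman-Brown relation, so the reliability of the averaged ratings is never lower than that of a single rating when the single-rating estimate is positive.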

18.
A covariance structure modelling method for the estimation of reliability for composites of congeneric measures in test–retest designs is outlined. The approach also allows an approximate standard error and confidence interval for scale reliability in such settings to be obtained. The procedure further permits measurement error components due to possible transient condition influences to be accounted for and evaluated, and is illustrated with a pair of examples.

19.
In the classical test theory, a high-reliability test always leads to a precise measurement. However, when it comes to the prediction of test scores, it is not necessarily so. Based on a Bayesian statistical approach, we predicted the distributions of test scores for a new subject, a new test, and a new subject taking a new test. Under some reasonable conditions, the predicted means, variances, and covariances of predicted scores were obtained and investigated. We found that high test reliability did not necessarily lead to small variances or covariances. For a new subject, higher test reliability led to larger predicted variances and covariances, because high test reliability enabled a more accurate prediction of test score variances. Regarding a new subject taking a new test, in this study, higher test reliability led to a large variance when the sample size was smaller than half the number of tests. The classical test theory is reanalyzed from the viewpoint of predictions and some suggestions are made.

20.
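Two methods can be used to estimate a confidence interval for the composite reliability of a multidimensional test: the bootstrap method and the delta method. This article compares the two in a simulation study and finds that the differences between the delta and bootstrap results are very small. Because the bootstrap yields empirical results that are usually regarded as reflecting the true values, while the delta method is much simpler than the bootstrap, the delta method can be used to estimate the confidence interval for composite reliability. An example demonstrates how to compute the composite reliability of a multidimensional test and its confidence interval using the delta method.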
