Similar Articles
20 similar articles found.
1.
Lopez MN  Lazar MD  Oh S 《Assessment》2003,10(1):66-70
The psychometric properties of the Hooper Visual Organization Test (VOT) have not been well investigated. Here the authors present internal consistency and interrater reliability coefficients, and an item analysis, using data from a sample (N = 281) of "cognitively impaired" and "cognitively intact" patients, and patients with undetermined cognitive status. Coefficient alpha for the VOT total sample was .882. An item analysis found that 26 of the 30 items were good at discriminating among patients. Also, the interrater reliabilities for three raters (.992), two raters (.988), and one rater (.977) were excellent. Therefore, the judgmental scoring of the VOT does not interfere significantly with its clinical utility. The authors conclude that the VOT is a psychometrically sound test.
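As a quick reference for the statistic reported above, coefficient alpha can be computed from an examinees-by-items score matrix in a few lines. A minimal Python sketch with random placeholder data (not the study's):

    import numpy as np

    def cronbach_alpha(scores):
        """Coefficient alpha for an (examinees x items) score matrix."""
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)       # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Placeholder data shaped like the VOT sample: 281 patients x 30 binary items.
    rng = np.random.default_rng(0)
    sim = rng.integers(0, 2, size=(281, 30)).astype(float)
    print(round(cronbach_alpha(sim), 3))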

2.
A program is described for computing interrater reliability by averaging, for each rater, the correlations between one rater's ratings and every other rater's ratings. For situations in which raters rate more than one ratee, raters' reliabilities can be computed for either each item or each ratee. The program reads data from a text file and writes the reliability coefficients to a text file. The standard Macintosh interface is implemented. The QuickBASIC program is distributed both as a listing and in compiled form; it benefits from a math coprocessor.
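The program itself targets the classic Macintosh, but the averaging logic it describes is easy to restate. A minimal Python sketch (function and variable names are mine, not the program's):

    import numpy as np

    def rater_reliabilities(ratings):
        """For a (raters x ratees) matrix, return, for each rater, the mean
        correlation between that rater's ratings and every other rater's."""
        r = np.corrcoef(ratings)          # rater-by-rater correlation matrix
        n = r.shape[0]
        return np.array([np.delete(r[i], i).mean() for i in range(n)])

    # Example: 4 raters each rating 10 ratees (hypothetical values).
    rng = np.random.default_rng(42)
    print(rater_reliabilities(rng.normal(size=(4, 10))))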

3.
4.
The assessment of multiliterate handwriting performance is rarely reported despite increased globalization. The present study describes the psychometric properties of a handwriting speed test developed for children who are biliterate in English and Chinese, including interrater reliability, test-retest reliability, interitem correlation, construct validity, and concurrent validity. The test's reliabilities between two raters and over a 1-week interval were high, with ICCs ranging from .89 to .99. Interitem correlation between the English and Chinese items was .87. The presence of age trends but not sex differences was a positive indicator of the test's validity. Correlations of .91 and 1.00 between the Chinese and the English items of the Handwriting Assessment Tool with the Chinese Handwriting Speed Test and Handwriting Speed Test, respectively, provided evidence of concurrent validity. These preliminary results showed the Handwriting Assessment Tool is reliable and is a potentially useful handwriting test for children biliterate in English and Chinese. The feasibility of assessing biliterate handwriting speed performance with the same set of scoring criteria for different writing systems was supported.

5.
This study examined the short-interval test-retest reliability of the Structured Clinical Interview (SCID-II; First, Spitzer, Gibbon, & Williams, 1995) for DSM-IV personality disorders (PDs). The SCID-II was administered to 69 in- and outpatients on two occasions separated by 1 to 6 weeks. The interviews were conducted at three sites by ten raters. Each rater acted as first and as second rater an equal number of times. The test-retest interrater reliability for the presence or absence of any PD was fair to good (kappa = .63) and was higher than values found in previous short-interval test-retest studies with the SCID-II for DSM-III-R. Test-retest reliability coefficients for trait and sum scores were sufficient, except for dependent PD. Values for single criteria were variable, ranging from poor to good agreement. Further large-scale test-retest research is needed to assess the interrater reliability of more of the categorical diagnoses and of single traits.

6.
Previous research on measurement error in job performance ratings estimated reliability using three coefficients: alpha, test–retest, and interrater correlation. None of these three coefficients controls for the four main sources of error in performance ratings. For this reason, the coefficient of equivalence and stability (CES) has been suggested as the ideal estimate of reliability. This article presents estimates of the CES for time intervals of 1, 2, and 3 years. The values obtained for a single rater were .51, .48, and .44, respectively. For two raters, the values were .59, .55, and .51. The findings suggest that previous reliability estimates based on alpha, test–retest, and interrater coefficients overestimated the reliability of job performance ratings. In the present study, the interrater coefficient overestimates reliability by 13.6–25.4% for time intervals of 1–3 years, as it does not control for transient error. Results also showed that the importance of transient error increases as the interval between the measures lengthens. Based on these results, it is suggested that corrected validities based on interrater reliability underestimate the magnitude of the validity. The implications of these findings for future efforts to estimate criterion reliability and predictor validity are discussed.
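Why a same-occasion interrater coefficient overestimates reliability when transient error is present can be shown with a small simulation. A sketch under assumed variance components (illustrative only, not the study's data):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000
    true_perf = rng.normal(size=n)                                   # stable performance
    occasion = {t: rng.normal(scale=0.6, size=n) for t in (1, 2)}    # transient effects
    idiosyncrasy = {r: rng.normal(scale=0.7, size=n) for r in "ab"}  # rater x ratee effects

    def rating(t, r):
        # rating = truth + transient effect + rater-specific view + random error
        return true_perf + occasion[t] + idiosyncrasy[r] + rng.normal(scale=0.5, size=n)

    # Same-occasion interrater correlation: the transient component is shared
    # by both raters, so it inflates the coefficient.
    interrater = np.corrcoef(rating(1, "a"), rating(1, "b"))[0, 1]

    # CES (different rater AND different occasion): only true performance is
    # shared, so both rater-specific and transient error lower the estimate.
    ces = np.corrcoef(rating(1, "a"), rating(2, "b"))[0, 1]
    print(round(interrater, 2), round(ces, 2))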

7.
In an effort to duplicate high interrater reliability coefficients reported in the use of Epley and Ricks' (1963) time orientation scoring system with the Thematic Apperception Test (TAT), two pairs of judges and two different training procedures were employed. Reliability coefficients considerably lower than those quoted by other researchers were found. One method of using the system was to have judges discuss scoring differences during training and at various times during a research project until perfect agreement was reached. When this was used as an adjunct, with periodic assessment of reliability as judges scored a large number of stories, reliability coefficients within a range acceptable for research purposes were obtained. This procedure is presented with correlational evidence for the presence of the time factor that the scoring system purports to measure.

8.
Protocols from 110 evaluations using the Wechsler Intelligence Scale for Children-Third Edition (WISC-III) and the Woodcock-Johnson Tests of Achievement-Revised (WJ-R) were scored by two different raters to determine (a) whether subtests with more difficult scoring yield lower interrater correlation coefficients, (b) whether scoring errors on subtests affect broad score estimates, (c) the effect of rater expertise on scoring errors, and (d) whether scoring errors affect a learning disability determination based on IQ/achievement discrepancy. Scoring errors were found on almost 25% of Comprehension and Vocabulary subtests; however, the effect of these scoring errors was minimal. About 42% of Writing Samples subtests had scoring errors, resulting in a mean change of 1.75 points on the Broad Written Language cluster. On the WISC-III, but not the WJ-R, significantly more errors were made by inexperienced testers. Scoring errors resulted in two cases in which the learning disability determination would be changed. Overall, the study corroborates previous findings of strong interrater reliability on most subtests of common IQ and achievement tests and indicates that novice scorers are not likely to make scoring mistakes that will significantly impact an IQ/achievement-discrepancy-based documentation of learning disability.

9.
Interrater correlations are widely interpreted as estimates of the reliability of supervisory performance ratings, and are frequently used to correct the correlations between ratings and other measures (e.g., test scores) for attenuation. These interrater correlations do provide some useful information, but they are not reliability coefficients. There is clear evidence of systematic rater effects in performance appraisal, and variance associated with raters is not a source of random measurement error. We use generalizability theory to show why rater variance is not properly interpreted as measurement error, and show how such systematic rater effects can influence both reliability estimates and validity coefficients. We show conditions under which interrater correlations can either overestimate or underestimate reliability coefficients, and discuss reasons other than random measurement error for low interrater correlations.
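The decomposition the authors invoke can be sketched for the simplest case, a fully crossed ratee-by-rater design with one rating per cell (the paper's own models are richer; names below are mine):

    import numpy as np

    def g_study(x):
        """Variance components for a crossed (ratee x rater) design."""
        n_p, n_r = x.shape
        grand = x.mean()
        ms_p = n_r * ((x.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
        ms_r = n_p * ((x.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
        resid = (x - x.mean(axis=1, keepdims=True)
                   - x.mean(axis=0, keepdims=True) + grand)
        ms_e = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))
        var_p = (ms_p - ms_e) / n_r       # true ratee differences
        var_r = (ms_r - ms_e) / n_p       # systematic rater severity
        return var_p, var_r, ms_e

    # Rater severity (var_r) cancels out of relative coefficients,
    # var_p / (var_p + var_e), but not out of absolute ones,
    # var_p / (var_p + var_r + var_e) -- one way interrater
    # correlations can mislead as reliability estimates.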

10.
Scores on the Clock Drawing Test have long been considered a useful screening tool for neuropsychological dysfunction, and a number of scoring methods have been developed to evaluate various aspects of performance. This study compared quantitative and qualitative scoring by briefly trained students on 145 clock drawings produced by patients in a geriatric psychiatry outpatient clinic to estimate the interrater reliability of the methods, users' acceptance of the methods, and whether the methods support differential diagnosis. Both systems showed acceptable interrater reliability. Using the quantitative method, raters scored drawings by patients with organic mental disease as more impaired than those of patients diagnosed as depressed or schizophrenic. Results suggest that the Clock Drawing Test is a reliable screening tool for cognitive impairment in a geropsychiatric population, but the scoring methods examined do not yet appear psychometrically sound enough to provide a differential diagnosis.

11.
The study investigated the level of agreement among graduate students (N = 14) and school psychologists (N = 18) in scoring drawings for the 10 designs on the WPPSI Geometric Design subtest. Considerable scoring disagreement occurred within each group. Unanimous agreement was found for only 11 of 50 drawing items among the graduate students and for only 7 of 50 drawing items among the school psychologists. While the raters were generally confident of their ratings, there was also a significant positive relationship between level of scoring agreement and confidence ratings (rho = .76, p < .05). Scoring disagreement was greater for the drawings on designs 6 through 9 than on other designs. The results suggest that careful study of the WPPSI scoring criteria is needed in order to achieve scoring proficiency.

12.
Reliability is one of the most important aspects of testing in educational and psychological measurement. The construction of confidence intervals for reliability coefficients has important implications for evaluating the accuracy of the sample estimate of reliability and for comparing different tests, scoring rubrics, or training procedures for raters or observers. The present simulation study evaluated and compared various parametric and non-parametric methods for constructing confidence intervals of coefficient alpha. Six factors were manipulated: number of items, number of subjects, population coefficient alpha, deviation from essentially parallel condition, item response distribution and type. The coverage and width of different confidence intervals were compared across simulation conditions.
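One nonparametric interval of the kind such studies compare is the percentile bootstrap, resampling subjects. A minimal sketch of the general approach, not the study's implementation:

    import numpy as np

    def cronbach_alpha(x):
        k = x.shape[1]
        return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum()
                                / x.sum(axis=1).var(ddof=1))

    def alpha_percentile_ci(x, n_boot=2000, level=0.95, seed=0):
        """Percentile-bootstrap CI for alpha, resampling subjects (rows)."""
        rng = np.random.default_rng(seed)
        n = x.shape[0]
        boots = [cronbach_alpha(x[rng.integers(0, n, n)]) for _ in range(n_boot)]
        tail = (1 - level) / 2 * 100
        return tuple(np.percentile(boots, [tail, 100 - tail]))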

13.
This paper demonstrates and compares methods for estimating the interrater reliability and interrater agreement of performance ratings. These methods can be used by applied researchers to investigate the quality of ratings gathered, for example, as criteria for a validity study, or as performance measures for selection or promotional purposes. While estimates of interrater reliability are frequently used for these purposes, indices of interrater agreement appear to be rarely reported for performance ratings. A recommended index of interrater agreement, the T index (Tinsley & Weiss, 1975), is compared to four methods of estimating interrater reliability (Pearson r, coefficient alpha, mean correlation between raters, and intraclass correlation). Subordinate and superior ratings of the performance of 100 managers were used in these analyses. The results indicated that, in general, interrater agreement and reliability among subordinates were fairly high. Interrater agreement between subordinates and superiors was moderately high; however, interrater reliability between these two rating sources was very low. The results demonstrate that interrater agreement and reliability are distinct indices and that both should be reported. Reasons are discussed as to why interrater reliability should not be reported alone. This paper is based, in part, on a thesis submitted to East Carolina University by the second author. Portions of this study were presented at the American Psychological Association meeting in New Orleans, LA, August 1989. The authors would like to thank Michael Campion and two anonymous reviewers for their comments on earlier drafts of this paper.
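The distinction between the two indices is easy to make concrete: two raters can rank ratees identically (perfect reliability) while never assigning the same score (poor agreement). Toy numbers, not the study's data; the paper's own agreement statistic is the T index, not shown here:

    import numpy as np

    ratings_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    ratings_b = ratings_a + 2.0                     # rater B is uniformly 2 points higher

    print(np.corrcoef(ratings_a, ratings_b)[0, 1])  # 1.0 -> perfect interrater reliability
    print(np.abs(ratings_a - ratings_b).mean())     # 2.0 -> substantial disagreement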

14.
Few group psychotherapy studies focus on therapists' interventions, and instruments that can measure group psychotherapy treatment fidelity are scarce. The aim of the present study was to evaluate the reliability of the Mentalization-based Group Therapy Adherence and Quality Scale (MBT-G-AQS), a 19-item scale developed to measure adherence and quality in mentalization-based group therapy (MBT-G). Eight MBT groups and eight psychodynamic groups (a total of 16 videotaped therapy sessions) were rated independently by five raters. All groups were long-term, outpatient psychotherapy groups with 1.5-hour weekly sessions. Data were analysed in a generalizability study (G-study and D-study); the generalizability models included analyses of reliability for different numbers of raters. The global (overall) ratings for adherence and quality showed high to excellent reliability for all numbers of raters (with five raters, 0.97 for adherence and 0.96 for quality). The mean reliability across all 19 items for a single rater was 0.57 (item range 0.26-0.86) for adherence and 0.62 (item range 0.26-0.83) for quality. Two raters obtained mean absolute G-coefficients of 0.71 (item range 0.41-0.92) for adherence and 0.76 (item range 0.42-0.91) for quality. With all five raters, the mean absolute G-coefficient was 0.86 (item range 0.63-0.97) for adherence and 0.88 (item range 0.64-0.96) for quality. The study demonstrates high reliability of MBT-G-AQS ratings. In models differentiating between numbers of raters, reliability was particularly high with several raters, but was also acceptable for two. For practical purposes, the MBT-G-AQS can be used for training, supervision and psychotherapy research.
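The pattern of reliability rising with the number of raters follows from D-study logic. A sketch for a simple one-facet design, with hypothetical variance components chosen to roughly echo the means reported above (the study's actual design is more elaborate):

    def absolute_g(var_person, var_rater, var_resid, n_raters):
        """Absolute G-coefficient (phi) for ratings averaged over n_raters."""
        return var_person / (var_person + (var_rater + var_resid) / n_raters)

    for k in (1, 2, 5):
        print(k, round(absolute_g(1.0, 0.3, 0.45, k), 2))  # -> 0.57, 0.73, 0.87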

15.
Cooke DJ  Hart SD  Michie C 《Psychological Assessment》2004,16(3):335-339
Cross-national differences in the prevalence of psychopathy have been reported. This study examined whether rater effects could account for these differences. Psychopathy was assessed with the Psychopathy Checklist-Revised (PCL-R; R. D. Hare, 1991). Videotapes of 6 Scottish prisoners and 6 Canadian prisoners were rated by 10 Scottish and 10 Canadian raters. No significant main or interaction effects involving the nationality of raters were detected at the level of full scores or factor scores. Using a generalizability theory approach, it was demonstrated that the interrater reliability of total scores was good, that is, the proportion of variance in test scores attributable to raters was small. The interrater reliability of factor scores was lower, typically falling in the fair range. Overall, the results suggest that the reported cross-national differences are more likely to be in the expression of the disorder rather than in the eye of the beholder.

16.
Assessing the Putonghua Proficiency Test with Multivariate Generalizability Theory
杨志明  张雷 《心理学报》2002,34(1):51-56
Multivariate generalizability theory (MGT) was applied to the Putonghua proficiency test developed by the State Language Commission. In the G-study, variance and covariance components for the various sources of score variation were estimated from the Putonghua test data of Hong Kong examinees. In the D-study, universe scores and generalizability coefficients were first estimated for each of the test's three parts, and then the composite universe score, its generalizability coefficient, and the signal-to-noise ratio were estimated. The results indicate that the overall reliability of the test is high and that compositing the universe scores of the three parts is reasonable, although the reliability of Part 3 is comparatively low. Moreover, with 3 raters and 28 items, Parts 1 and 2 already show high reliability, so reducing the number of items in these two parts would cause little problem in operational testing.
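The generalizability coefficient and the signal-to-noise ratio mentioned in the abstract are directly related: if g is the proportion of observed-score variance due to universe scores, then S/N = g / (1 - g). A one-line sketch:

    def signal_to_noise(g):
        """Signal-to-noise ratio implied by a generalizability coefficient g."""
        return g / (1 - g)

    print(signal_to_noise(0.8))  # a G coefficient of .80 implies S/N = 4.0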

17.
The baseline interrater reliability, test-retest reliability, follow-up interrater reliability, and follow-up longitudinal reliability (interrater reliability between generations of raters) of borderline symptoms and the diagnosis of borderline personality disorder (BPD) were assessed using the Revised Diagnostic Interview for Borderlines (DIB-R). Excellent kappas (> .75) were found in each of these reliability substudies for the diagnosis of BPD itself. Excellent kappas were also found in each of the three interrater reliability substudies for the vast majority of borderline symptoms assessed by the DIB-R. Test-retest reliability for these symptoms was somewhat lower but still very good. More specifically, one-third of the BPD symptoms assessed had a kappa in the excellent range and the remaining two-thirds had a kappa in the fair-to-good range (.57-.73). The dimensional reliability of BPD symptom areas was somewhat higher than for categorical measures of the subsyndromal phenomenology of BPD. More specifically, all five dimensional measures of borderline psychopathology had intraclass correlation coefficients in the excellent range for all four reliability substudies. Taken together, the results of this study suggest that both the borderline diagnosis and the symptoms of BPD can be diagnosed reliably using the DIB-R. They also suggest that excellent reliability, once achieved, can be maintained over time for both the syndromal and subsyndromal phenomenology of BPD.
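For reference, the kappa statistic used throughout such studies corrects observed agreement for chance agreement. A minimal sketch for the two-rater, binary-diagnosis case (toy data, not the study's):

    import numpy as np

    def cohens_kappa(r1, r2):
        """Cohen's kappa for two raters' binary diagnoses (1 = disorder present)."""
        r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
        p_obs = (r1 == r2).mean()                         # observed agreement
        p_chance = (r1.mean() * r2.mean()
                    + (1 - r1.mean()) * (1 - r2.mean()))  # expected by chance
        return (p_obs - p_chance) / (1 - p_chance)

    print(round(cohens_kappa([1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0]), 2))  # 0.67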

18.
Reliable and valid assessment of abnormal speech patterns may enable earlier recognition of nonpsychotic disorders through their characteristic speech patterns. This study sought to establish interrater reliability using a standardized scoring guide. A scoring guide defining 27 elements of disordered thought (e.g., inappropriate self-reference, simple loss of goal, circumstantiality) was developed. The seminal work of Andreasen's and Holzman's groups provided 12 elements, and 15 new elements were suggested by the clinical literature. Audiotaped interviews with 12 psychiatric inpatients, adults of both sexes and various ages hospitalized for acute management of nonpsychotic psychiatric disorders, provided speech samples for observation of disordered thought by two independent raters. Using the guide's definitions and accompanying examples of elements of disordered thought, reliability in scoring was high (kappa of .85 for agreement on the presence of any abnormal speech element, and kappa values from .66 to 1.00 for agreement on the presence of individual elements).

19.
The purpose of this study was to select effective tests of motor ability, based on pass-or-fail criteria, for use with preschool children. 37 items selected by examining theoretical validity and the results of preliminary tests were administered to preschool children (3 yr.: M = 3.7 yr., SD = 0.28; 4 yr.: M = 4.7 yr., SD = 0.28; 5 yr.: M = 5.7 yr., SD = 0.28). A skilled tester and each child's homeroom teacher rated whether the child's performance passed the criteria or not. With agreement across two trials as an index of test-retest reliability, the mean agreement among the three grades ranged from 69% to 99% for Locomotion, 59% to 95% for Manipulation, and 66% to 100% for Stability. Disagreement across two trials may reflect instability in movement, practice effects, and so on. With agreement between two testers as an index of objectivity, 33 of the 37 items showed interrater agreement of 80% or more for all three grades. No significant increase in pass rate with age was found for 10 items. On examining the three conditions mentioned above, 27 items were selected as tests of motor ability: 14 items for Locomotion, 7 items for Manipulation, and 6 items for Stability.

20.
A many-facet Rasch model (MFRM) was used to analyze rater bias among the 28 raters in an evaluation of the educational quality of early childhood programs. The results showed significant differences in severity among the 28 raters; 3 raters showed poor internal consistency, while the remaining 25 were stable; the rater-by-class interaction was not significant, but the rater-by-item interaction was significant. The findings indicate that the MFRM permits individual-level analysis of rater bias in evaluations of early childhood program quality and, from an item response theory perspective, provides a modern psychometric basis for targeted rater training and for assessing rater qualification so as to build a pool of qualified raters.
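The model's core is worth stating: in the many-facet Rasch model, the log-odds of a positive rating decompose into person ability minus item difficulty minus rater severity. A sketch of the dichotomous case (the study's rating-scale data would call for a polytomous extension):

    import math

    def mfrm_prob(ability, item_difficulty, rater_severity):
        """Dichotomous many-facet Rasch model: probability of a positive
        rating, with all parameters in logits."""
        return 1 / (1 + math.exp(-(ability - item_difficulty - rater_severity)))

    # A more severe rater (+1 logit) lowers the expected rating:
    print(round(mfrm_prob(0.5, 0.0, 0.0), 2))  # 0.62
    print(round(mfrm_prob(0.5, 0.0, 1.0), 2))  # 0.38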
