We study a proportional reduction in loss (PRL) measure for the reliability of categorical data and consider the general case in which each of N judges assigns a subject to one of K categories. This measure has been shown to be equivalent to a measure proposed by Perreault and Leigh for a special case when there are two equally competent judges, and the correct category has a uniform prior distribution. We consider a general framework where the correct category is assumed to have an arbitrary prior distribution, and where classification probabilities vary by correct category, judge, and category of classification. In this setting, we consider PRL reliability measures based on two estimators of the correct category: the empirical Bayes estimator and an estimator based on the judges' consensus choice. We also discuss four important special cases of the general model and study several types of lower bounds for PRL reliability.

Bruce Cooil is Associate Professor of Statistics, and Roland T. Rust is Professor and area head for Marketing, Owen Graduate School of Management, Vanderbilt University. The authors thank three anonymous reviewers and an Associate Editor for their helpful comments and suggestions. This work was supported in part by the Dean's Fund for Faculty Research of the Owen Graduate School of Management, Vanderbilt University.

A basis for analyzing test-retest reliability (cited 4 times in total: 0 self-citations, 4 citations by others)
Three sources of variation in experimental results for a test are distinguished: trials, persons, and items. Unreliability is defined only in terms of variation over trials. This definition leads to a more complete analysis than does the conventional one; Spearman's contention is verified that the conventional approach, which was formulated by Yule, introduces unnecessary hypotheses. It is emphasized that at least two trials are necessary to estimate the reliability coefficient. This paper is devoted largely to developing lower bounds to the reliability coefficient that can be computed from but a single trial; these avoid the experimental difficulties of making two independent trials. Six different lower bounds are established, appropriate for different situations. Some of the bounds are easier to compute than are conventional formulas, and all the bounds assume less than do conventional formulas. The terminology used is that of psychological and sociological testing, but the discussion actually provides a general analysis of the reliability of the sum of n variables.

The writer is indebted to the members of his statistical seminar, to Professor Mark Kac, and to Professor Samuel A. Stouffer and his staff in the Research Branch, Information and Education Division, War Department, for their helpful comments on this paper.
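
As a rough illustration of what a single-trial lower bound looks like, the sketch below computes coefficient alpha, a widely used bound of this kind, from one administration of a test scored as the sum of n items. The abstract does not list its six bounds, so treating alpha as representative is an assumption made here, and the function name `coefficient_alpha` and the data layout are hypothetical.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha from a single trial.

    scores: one list per person, holding that person's score on each item.
    Assumed layout for illustration only; not taken from the abstract.
    """
    n_items = len(scores[0])
    # Variance of each item across persons.
    item_variances = [
        pvariance([person[i] for person in scores]) for i in range(n_items)
    ]
    # Variance of the total (summed) score across persons.
    total_variance = pvariance([sum(person) for person in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Three persons, two items (made-up numbers).
    data = [[1, 2], [2, 3], [3, 4]]
    print(coefficient_alpha(data))
```

Only one set of item scores is needed, which is the practical appeal the abstract describes: no second, independent trial has to be run.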

The Raven Colored Progressive Matrices was administered to a sample of 259 children in Lithuania and re-administered 2 years later. The test-retest reliability was .499.

Two groups of students enrolled in a university physical activity course volunteered to complete Kolb's Learning Style Inventory at the beginning of and the end of a semester to estimate test-retest reliability. A control group (n = 129) completed the inventory in its original form while the experimental group (n = 124) completed the same test but with modified instructions providing a more specific focus. Test-retest reliability, assessed using a Pearson product-moment correlation, improved for the group given instructions which specified a contextual focus.
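
A test-retest coefficient of the kind reported here is the Pearson product-moment correlation between the scores from the first and second administrations. The sketch below shows the calculation in plain Python; the function name `pearson_r` and the sample scores are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_first) * (b - mean_second) for a, b in zip(first, second))
    sxx = sum((a - mean_first) ** 2 for a in first)
    syy = sum((b - mean_second) ** 2 for b in second)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up scores for five respondents at the start and end of a semester.
    time_1 = [12, 15, 9, 20, 17]
    time_2 = [13, 14, 10, 19, 18]
    print(pearson_r(time_1, time_2))
```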

Forty-five Swedish couples (N = 90) independently completed a translation of Rothbart's Infant Behavior Questionnaire (IBQ) when their infants were 3 and 8 months of age. There was greater agreement between mothers and fathers at 8 than at 3 months, perhaps because fathers became more involved as their children grew older. At neither age was agreement as great as that reported by Rothbart (1981) in the USA. Parents did not agree on the dimensions Duration of Orienting and Soothability. In the eyes of both parents, there was significant, although modest, stability over time on most dimensions of infant temperament. There was least perceived stability in Distress to Approaching Stimuli (Fear). These results suggest that the IBQ (even in a Swedish translation) may be a reliable and valid way of measuring parental perceptions of infant temperament.

WAIS test-retest reliabilities were calculated for a clinical out-patient sample with testing intervals varying from 1 to 10 yr. There was no relationship between test-retest interval and the stability of test scores and the correlations between IQs were quite satisfactory (Full Scale IQ = .897, Verbal IQ = .906, Performance IQ = .876). Reliabilities remained high even when the sample was divided by diagnosis into organic, neurotic, personality disorder, and schizophrenic subgroups.

Transient errors are caused by variations in feelings, moods, and mental states over time. If these errors are present, coefficient alpha is an inflated estimate of reliability. A true-score model is presented that incorporates transient errors for test-retest data, and a reliability estimate is derived. This estimate, referred to as the test-retest alpha, is less than coefficient alpha if transient error is present and is less susceptible to effects due to item recall than a test-retest correlation. An assumption underlying the test-retest alpha is essential tau equivalency of items. A test-retest split-half coefficient is presented as an alternative to the test-retest alpha when this assumption is violated. The test-retest alpha is the mean of all possible test-retest split-half coefficients.

Previous research efforts have developed and validated various scales potentially useful in evaluating service learning outcomes. The developmental efforts reported for the four scales examined in this study did not include the test-retest reliabilities that would provide assurance to service learning researchers of the long-term stability and therefore usefulness of these measures. Summary estimates of 13-wk. test-retest reliabilities for the scales Civic Participation, Self-efficacy Toward Service, Attitude Toward Helping Others, and College Education's Role in Addressing Social Issues provide service learning researchers with evidence of stability of the scales over the typical duration of service learning courses.

The reliability of magnitude-estimation scaling as a measure of overall clarity of speech was investigated. 40 subjects (M age = 19 yr.) provided magnitude-estimation responses for nine audiotaped versions of a nonsense sentence varying systematically in number of correct consonant phonemes. There was no significant difference in the magnitude-estimation responses of the subjects during two test sessions separated by one week. Analysis suggested that magnitude-estimation scaling is a reliable measure of speech clarity/intelligibility. This finding is discussed in relation to speech samples varying in aspects other than number of consonant phonemes correct and possible further clinical research applications.

In two studies, the construct (convergent and discriminant) validity and test-retest reliability of a date rape decision-latency measure were examined. In Study 1, 174 college men completed measures related to sexual aggression and listened to an audiotaped simulation of a date rape, during which cues of nonconsent and force gradually escalated over time. Participants were instructed to respond, by pressing a button which recorded the latency of their decisions in seconds, if and when they believed the man depicted in the scenario should stop his sexual advances. Results demonstrated positive associations between prolonged decision latencies and sexually aggressive behavior, calloused sexual beliefs, acceptance of interpersonal violence, and sexual promiscuity. In Study 2, initial results were cross-validated in a sample of 102 college men, and discriminant validity was established as decision latencies were unassociated with measures of social desirability, alcohol consumption and drug use. Test-retest reliability assessed over a 2-week interval was .87.

The authors wish to thank Alan Gross and Brian Marx for providing the audiotaped stimulus materials, Richard Marsh for writing the decision-latency computer program, Jason Hicks for assisting with programming, and the undergraduate research assistants for serving as experimenters.

The baseline inter-rater reliability, test-retest reliability, follow-up inter-rater reliability, and follow-up longitudinal reliability (inter-rater reliability between generations of raters) of borderline symptoms and the diagnosis of borderline personality disorder (BPD) were assessed using the Revised Diagnostic Interview for Borderlines (DIB-R). Excellent kappas (> .75) were found in each of these reliability substudies for the diagnosis of BPD itself. Excellent kappas were also found in each of the three inter-rater reliability substudies for the vast majority of borderline symptoms assessed by the DIB-R. Test-retest reliability for these symptoms was somewhat lower but still very good. More specifically, one-third of the BPD symptoms assessed had a kappa in the excellent range and the remaining two-thirds had a kappa in the fair-good range (.57-.73). The dimensional reliability of BPD symptom areas was somewhat higher than for categorical measures of the subsyndromal phenomenology of BPD. More specifically, all five dimensional measures of borderline psychopathology had intraclass correlation coefficients in the excellent range for all four reliability substudies. Taken together, the results of this study suggest that both the borderline diagnosis and the symptoms of BPD can be diagnosed reliably when using the DIB-R. They also suggest that excellent reliability, once achieved, can be maintained over time for both the syndromal and subsyndromal phenomenology of BPD.
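
The kappa values quoted above measure agreement between raters after correcting for the agreement expected by chance. The sketch below assumes the common two-rater form (Cohen's kappa); the abstract does not say which kappa variant the study used, and the function name `cohens_kappa` and the sample ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical ratings."""
    n = len(rater_a)
    # Proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up diagnoses (1 = present, 0 = absent) from two raters.
    rater_1 = [1, 1, 0, 0, 1, 0, 1, 0]
    rater_2 = [1, 1, 0, 0, 1, 0, 0, 0]
    print(cohens_kappa(rater_1, rater_2))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why the thresholds in the abstract (> .75 excellent, .57-.73 fair-good) are read on that scale.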

Tod D, Morrison TG, Edwards C. Body Image, 2012, 9(3): 425-428
The current study assessed relationships among four commonly used drive for muscularity questionnaires, along with their 7 and 14 day test-retest reliability. Sample 1 comprised young British adult males (N=272; M(AGE)=20.3) who completed the questionnaires once. Sample 2, a group of young British adult males (N=54, M(AGE)=19.3), completed the questionnaires three times spaced 7 and 14 days apart. Correlations among Sample 1 ranged from .20 to .82 providing evidence for concurrent and discriminant validities. Evidence for test-retest reliability emerged with intraclass correlations ranging from .78 to .95 (p<.001) and generally nonsignificant t-tests (p>.05). Overall, the data support the psychometric properties of the drive for muscularity inventories; however, the shared variance (35-67%) hints that refinement is possible.

Although driving while intoxicated (DWI) is a pervasive problem, reliable measures of this behavior have been elusive. In the present study, the Form 90, a widely utilized alcohol and substance use instrument, was adapted for measurement of DWI and related behaviors. Levels of reliability for the adapted instrument, the Form 90-DWI, were tested among a university sample of 60 undergraduate students who had consumed alcohol during the past 90 days. The authors administered the instrument once during an intake interview and again, 7-30 days later, to determine levels of test-retest reliability. Overall, the Form 90-DWI demonstrated high levels of reliability for many general drinking and DWI behaviors. Levels of reliability were lower for riding with an intoxicated driver and for variables involving several behavioral conjunctions, such as seat belt use and the presence of passengers when driving with a blood alcohol concentration above .08. Overall, the Form 90-DWI shows promise as a reliable measure of DWI behavior in research on treatment outcome and prevention.

Glenn CR, Klonsky ED. Assessment, 2011, 18(3): 375-378
Nonsuicidal self-injury (NSSI) is a growing public health problem among adolescents and young adults. The Inventory of Statements About Self-Injury (ISAS) is a self-report measure designed to assess NSSI behaviors and functions. The current study examines the one-year test-retest reliability of the ISAS in a sample of young adult self-injurers. Results indicate that the ISAS behavioral and functional scales demonstrate good stability over one year. For the behavioral scales, test-retest correlations ranged from .52 (biting) to .83 (burning), with a median of .68. For the functional scales, test-retest correlations were .60 for the superordinate intrapersonal functions scale and .82 for the superordinate interpersonal functions scale. Regarding individual functions, test-retest correlations ranged from .35 (affect regulation) to .89 (peer bonding), with a median of .59. Findings suggest the ISAS has good test-retest reliability and contributes to the growing literature on the psychometric properties of the ISAS.

Sensitivity theory provides an analysis of personality based on what people say motivates their behavior. After Reiss and Havercamp (1998) confirmed a 15-factor solution to self-reported human strivings, the Reiss Profile of Fundamental Goals and Motivation Sensitivities (Reiss & Havercamp, 1998) psychometric instrument was standardized. In 3 studies, the Reiss Profile was shown to possess good test-retest and internal reliability and concurrent and criterion validity. Ten independent samples of adults (n = 764) and a comparison group (n = 737) participated in these studies. Pearson product-moment correlations between the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960) and the Reiss Profile ranged in absolute value from .01 to .39 (M = .16). How people self-reported their trait motives correlated with how they behaved in the "real world." The Reiss Profile can be used to study motivational traits.

An important aspect of human individual face recognition is the ability to discriminate unfamiliar individuals. Since many general processes contribute to explicit behavioural performance in individual face discrimination tasks, isolating a measure of unfamiliar individual face discrimination ability in humans is challenging. In recent years, a fast periodic visual stimulation approach (FPVS) has provided objective (frequency-locked) implicit electrophysiological indices of individual face discrimination that are highly sensitive at the individual level within a few minutes of testing. Here we evaluate the test-retest reliability of this response across scalp electroencephalographic (EEG) recording sessions separated by more than two months, in the same 30 individuals. We found no test-retest difference overall across sessions in terms of amplitude and spatial distribution of the EEG individual face discrimination response. Moreover, with only 4 stimulation sequences corresponding to 4 min of recordings per session, the individual face discrimination response was highly reliable in terms of amplitude, spatial distribution, and shape. Together with previous observations, these results strengthen the diagnostic value of FPVS-EEG as an objective and rapid flag for specific difficulties in individual face recognition in the human population.

A method whereby biographical or other questionnaire data of a purely qualitative nature may be used to predict success or failure on an independent criterion is presented. The method is not new but the present least-squares derivation and the transformation equation for punched card coding were not available in the literature. The proper weights are found to be proportional to the per cent of passers in the various categories. The method is suggested as a suitable substitute for non-linear approaches in connection with purely quantitative data as well. The implications of reweighting in connection with multiple regression are discussed. The lavish use of degrees of freedom makes cross-validation extremely desirable.
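
The weighting rule stated in the abstract, weights proportional to the per cent of passers in each category, can be sketched as follows. This is an illustrative reading only: the function name `pass_rate_weights`, the data layout, and the choice of 1 as the proportionality constant are assumptions, and the paper's least-squares derivation and punched-card transformation are not reproduced.

```python
def pass_rate_weights(categories, passed):
    """Weight each response category by the proportion of passers in it.

    categories: the category chosen by each person on one questionnaire item.
    passed: True/False for each person on the independent criterion.
    """
    totals = {}
    passes = {}
    for category, ok in zip(categories, passed):
        totals[category] = totals.get(category, 0) + 1
        passes[category] = passes.get(category, 0) + (1 if ok else 0)
    # Proportion of passers per category (proportionality constant taken as 1).
    return {category: passes[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Made-up responses to one biographical item and criterion outcomes.
    answers = ["rural", "rural", "urban", "urban", "urban"]
    outcome = [True, False, True, True, False]
    print(pass_rate_weights(answers, outcome))
```

Because every category gets its own weight, the number of fitted quantities grows with the number of categories, which is the "lavish use of degrees of freedom" that makes cross-validation on a fresh sample desirable.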
