Similar literature
20 similar documents found (search time: 31 ms)
1.
Percentage agreement measures of interobserver agreement or "reliability" have traditionally been used to summarize observer agreement from studies using interval recording, time-sampling, and trial-scoring data collection procedures. Recent articles disagree on whether to continue using these percentage agreement measures, and on which ones to use, and what to do about chance agreements if their use is continued. Much of the disagreement derives from the need to be reasonably certain we do not accept as evidence of true interobserver agreement those agreement levels which are substantially probable as a result of chance observer agreement. The various percentage agreement measures are shown to be adequate to this task, but easier ways are discussed. Tables are given to permit checking to see if obtained disagreements are unlikely due to chance. Particularly important is the discovery of a simple rule that, when met, makes the tables unnecessary. If reliability checks using 50 or more observation occasions produce 10% or fewer disagreements, for behavior rates from 10% through 90%, the agreement achieved is quite improbably the result of chance agreement.

2.
Interval by interval reliability has been criticized for "inflating" observer agreement when target behavior rates are very low or very high. Scored interval reliability and its converse, unscored interval reliability, however, vary as target behavior rates vary when observer disagreement rates are constant. These problems, along with the existence of "chance" values of each reliability which also vary as a function of response rate, may cause researchers and consumers difficulty in interpreting observer agreement measures. Because each of these reliabilities essentially compares observer disagreements to a different base, it is suggested that the disagreement rate itself be the first measure of agreement examined, and its magnitude relative to occurrence and to nonoccurrence agreements then be considered. This is easily done via a graphic presentation of the disagreement range as a bandwidth around reported rates of target behavior. Such a graphic presentation summarizes all the information collected during reliability assessments and permits visual determination of each of the three reliabilities. In addition, graphing the "chance" disagreement range around the bandwidth permits easy determination of whether or not true observer agreement has likely been demonstrated. Finally, the limits of the disagreement bandwidth help assess the believability of claimed experimental effects: those leaving no overlap between disagreement ranges are probably believable, others are not.
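As an illustration of why the three reliabilities compare the same disagreements to different bases, the following is a minimal sketch, not taken from the article. It assumes the conventional definitions: interval-by-interval agreement divides all agreements by all intervals, scored-interval agreement divides occurrence agreements by occurrence agreements plus disagreements, and unscored-interval agreement does the same for nonoccurrence agreements. The function name `interval_agreement` and the 0/1 record format are hypothetical.

```python
def interval_agreement(obs_a, obs_b):
    """Summarize agreement between two observers' interval records.

    obs_a, obs_b: equal-length sequences of 0/1 scores, one per interval
    (1 = behavior scored, 0 = unscored). Assumed record format.
    """
    if len(obs_a) != len(obs_b):
        raise ValueError("records must cover the same intervals")
    n = len(obs_a)
    if n == 0:
        raise ValueError("records must not be empty")

    # Intervals both observers scored, and intervals neither scored.
    both = sum(1 for a, b in zip(obs_a, obs_b) if a and b)
    neither = sum(1 for a, b in zip(obs_a, obs_b) if not a and not b)
    disagree = n - both - neither

    def ratio(agree):
        # Agreements of one kind relative to those agreements plus disagreements.
        base = agree + disagree
        return agree / base if base else None

    return {
        "disagreement_rate": disagree / n,
        "interval_by_interval": (both + neither) / n,
        "scored_interval": ratio(both),
        "unscored_interval": ratio(neither),
    }
```

With a low-rate behavior (for example, observer A scoring 2 of 10 intervals and observer B scoring 1 of them), a single disagreement yields 90% interval-by-interval agreement but only 50% scored-interval agreement, which is the kind of divergence the abstract describes.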

3.
This paper examines a model and defines reasonable assumptions underlying different measures of observer agreement for categorical data collected in free operant situations. It is assumed that two or more observers classify operant behaviors of subjects into occurrences and nonoccurrences by recognition by validated response classes (categories) such that the rates of false positives and observer biases are acceptably low. Thus errors are mostly omissions, i.e., failing to observe events that occur. Four alternative cases are derived, together with formulas for calculating significance tests, variances, and standard errors, three of which do not depend on knowledge of the proportion of time points at which the event does not occur. We wish to acknowledge NICHD Grant HD-10570, The Neuropharmacology of Developmental Disorders, George Breese, Ph.D., and C. T. Gualtieri, M.D., Principal Investigators; NIEHS Grant ES-01104; USPHS Grant HD-03110; and MCH Project 916 to the Division for Disorders of Development and Learning.

4.
Proposed methods of assessing the statistical significance of interobserver agreements provide erroneous probability values when conducted on serially correlated data. Investigators who wish to evaluate interobserver agreements by means of statistical significance can do so by limiting the analysis to every kth interval of data, or by using Markovian techniques which accommodate serial correlations.

5.
Changes in imitative behavior and attentiveness were observed in 40 infants when they were 2 to 6 months of age. The facial expressions happy, sad, and surprised were modeled in a trials-to-criterion procedure, and the infants' looking time and mouth movements were recorded by an observer who was unaware of the face being modeled. In addition, the observer recorded her guess as to the expression being modeled by the corresponding expression on the infant's face and rated the infant's expressivity. The results suggested that looking time, correspondence between the mouth expression of the infant and the mouth expression modeled, accuracy of the observer's guess, and expressivity ratings decreased from 2 to 3 and 4 to 6 months. Although matching of mouth movements with the modeled mouth movements and accuracy of guesses were greater than chance over the 2 to 6 month-period, the decreases in these measures suggest that imitative behavior declined across early infancy. The decrease in looking time suggests that imitative behavior and attentiveness may be related and highlights the limitation of this paradigm for assessing the development of imitation during early infancy.

6.
Behavioral researchers have developed a sophisticated methodology to evaluate behavioral change which is dependent upon accurate measurement of behavior. Direct observation of behavior has traditionally been the mainstay of behavioral measurement. Consequently, researchers must attend to the psychometric properties, such as interobserver agreement, of observational measures to ensure reliable and valid measurement. Of the many indices of interobserver agreement, percentage of agreement is the most popular. Its use persists despite repeated admonitions and empirical evidence indicating that it is not the most psychometrically sound statistic to determine interobserver agreement due to its inability to take chance into account. Cohen's (1960) kappa has long been proposed as the more psychometrically sound statistic for assessing interobserver agreement. Kappa is described and computational methods are presented.
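The abstract does not reproduce the article's computational methods, so the following is only a minimal sketch of the standard definition of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each observer's marginal rates. The function name `cohen_kappa` and the list-based input are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same occasions.

    ratings_a, ratings_b: equal-length lists of category labels
    (hypothetical input format, e.g. 0/1 interval scores).
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("ratings must be non-empty and of equal length")

    # Observed proportion of occasions on which the observers agree.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the two observers' marginal
    # proportions, summed over categories.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_chance == 1.0:
        # Both observers used a single identical category; kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1 - p_chance)
```

For two records that agree on 9 of 10 intervals, where one observer scores the behavior twice and the other once, percentage agreement is 90% while kappa is about 0.62, which shows how correcting for chance lowers the apparent agreement at low behavior rates.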

7.
Studies of agreement commonly occur in psychiatric research. For example, researchers are often interested in the agreement among radiologists in their review of brain scans of elderly patients with dementia or in the agreement among multiple informant reports of psychopathology in children. In this paper, we consider the agreement between two raters when rating a dichotomous outcome (e.g., presence or absence of psychopathology). In particular, we consider logistic regression models that allow agreement to depend on both rater- and subject-level covariates. Logistic regression has been proposed as a simple method for identifying covariates that are predictive of agreement (Coughlin et al., 1992). However, this approach is problematic since it does not take account of agreement due to chance alone. As a result, a spurious association between the probability (or odds) of agreement and a covariate could arise due entirely to chance agreement. That is, if the prevalence of the dichotomous outcome varies among subgroups of the population, then covariates that identify the subgroups may appear to be predictive of agreement. In this paper we propose a modification to the standard logistic regression model in order to take proper account of chance agreement. An attractive feature of the proposed method is that it can be easily implemented using existing statistical software for logistic regression. The proposed method is motivated by data from the Connecticut Child Study (Zahner et al., 1992) on the agreement among parent and teacher reports of psychopathology in children. In this study, parents and teachers provide dichotomous assessments of a child's psychopathology and it is of interest to examine whether agreement among the parent and teacher reports is related to the age and gender of the child and to the time elapsed between parent and teacher assessments of the child. The authors thank the Associate Editor and the referees for their helpful comments and suggestions. We also thank Gwen Zahner for use of data from the Connecticut Child Study, which was conducted under contract to the Connecticut Department of Children and Youth Services. This research was supported by grants HL 69800, AHRQ 10871, HL52329, HL61769, GM 29745, MH 54693 and MH 17119 from the National Institutes of Health.

8.
Access to outpatient psychotherapy in Germany is regulated by an application and expert opinion procedure in a peer-review system. In an external assessment procedure, each patient's application is evaluated with respect to the existence of a mental disorder, a positive prognosis, and the adequacy of the chosen therapy rationale. The present paper examines the reliability of this procedure by reanalysing the data from three studies on interrater agreement in the expert opinions about psychoanalytic/psychodynamic therapy, behaviour therapy, or child and youth behaviour therapy. In the study of Rudolf et al. (2002), 48 experts re-examined two already assessed cases; in the studies of Sulz et al. (2003) and Sulz and Peterander (2004), 30 and 7 experts, respectively, judged five unselected or seven selected applications. The interrater agreement was calculated using the kappa coefficient by Fleiss for the agreement among many raters, which tests the observed agreement probability against the expected agreement probability that would occur by chance. The level of agreement among the experts ranges between 46% and 70%. With the chosen method it is mostly not possible to show that agreement is significantly higher than expected by chance. The generalizability of the results to the usual assessment procedure is discussed, as well as their potential for the advancement of the application procedure and expert peer-review system.
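For readers unfamiliar with the coefficient named above, here is a minimal sketch of Fleiss' kappa under its standard definition; it is not the authors' analysis code. It assumes every case is judged by the same number of raters and that the input is a table of counts, where `counts[i][j]` is the number of raters who assigned case `i` to category `j`. The function name `fleiss_kappa` is hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for agreement among many raters.

    counts: list of rows, one per case; counts[i][j] is the number of
    raters who placed case i in category j. Every row must sum to the
    same number of raters (an assumption of this sketch).
    """
    n_subjects = len(counts)
    if n_subjects == 0:
        raise ValueError("at least one case is required")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each case needs the same number (>= 2) of raters")
    n_categories = len(counts[0])

    # Per-case agreement: proportion of rater pairs that agree.
    pairs = n_raters * (n_raters - 1)
    p_i = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall share of each category.
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # All ratings fall in one category; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

A value near zero means the observed agreement is about what chance alone would produce, which is consistent with the abstract's point that raw agreement levels of 46% to 70% need not exceed chance.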

9.
Graphical and statistical indices employed to represent observer agreement in interval recording are described as "judgmental aids", stimuli to which the researcher and scientific community must respond when viewing observer agreement data. The advantages and limitations of plotting calibrating observer agreement data and reporting conventional statistical aids are discussed in the context of their utility for researchers and research consumers of applied behavior analysis. It is argued that plotting calibrating observer data is a useful supplement to statistical aids for researchers but is of only limited utility for research consumers. Alternatives to conventional per cent agreement statistics for research consumers include reporting special agreement estimates (e.g., per cent occurrence agreement and nonoccurrence agreement) and correlational statistics (e.g., Kappa and Phi).

10.
Various statistics have been proposed as standard methods for calculating and reporting interobserver agreement scores. The advantages and disadvantages of each have been discussed in this journal recently but without resolution. A formula is presented that combines separate measures of occurrence and nonoccurrence percentages of agreement, with weight assigned to each measure, varying according to the observed rate of behavior. This formula, which is a modification of a formula proposed by Clement (1976), appears to reduce distortions due to "chance" agreement encountered with very high or low observed rates of behavior while maintaining the mathematical and conceptual simplicity of the conventional method for calculating occurrence and nonoccurrence agreement.

11.
When a theoretical psychometric function is fitted to experimental data (as in the obtaining of a psychophysical threshold), maximum-likelihood or probit methods are generally used. In the present paper, the behavior of these curve-fitting methods is studied for the special case of forced-choice experiments, in which the probability of a subject's making a correct response by chance is not zero. A mathematical investigation of the variance of the threshold and slope estimators shows that, in this case, the accuracy of the methods is much worse, and their sensitivity to the way data are sampled is greater, than in the case in which chance level is zero. Further, Monte Carlo simulations show that, in practical situations in which only a finite number of observations are made, the mean threshold and slope estimates are significantly biased. The amount of bias depends on the curve-fitting method and on the range of intensity values, but it is always greater in forced-choice situations than when chance level is zero.

12.
Interobserver agreement (also referred to here as "reliability") is influenced by diverse sources of artifact, bias, and complexity of the assessment procedures. The literature on reliability assessment frequently has focused on the different methods of computing reliability and the circumstances under which these methods are appropriate. Yet, the credence accorded estimates of interobserver agreement, computed by any method, presupposes eliminating sources of bias that can spuriously affect agreement. The present paper reviews evidence pertaining to various sources of artifact and bias, as well as characteristics of assessment that influence interpretation of interobserver agreement. These include reactivity of reliability assessment, observer drift, complexity of response codes and behavioral observations, observer expectancies and feedback, and others. Recommendations are provided for eliminating or minimizing the influence of these factors from interobserver agreement.

13.
Among the problems of understanding mental pathology through a labeling perspective is the need to understand more about the attributional process itself. It is postulated that characteristics of observers, in particular their attitudes, influence the attribution of mental disorder to individuals manifesting deviant behavior. Questionnaire items were factor analyzed to produce several dimensions of attitudes and types of deviance. Tests of seven sub-hypotheses provide support for the major hypothesis that the probability mental disorder will be attributed by an observer to an actor is positively related to the degree an actor's behavior (implying beliefs or attitudes) differs from the beliefs and attitudes of the observer.

14.
The most common measure of agreement for categorical data is the coefficient kappa. However, kappa performs poorly when the marginal distributions are very asymmetric, it is not easy to interpret, and its definition is based on the hypothesis of independence of the responses (which is more restrictive than the hypothesis that kappa has a value of zero). This paper defines a new measure of agreement, delta, 'the proportion of agreements that are not due to chance', which comes from a model of multiple-choice tests and does not have the previous limitations. The paper shows that kappa and delta generally take very similar values, except when the marginal distributions are strongly unbalanced. The case of the 2 × 2 tables (which admits very simple solutions) is considered in detail.

15.
The authors describe the development and initial validation of a home-based version of the Laboratory Temperament Assessment Battery (Lab-TAB), which was designed to assess childhood temperament with a comprehensive series of emotion-eliciting behavioral episodes. This article provides researchers with general guidelines for assessing specific behaviors using the Lab-TAB and for forming behavioral composites that correspond to commonly researched temperament dimensions. We used mother ratings and independent postvisit observer ratings to provide validity evidence in a community sample of 4.5-year-old children. Twelve Lab-TAB behavioral episodes were employed, yielding 24 within-episode temperament components that collapsed into 9 higher level composites (Anger, Sadness, Fear, Shyness, Positive Expression, Approach, Active Engagement, Persistence, and Inhibitory Control). These dimensions of temperament are similar to those found in questionnaire-based assessments. Correlations among the 9 composites were low to moderate, suggesting relative independence. As expected, agreement between Lab-TAB measures and postvisit observer ratings was stronger than agreement between the Lab-TAB and mother questionnaire. However, for Active Engagement and Shyness, mother ratings did predict child behavior in the Lab-TAB quite well. Findings demonstrate the feasibility of emotion-eliciting temperament assessment methodologies, suggest appropriate methods for data aggregation into trait-level constructs, and set some expectations for associations between Lab-TAB dimensions and the degree of cross-method convergence between the Lab-TAB and other commonly used temperament assessments.

16.
In many learning or inference tasks human behavior approximates that of a Bayesian ideal observer, suggesting that, at some level, cognition can be described as Bayesian inference. However, a number of findings have highlighted an intriguing mismatch between human behavior and standard assumptions about optimality: People often appear to make decisions based on just one or a few samples from the appropriate posterior probability distribution, rather than using the full distribution. Although sampling-based approximations are a common way to implement Bayesian inference, the very limited numbers of samples often used by humans seem insufficient to approximate the required probability distributions very accurately. Here, we consider this discrepancy in the broader framework of statistical decision theory, and ask: If people are making decisions based on samples, but samples are costly, how many samples should people use to optimize their total expected or worst-case reward over a large number of decisions? We find that under reasonable assumptions about the time costs of sampling, making many quick but locally suboptimal decisions based on very few samples may be the globally optimal strategy over long periods. These results help to reconcile a large body of work showing sampling-based or probability matching behavior with the hypothesis that human cognition can be understood in Bayesian terms, and they suggest promising future directions for studies of resource-constrained cognition.

17.
Two sources of variability must each be considered when examining change in level between two sets of data obtained by human observers; namely, variance within data sets (phases) and variability attributed to each data point (reliability). Birkimer and Brown (1979a, 1979b) have suggested that both chance levels and disagreement bands be considered in examining observer reliability and have made both methods more accessible to researchers. By clarifying and extending Birkimer and Brown's papers, a system is developed using observer agreement to determine the data point variability and thus to check the adequacy of obtained data within the experimental context.

18.
Fighting between males is a frequent component of the rutting behavior of Cervidae. Frequent conflicts are exhausting; fighting may be risky and can lead to serious injuries or even death. We focused on the process of assessment of the opponent's fighting ability and escalation of the combat, estimating the probability of fighting based on the encounter components such as groaning and parallel walk. In this study, we observed the agonistic behavior of fallow deer bucks (Dama dama) during the rut over four seasons. During this time, we recorded 205 encounters between bucks. Non-contact display, which allows contestants to assess their opponent's fighting ability, occurred in 83% of the encounters. The highest predicted probability of a fight was found when both of the males vocalized and turned into the parallel walk. The chance of a clear outcome decreased when the males were fighting in comparison to when they did not fight. The initiator of the competitive encounter won 41% of the cases, while the attacked buck won 23% of the encounters. If the contestants avoided fighting, however, the initiator won 78% of encounters. Therefore, the initiator was more successful when no fight occurred compared to when the encounters escalated into fighting. In most cases where ritualized behavior occurred, one of the opponents left after vocalization or parallel walk occurred. Thus, vocalization and parallel walk increased the probability for a clear outcome. The probability of a fight was lowest in situations where the males displayed asymmetric behavior. Increased symmetry of the contestants' behavior was strongly correlated with a higher probability of a fight. Thus, these results indicate that fallow deer bucks use an efficient tactic during the rut, which, in turn, minimizes the chance of injury while fighting during the breeding season.

19.
When using behavioral-observation methods for coding video footage, it is unknown how much time of an interaction needs to be coded to gain results that are representative for the behavior of interest. The current study examined this problem using the INTAKT, a standardized observational measure for assessing the quality of mother-child interactions. Results from coding only 10 min of each video (i.e., thin slices) were compared with results from coding the remaining parts (averaging about 40 min) of the interaction. Inter-rater agreement for the short versions taken from the beginning or the middle, but not the end of the interactions indicated satisfactory observer accuracy. Coding results did not differ between short and long video sequences, when sequences were taken from the middle of the interactions. Importantly, characteristic differences between different interactive situations were equally well represented in the short and long video sequences. Therefore, our results show that coding only 10 min of an interaction is as reliable and valid as coding full-length videos, if those short sequences are taken from the middle of an interaction. Our findings support the idea that for every method, it is necessary to individually determine the window duration that is long enough to gain results that are reliable and valid.

20.
Methods are presented for estimating inter-subject variability of the probability of a given event defined in terms of subject's behavior (e.g., probability of a given choice in a discrimination experiment). The constraints consist of using no more than two independent observations for each subject. Estimators are provided for assessing the inter-subject “variance” of the analyzed probabilities; also, a method is given for testing the hypothesis that the average probability is the same for two groups of subjects.
