Similar literature
1.
We derive an analytic model of the inter-judge correlation as a function of five underlying parameters. Inter-cue correlation and the number of cues capture our assumptions about the environment, while differentiations between cues, the weights attached to the cues, and (un)reliability describe assumptions about the judges. We study the relative importance of, and interrelations among, these five factors with respect to inter-judge correlation. Results highlight the centrality of the inter-cue correlation. We test the model’s predictions with empirical data and illustrate its relevance. For example, we show that, typically, additional judges increase efficacy at a greater rate than additional cues.

2.
We investigate the case of a Decision Maker (DM) who obtains probabilistic forecasts regarding the occurrence of a target event from J distinct, asymmetric advisors. In this context, asymmetry is induced by manipulating: (1) amount of information (number of diagnostic cues) available to each advisor and (2) quality (accuracy) of advisors’ previous forecasts. Empirical results from two experiments indicate that the DM’s final estimate can be described as a weighted average of advisor forecasts, where the weights are sensitive to both sources of asymmetry. This work extends the model derived by Budescu and Rantilla (2000) for the DM’s confidence in the aggregate to accommodate advisor asymmetry. As in the symmetric case, the DM’s confidence in the weighted average of the forecasts is a function of the number of judges, the total number of cues, the (inferred) inter-judge correlation, and the level of inter-judge overlap in information. The extended model predicts that confidence increases as a function of asymmetry among judges. Empirical results support the main (ordinal) predictions of the model, including the predicted effect of inter-judge asymmetry.

3.
The average probability estimate of J > 1 judges is generally better than its components. Two studies test 3 predictions regarding averaging that follow from theorems based on a cognitive model of the judges and idealizations of the judgment situation. Prediction 1 is that the average of conditionally pairwise independent estimates will be highly diagnostic, and Prediction 2 is that the average of dependent estimates (differing only by independent error terms) may be well calibrated. Prediction 3 contrasts between- and within-subject averaging. Results demonstrate the predictions' robustness by showing the extent to which they hold as the information conditions depart from the ideal and as J increases. Practical consequences are that (a) substantial improvement can be obtained with as few as 2-6 judges and (b) the decision maker can estimate the nature of the expected improvement by considering the information conditions.

4.
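Nonparametric cognitive diagnostic classification methods are well suited to classroom assessment, but their diagnostic results take a 0-1 form and lack a probabilistic representation. They therefore cannot finely distinguish differences or changes in examinees' degree of attribute mastery, and they lack reliability and validity indices for evaluating classification results from real tests. To characterize differences in examinees' attribute mastery, the first problem is to give nonparametric cognitive diagnostic methods a way to quantify attribute mastery probabilities. To address this problem, a probabilistic representation of diagnostic results under nonparametric cognitive diagnostic methods is proposed, based on the binomial distribution and the Boltzmann distribution, and is used to construct classification accuracy and classification consistency indices. Results of a simulation study and a real-data analysis show that: the classification results of the probabilistic representation method are highly consistent with those of the nonparametric cognitive diagnostic method; the attribute mastery probabilities from the probabilistic representation method are very close to those obtained from cognitive diagnostic models; and the attribute (pattern) mastery probabilities from the probabilistic representation method can be used to compute attribute (pattern) classification accuracy and classification consistency indices, which in real testing situations can serve as reliability and validity indices for evaluating the test-retest consistency rate and the correct classification rate of the diagnostic results.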

5.
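Three methods for estimating confidence intervals for the attribute classification consistency reliability in cognitive diagnosis are introduced: the Bootstrap method, the parallel test method, and the parallel test pairing method. A simulation study verified and compared the performance of the three methods. The parallel test method and the Bootstrap method gave similar standard errors and confidence intervals when the number of examinees and the number of items were small, but as the number of examinees increased, the estimation precision of the Bootstrap method improved faster, and with many examinees and many items it essentially approached the results of the parallel test pairing method. The Bootstrap method required the least time, whereas the parallel test pairing method was computationally complex and time-consuming. The Bootstrap method is therefore recommended for estimating confidence intervals for the attribute classification consistency reliability in cognitive diagnosis.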
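The general idea of a Bootstrap confidence interval can be sketched as below. This is a minimal, generic percentile bootstrap, not the procedure used in the study: the function name `bootstrap_ci`, the 0/1 "consistent classification" data, and all parameter defaults are assumptions made only for illustration.

```python
import random

def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(values) (illustrative sketch)."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement n_boot times and recompute the statistic each time.
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[round((alpha / 2) * n_boot)]
    hi = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical data: 1 = examinee classified the same way on two administrations.
consistent = [1] * 30 + [0] * 10
lo, hi = bootstrap_ci(consistent, mean)
print(f"point estimate = {mean(consistent):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The interval narrows as the number of resampled units grows, which is consistent with the abstract's observation that precision improves with more examinees.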

6.
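Model selection is one of the important research problems in cognitive diagnostic assessment. In practice, practitioners do not know which model truly fits the data, and they usually use model fit indices to check the fit between model and data. From the standpoint of measurement quality, besides ensuring model-data fit, it is also necessary to evaluate the reliability and validity of the model's diagnostic results. Because previous studies have mostly relied on information-based fit indices to judge the match between model and data, this study proposes considering model fit indices and reliability indices jointly for model selection or for evaluating model misspecification. The experimental factors were the true model and the analysis model (the DINA, G-DINA, and R-RUM models), sample size, number of items, and number of attributes. Under a five-factor (3×3×2×2×2) experimental design, the rates of selecting the correct model were compared between the mean and standard error of the attribute classification consistency reliability obtained by Bootstrap interval estimation and the commonly used fit statistics -2LL, AIC, and BIC. The results show that -2LL performed better when the number of items was large, whereas AIC and BIC performed better when the number of examinees was large. The model selection rates of -2LL, AIC, and BIC were very unstable across study conditions, whereas those of the Bootstrap-estimated mean and standard error of the attribute classification consistency reliability were more stable across conditions and performed better overall.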
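For readers unfamiliar with the fit statistics named above, the standard definitions are AIC = -2LL + 2k and BIC = -2LL + k·ln(n), where k is the number of free parameters and n the sample size. The sketch below applies them to two candidate models; the -2LL values, parameter counts, and sample size are hypothetical numbers chosen for illustration, not results from the study.

```python
import math

def aic(neg2ll, n_params):
    """Akaike information criterion from -2 log-likelihood."""
    return neg2ll + 2 * n_params

def bic(neg2ll, n_params, n_obs):
    """Bayesian information criterion from -2 log-likelihood."""
    return neg2ll + n_params * math.log(n_obs)

# Hypothetical fits: model name -> (-2LL, number of free parameters).
fits = {"DINA": (5210.0, 60), "G-DINA": (5150.0, 120)}
n_obs = 500  # hypothetical number of examinees

best_neg2ll = min(fits, key=lambda m: fits[m][0])
best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_obs))
print(best_neg2ll, best_aic, best_bic)
```

With these made-up numbers the raw -2LL favors the larger model while AIC and BIC favor the smaller one, which illustrates how the criteria can disagree.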

7.
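DING Shuliang, MAO Mengmeng, WANG Wenyi, LUO Fen, CUI Ying. Acta Psychologica Sinica (《心理学报》), 2012, 44(11): 1535-1546
Building a correct cognitive model is one of the keys to successful cognitive diagnosis; if a cognitive diagnostic test cannot completely and accurately represent that cognitive model, the validity of the test is in question. Attributes and their hierarchy can represent a cognitive model. Assuming the cognitive model is correct, a quantitative formula is given to measure the extent to which a cognitive diagnostic test represents the cognitive model. Equivalence classes containing more than one knowledge state, and the reasons they arise, are analyzed, and modifications to the Hierarchy Consistency Index (HCI) of Cui et al. are suggested so as to better examine the consistency between the data and the cognitive model specified by experts.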

8.
A procedure for evaluating a variety of rater reliability models is presented. A multivariate linear model is utilized to describe and assess a set of ratings. The parameters of such a model are reexpressed in terms of a factor-analytic model. Maximum likelihood methods are employed to estimate and test the parameters in this factor-analytic model. The approach is related to the use of the intraclass correlation coefficient to estimate reliability. Two examples are presented, and the results contrasted to those found with an intraclass correlation approach. Extensions of the procedure to multiple sets of judges, multiple measures, and multiple groups are introduced.

9.
This paper analyzes existing research on the test–retest reliability of human judgment, i.e., the extent to which a judge makes identical judgments when presented with identical stimuli on two occasions. Only research involving professional judges who make experimental judgments in a reasonable analog of their everyday experience is included. Studies of both internal consistency reliability and temporal stability reliability are analyzed (where the former refers to the inclusion of repeat stimuli in the same experimental session, and the latter refers to the repeating of the experimental task from a few days to several months later). It is found that (1) the test–retest reliability literature is concentrated in four substantive judgment areas (medicine/psychology, meteorology, human resources management, and business), (2) the literature is extremely variable in terms of research approach/design, the determinants or correlates of test–retest reliability that have been studied, and the quality of the execution and analysis, and (3) mean test–retest reliability differs across both substantive judgment areas and the internal consistency versus temporal stability distinction. An inescapable conclusion from the analysis is that our knowledge of this fundamental property of human judgment is quite meager. Therefore, the paper concludes with suggestions about future research that would address test–retest reliability more systematically. Copyright © 2000 John Wiley & Sons, Ltd.

10.
Using an independent samples factorial design, this study examined the roles of accent (standard vs. nonstandard), speech rate (fast vs. medium vs. slow), and age of voice (younger vs. older sounding) on British listeners’ social evaluations of audiotaped voices using the matched-guise technique. In addition, listener judges’ level and nature of cognitive responding, their interpretations of the targets’ utterances and the (medium-term) recognition value of messages were uniquely explored as a function of these three independent variables. In general, standard speakers were upgraded on competence-related traits but downgraded on solidarity traits irrespective of age, with older speakers being perceived as less hesitant but more benevolent than their younger counterparts. An Age × Accent interaction effect showed that older-sounding standard speakers were judged the most competent and older-sounding nonstandard speakers the least competent. Favorable ratings were afforded speakers with medium rates, and slow-talking, younger-sounding speakers were particularly downgraded. All three independent variables affected ratings of listeners’ interpretations of the (same) text, while speaker age was the only effect on the recognition of message material 2 days later. The cognitive responding data showed that listener judges were most positive about the source when the target was fast talking and older sounding and were most negative to the fast-talking, younger-sounding, and standard-accented speaker. The diverse pattern of findings emerging at different levels of analysis underscores the important roles of cognitive mediation in language attitude studies in ways not explored sufficiently previously.

11.
The authors examined aspects of reliability and validity of the Goodenough-Harris Draw-A-Person Test (DAP; D. B. Harris, 1963). The participants were 115 seven- to nine-year-old students attending regular or special education schools. Three judges, with a modest degree of training similar to that found among practicing clinicians, rated the students' human figure drawings on developmental and personality variables. The authors found that counting details and determining developmental level in the DAP test could be carried out reliably by judges with limited experience. However, the reliability of judgments of children's social and emotional development and personality was insufficient. Older students and students attending regular schools received significantly higher scores than did younger students or students attending special education schools. The authors found that the success of the DAP test as an indicator of cognitive level, socioemotional development, and personality is limited when global judgments are used. The authors concluded that more specific, reliable, valid, and useful scoring systems are needed for the DAP test.

12.
The present study examined several psychometric issues relevant to the use of a favored technique (the Angoff method) used to set standards in criterion-referenced testing. The research was conducted within a setting which allowed (a) confident identification of expert and non-expert judges, and (b) estimation of "true" scores for items judged so that accuracy of judgments in addition to reliability could be examined. Results suggested that expertise of judges does make a difference in producing more accurate and reliable data, underscoring the importance of using true subject matter experts (SMEs) in the judgment process. A rater analysis technique (rater-total correlations) was illustrated, which might prove useful in improving the quality of data obtained using the Angoff method, particularly when there is some question regarding the internal consistency of ratings and expertise of some of the raters. Finally, a rater accuracy adjustment/calibration technique was examined and proved to be a potentially useful method to maximize accuracy of a standard derived using the Angoff method in settings where archival normative test data can be obtained. Other methods that could potentially be used to improve Angoff data were discussed.

13.
The concept of transliminality ("a hypothesized tendency for psychological material to cross thresholds into or out of consciousness") was anticipated by William James (1902/1982), but it was only recently given an empirical definition by Thalbourne in terms of a 29-item Transliminality Scale. This article presents the 17-item Revised Transliminality Scale (or RTS) that corrects age and gender biases, is unidimensional by a Rasch criterion, and has a reliability of .82. The scale defines a probabilistic hierarchy of items that address magical ideation, mystical experience, absorption, hyperaesthesia, manic experience, dream interpretation, and fantasy proneness. These findings validate the suggestions by James and Thalbourne that some mental phenomena share a common underlying dimension with selected sensory experiences (such as being overwhelmed by smells, bright lights, sights, and sounds). Low scores on transliminality remain correlated with "tough mindedness" on the Cattell 16PF test, as well as "self-control" and "rule consciousness," whereas high scores are associated with "abstractedness" and an "openness to change" on that test. An independent validation study confirmed the predictions implied by our definition of transliminality. Implications for test construction are discussed.

14.
A general model of consensus and accuracy in interpersonal perception
Consensus refers to the extent to which 2 judges agree in their ratings of a common target. A general model of interpersonal perception based on Anderson's (1981) weighted-average model is developed. The model shows that increased acquaintance does not always lead to large changes in consensus. Degree of overlap between the target behaviors observed by the judges and similarity of meaning systems are key but neglected parameters. The model can also be used as a basis for determining the accuracy of person perception. In some cases, accuracy can increase with greater acquaintance, whereas consensus may not.

15.
As a method specifically intended for the study of messages, content analysis is fundamental to mass communication research. Intercoder reliability, more specifically termed intercoder agreement, is a measure of the extent to which independent judges make the same coding decisions in evaluating the characteristics of messages, and is at the heart of this method. Yet there are few standard and accessible guidelines available regarding the appropriate procedures to use to assess and report intercoder reliability, or software tools to calculate it. As a result, it seems likely that there is little consistency in how this critical element of content analysis is assessed and reported in published mass communication studies. Following a review of relevant concepts, indices, and tools, a content analysis of 200 studies utilizing content analysis published in the communication literature between 1994 and 1998 is used to characterize practices in the field. The results demonstrate that mass communication researchers often fail to assess (or at least report) intercoder reliability and often rely on percent agreement, an overly liberal index. Based on the review and these results, concrete guidelines are offered regarding procedures for assessment and reporting of this important aspect of content analysis.
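Why percent agreement can be "overly liberal" is easy to see by comparing it with a chance-corrected index. The sketch below contrasts percent agreement with Cohen's kappa for two coders; kappa is used here only as one familiar chance-corrected index, not necessarily the one the article recommends, and the codings are invented for illustration.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of units on which two coders made the same decision."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance (undefined if p_e == 1)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if each coder assigned categories independently
    # at their own observed marginal rates.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of 10 messages by two coders.
coder_a = ["pos"] * 8 + ["neg"] * 2
coder_b = ["pos"] * 7 + ["neg", "pos", "neg"]

print(percent_agreement(coder_a, coder_b))
print(cohens_kappa(coder_a, coder_b))
```

Because both hypothetical coders use "pos" most of the time, much of their raw agreement would be expected by chance, so kappa comes out well below the percent agreement.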

16.
The Journal of Creative Behavior, 2017, 51(3): 216-224
When people generate responses during a divergent thinking task, some responses are “old” (retrieved from memory) and some are “new” (generated on the spot). K.J. Gilhooly, E. Fioratou, S.H. Anthony, and V. Wynn (2007) suggested that old and new responses stem from different cognitive strategies and differ in key ways. The present research explored the old/new scoring method in a sample of 143 young adults. After completing unusual uses tasks, the participants classified each response as old or new. The creativity of each response was also rated by three judges and by the participants themselves. As in past research, “old” responses appeared significantly earlier in the task and were rated as significantly less creative by both the judges and the participants. Old and new responses, however, correlated equally strongly with predictors of creative ability, such as openness to experience and its facets. Overall, the old/new scoring approach appears promising as a way of illuminating the diverse mental strategies people use to generate ideas.

17.
TENI (Test de Evaluación Neuropsicológica Infantil) is an instrument developed to assess cognitive abilities in children between 3 and 9 years of age. It is based on a model that incorporates games and technology as tools to improve the assessment of children’s capacities. The test was standardized with two Chilean samples of 524 and 82 children living in urban zones. Evidence of reliability and validity based on current standards is presented. Data show good levels of reliability for all subtests. Some evidence of validity in terms of content, test structure, and association with other variables is presented. This instrument represents a novel approach and a new frontier in cognitive assessment. Further studies with clinical, rural, and cross-cultural populations are required.

18.
Investigated the process of personality inference from voice quality using 24 male American stimulus persons who served as subjects in simulated jury discussions. Applying a Brunswikian lens model of the inference process, criteria, distal cues, proximal cues and attributions were measured by independent groups of judges: personality criteria by three peers of each stimulus person and, on the basis of content-masked voice samples, distal voice quality indicator cues by six phoneticians, proximal voice percept by ten naive judges, personality attributions by nine naive judges. Only extroversion attributions correlate significantly with the criterion, replicating earlier findings. For the inference of extroversion, contrary to other traits which apparently cannot be inferred accurately from voice quality, the following conditions are met: (a) the criterion is associated with ecologically valid voice energy cues (vocal effort and dynamic range), (b) these indicator cues are adequately represented as proximal voice percepts (particularly loudness and sharpness), and (c) percept utilization in the judges' inferential strategy corresponds to the association between criterion and distal indicator cues. Path-analytic procedures are used to test empirically the adequacy of the inference model to (a) account for the variance in the attributions, and (b) explain significant correlations between criteria and attributions in terms of mediating variables.

19.
20.
Alcohol and social behavior I: The psychology of drunken excess
Drinking alcohol clearly has important effects on social behaviors, such as increasing aggression, self-disclosure, sexual adventuresomeness, and so on. Research has shown that these effects can stem from beliefs we hold about alcohol effects. Less is known about how alcohol itself affects these behaviors. A cognitive explanation, that alcohol impairs the information processing needed to inhibit response impulses--the abilities to foresee negative consequences of the response, to recall inhibiting standards, and so on--has begun to emerge. We hypothesize that alcohol impairment will make a social response more extreme or excessive when the response is pressured by both inhibiting and instigating cues--in our terms, when it is under inhibitory response conflict. In that case, alcohol's damage to inhibitory processing allows instigating pressures more sway over the response, increasing its extremeness. In the present meta-analysis, each published test of alcohol's effect on a social or socially significant behavior was rated (validated against independent judges) as to whether it was under high or low inhibitory conflict. Over low-conflict tests, intoxicated subjects behaved only a tenth of a standard deviation more extremely than their sober controls, whereas over high-conflict tests they were a full standard deviation more extreme. The effect of conflict increased with alcohol dosage, was shown not to be mediated by drinking expectancies, and generalized with few exceptions across the 34 studies and 12 social behaviors included in this analysis.
