Although inter-rater reliability is an important aspect of using observational instruments, it has received little theoretical attention. In this article, we offer some guidance for practitioners and consumers of classroom observations so that they can make decisions about inter-rater reliability, both for study design and in the reporting of data and results. We reviewed articles in two major journals in the fields of reading and mathematics to understand how researchers have measured and reported inter-rater reliability in a recent decade. We found that researchers have tended to report measures of inter-rater agreement above the .80 threshold with little attention to the magnitude of score differences between raters. Then, we conducted simulations to understand both how different indices for classroom observation reliability are related to each other and the impact of reliability decisions on study results. Results from the simulation studies suggest that mean correlations with an outcome are slightly lower at lower levels of percentage of exact agreement but that the magnitude of score differences has a more dramatic effect on correlations. Therefore, adhering to strict thresholds for inter-rater agreement is less helpful than reporting exact point estimates and also examining measures of rater consistency. Accessed 2,893 times on https://pareonline.net from April 05, 2018 to December 31, 2019.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Wilhelm, Anne Garrison; Rouse, Amy Gillespie; and Jones, Francesca
"Exploring Differences in Measurement and Reporting of Classroom Observation Inter-Rater Reliability,"
Practical Assessment, Research, and Evaluation: Vol. 23, Article 4.
Available at: https://scholarworks.umass.edu/pare/vol23/iss1/4