This article argues that the general practice of describing interrater reliability as a single, unified concept is at best imprecise, and at worst potentially misleading. Rather than representing a single concept, different statistical methods for computing interrater reliability can be more accurately classified into one of three categories based upon the underlying goals of analysis. The three general categories introduced and described in this paper are: 1) consensus estimates, 2) consistency estimates, and 3) measurement estimates. The assumptions, interpretation, advantages, and disadvantages of estimates from each of these three categories are discussed, along with several popular methods of computing interrater reliability coefficients that fall under the umbrella of consensus, consistency, and measurement estimates. Researchers and practitioners should be aware that different approaches to estimating interrater reliability carry with them different implications for how ratings across multiple judges should be summarized, which may impact the validity of subsequent study results.
Accessed 123,170 times on https://pareonline.net from March 01, 2004 to December 31, 2019.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Stemler, Steven E.
"A Comparison of Consensus, Consistency, and Measurement Approaches to Estimating Interrater Reliability,"
Practical Assessment, Research, and Evaluation: Vol. 9, Article 4.
Available at: https://scholarworks.umass.edu/pare/vol9/iss1/4