Many situations in the healthcare sector rely on multiple people to collect research or clinical laboratory data. The question of consistency, or agreement, among the people collecting the data arises immediately because of the variability among human observers. Well-designed research studies must therefore include procedures that measure agreement among the different data collectors. Study designs typically involve training the data collectors and measuring the extent to which they record the same values for the same phenomena. Perfect agreement is rarely achieved, and confidence in the study results depends in part on how much disagreement, or error, is introduced into the data by inconsistency among the data collectors. The extent of agreement among the data collectors is called “interrater reliability”.

A good example of why the interpretation of kappa results matters is a paper comparing human visual detection of abnormalities in biological samples with automated detection (12). The results showed only moderate agreement between the human and automated raters by kappa (κ = 0.555), yet the same data yielded an excellent percent agreement of 94.2% (see the sketch below for how these two statistics can diverge). The problem in interpreting these two statistics is: how should researchers decide whether the raters are reliable? Do the results indicate that the vast majority of patients receive accurate laboratory results, and therefore correct medical diagnoses, or not? In the same study, the researchers designated one data collector as the standard and compared the results of five other technicians with that standard. Although the paper does not include enough data to calculate percent agreement, the kappa results were moderate. How is the head of the laboratory to know whether these results reflect high-quality measurement, with few disagreements among the trained laboratory technicians, or whether a serious problem exists and continued training is needed? Unfortunately, kappa statistics alone do not provide enough information to make such a decision. In addition, a kappa can have such a wide confidence interval (CI) that it spans everything from good to poor agreement.

So far, the discussion has assumed that the majority of raters are correct, that the minority raters are wrong in their scores, and that all raters made a deliberate choice when assigning their ratings.
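To make the comparison between kappa and percent agreement concrete, here is a minimal sketch in plain Python. The 2 × 2 counts are hypothetical, chosen only so that the two statistics diverge in roughly the way reported above; they are not the data from reference 12.

def kappa_and_percent_agreement(a, b, c, d):
    # a = both raters call the sample abnormal, d = both call it normal,
    # b and c = the two kinds of disagreement
    n = a + b + c + d
    p_observed = (a + d) / n  # simple percent agreement
    # agreement expected by chance alone, given each rater's marginal totals
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return p_observed, kappa

p_o, k = kappa_and_percent_agreement(a=20, b=15, c=14, d=451)
print(f"percent agreement = {p_o:.1%}, kappa = {k:.2f}")
# prints: percent agreement = 94.2%, kappa = 0.55

Because the “normal” category dominates this hypothetical table, the two raters agree on most samples almost by default, and kappa discounts precisely that kind of chance-driven agreement.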
Jacob Cohen recognized that this assumption of deliberate, informed rating could be wrong. Indeed, he explicitly stated that “in the typical situation, there is no criterion of ‘accuracy’ of judgments” (5). Cohen proposed the possibility that, at least for some variables, neither rater was sure what score to enter and both simply made random guesses. . . .
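For reference, Cohen's chance-corrected statistic is commonly written as

κ = (Pr(a) − Pr(e)) / (1 − Pr(e)),

where Pr(a) is the observed agreement between the raters and Pr(e) is the agreement that would be expected if both raters were simply guessing according to their own marginal frequencies. As a rough illustration, not a figure reported in reference 12, rearranging the formula for the study described above gives an implied chance agreement of Pr(e) = (0.942 − 0.555) / (1 − 0.555) ≈ 0.87; when two raters would already agree roughly 87% of the time by chance, an observed agreement of 94.2% corresponds to only a moderate kappa.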