A POSCE was developed and administered in 2015 to assess six professional attributes of Family Medicine (FM) residents at the University of Medicine and Pharmacy (UMP), Vietnam. This study aims to explore the inter-rater reliability of the FM POSCE developed in this context when analytic rubrics are applied.
Background: Previous POSCEs showed variability among raters in applying global marking items and holistic rating. Unlike holistic rubrics, analytic rubrics provide an explicit rationale for assigning a given score, which may reduce rater variability. Nonetheless, little is known about the extent to which switching to analytic rubrics influences the inter-rater reliability of a POSCE.
Methods: The POSCE was administered before the FM professionalism module (pretest) to 36 FM residents and after the module (posttest) to 42 FM residents. The raters in the pretest were 12 teachers from the FM training center; four faculty members from other faculties were added for the posttest alongside the 12 original raters. Rater training took place at two points in time: the first session, held before the pretest, involved only the 12 FM raters, and the second, held before the posttest, involved the four newly recruited raters. During the POSCE, one pair of raters observed all performances at each station. Inter-rater reliability was assessed by comparing the total scores of the two raters in each pair using paired t-tests and Pearson correlation coefficients.
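The per-pair comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the scores are synthetic, and only a single rater pair is shown.

```python
# Sketch of the inter-rater reliability analysis: for one pair of raters,
# compare total POSCE scores for the same residents with a paired t-test
# (systematic score difference) and a Pearson correlation (score agreement).
# All numbers below are hypothetical.
from scipy import stats

# Hypothetical total scores assigned by the two raters of one pair
rater_a = [72, 65, 80, 58, 90, 77, 69, 84, 61, 75]
rater_b = [70, 68, 78, 60, 88, 75, 72, 82, 63, 74]

t_stat, p_value = stats.ttest_rel(rater_a, rater_b)  # paired t-test
r, _ = stats.pearsonr(rater_a, rater_b)              # Pearson correlation

print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
print(f"Pearson correlation: r = {r:.2f}")
```

A non-significant t-test paired with a strong positive correlation, as reported in the pretest, would indicate that the two raters neither differ systematically in severity nor rank residents inconsistently.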
Results: In the POSCE pretest, no significant differences were found between raters' scores in most pairs, in contrast to the posttest. Most differences occurred in pairs in which one of the raters was among the newly recruited. In the pretest, moderate to strong positive correlations between raters' mean scores were found (r = 0.55-0.85); a similar range was seen in the posttest (r = 0.47-0.87), although the correlations weakened slightly.
Discussion and conclusion: The FM POSCE showed high inter-rater reliability when analytic grading rubrics were used. An analytic rubric might help minimize discrepancies among raters. Moreover, rater training might have been another factor influencing raters' consensus.