Foreign Language Annals • Vol. 36, No. 4 517
confidence in, and the generalizability of, the relatively high
level of interrater reliability and consistency demonstrated by
experienced ACTFL OPI interviewers. Moreover, the present
study included more languages, a larger sample of raters, and
a more comprehensive approach than previous studies.
Important as well is that this study directly addresses the
Standards for Educational and Psychological Testing (AERA,
1999) related to reporting reliability evidence to users of an
assessment. Overall, the results provide good news for those
who use the ACTFL OPI to make decisions about speaking
proficiency in the 19 languages examined in our study. We
recommend that ACTFL conduct and publish the results of
interrater consistency and agreement analyses every three to
five years to continue to meet the guidelines established by the
Standards. This becomes particularly salient for those languages, such as Tagalog and Albanian, that had small sample sizes. Furthermore, we encourage other providers of OPI assessments to follow this recommendation.
Research questions 2 through 5 were included to provide
additional information related to the functioning of the
ACTFL interview protocol under the Revised Guidelines for
speaking proficiency. If interrater reliability and agreement are
not affected by these other characteristics (e.g., language diffi-
culty), then these findings provide additional evidence that
the rating scale and protocols are functioning as intended and
with reasonable precision. Question 2 addresses whether
interrater consistency remains similar when the frequency of
testing in a language is considered. Because the levels of
interrater consistency are virtually identical across the two
testing-frequency categories (MCT and LCT), we can conclude
that the protocol is not unduly affected by how often a
language is tested. Had a difference emerged, its causes would
have been a point for future investigation.
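The comparison described above can be illustrated with a brief sketch: a consistency index (here, a plain Pearson correlation between paired ratings) and an exact-agreement percentage computed separately for each testing-frequency group. The scale coding and all rating pairs below are invented for illustration; they are not the study's data or indices.

```python
# Hypothetical sketch: comparing interrater consistency for more commonly
# tested (MCT) vs. less commonly tested (LCT) languages. Ratings are coded
# on an ordinal scale (e.g., Novice-Low = 1 ... Superior = 10); all pairs
# below are invented for illustration only.

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# (rater1, rater2) pairs for each group -- invented numbers
mct = [(3, 3), (5, 4), (7, 7), (2, 2), (8, 8), (6, 6)]
lct = [(4, 4), (6, 5), (9, 9), (1, 1), (7, 7), (3, 3)]

for name, pairs in [("MCT", mct), ("LCT", lct)]:
    r1, r2 = zip(*pairs)
    r = pearson(r1, r2)
    exact = sum(a == b for a, b in pairs) / len(pairs)
    print(f"{name}: consistency r = {r:.2f}, exact agreement = {exact:.0%}")
```

A finding like the study's would correspond to the two groups yielding near-identical values on both indices.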
Research question 3 addresses the issue of whether the
difficulty level of the language tested (in terms of language
learning) has an impact on the reliability of the OPI ratings.
Again, the results indicate that language difficulty has no sig-
nificant impact on interrater consistency and agreement,
suggesting that this is not an issue for further investigation.
Research question 4 investigates interrater reliability
within each language. For the 19 languages in this study, the
interrater consistency and agreement results exceeded the
acceptable threshold and were very consistent across the different
indices. The results for two languages, Italian and Arabic, were
slightly lower than for the majority of the languages. With the
data available to us, we were unable to empirically determine
why this might be the case. A number of different factors
could be affecting the consistency and agreement of the
ratings, including characteristics of the raters, characteristics
of the ratees, characteristics of the language or dialects, or the
interaction of these factors. Of course, it could be a case of
simply having aberrant raters who are in need of more train-
ing. We recommend that ACTFL investigate this issue by
examining the most likely factors. Investigating the function-
ing of individual raters and pairs of raters would a good place
to start.
Question 5 addresses whether or not the interrater agree-
ment results vary across the major proficiency levels (Novice,
Intermediate, Advanced, and Superior) and the nature of the
disagreement (i.e., within or between major proficiency lev-
els). Unlike questions 2 through 4, question 5 was expected
to show a difference between proficiency categories.
Whenever subjective ratings are being
made across a continuum and rater agreement is calculated,
the highest agreement between raters should be expected for
the extreme scale points or values, as the extremities of per-
formance are generally the easiest to detect and consequently
rate. Our results demonstrate this pattern: Novice-Low and
Superior showed the highest percentages of absolute
agreement. In terms of the nature of the disagreements, virtu-
ally all of the disagreements were within one scale point or
step (e.g., Novice-Low versus Novice-Mid or Novice-High
versus Intermediate-Low). The majority of the disagreements
(58.5%) were within the same major level, and the disagree-
ments that crossed a major level boundary were spread across
the three boundaries. Additionally, no disagreements spanned
two major proficiency category boundaries. Overall, the
results of research questions 2 through 5 provide additional
evidence that the ACTFL rating scale and protocols are func-
tioning as intended. This further bolsters our confidence in
the reliability of the ACTFL OPI procedure.
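The disagreement analysis summarized above can be sketched as follows: code each sublevel as an ordinal value, then label every rater pair's disagreement by its step distance and by whether it crosses a major-level boundary. The sublevel ordering shown is the standard ACTFL scale, but the coding scheme and the example pairs are our own illustrative assumptions, not the study's procedure.

```python
# Hedged sketch of the disagreement classification described above: each
# sublevel gets an ordinal code, and each rating pair is labeled by its
# step distance and by whether it crosses a major-level boundary.
# The coding and sample pairs are illustrative assumptions.

SUBLEVELS = [
    "Novice-Low", "Novice-Mid", "Novice-High",
    "Intermediate-Low", "Intermediate-Mid", "Intermediate-High",
    "Advanced-Low", "Advanced-Mid", "Advanced-High",
    "Superior",
]
INDEX = {name: i for i, name in enumerate(SUBLEVELS)}

def major(level):
    """Major proficiency category of a sublevel (e.g., 'Novice')."""
    return level.split("-")[0]

def classify(r1, r2):
    """Return (step_distance, crosses_major_boundary) for a rating pair."""
    steps = abs(INDEX[r1] - INDEX[r2])
    return steps, major(r1) != major(r2)

# Invented example pairs, mirroring the patterns reported in the text
pairs = [
    ("Novice-Low", "Novice-Mid"),         # 1 step, within Novice
    ("Novice-High", "Intermediate-Low"),  # 1 step, crosses a boundary
    ("Advanced-Mid", "Advanced-Mid"),     # exact agreement
]
for a, b in pairs:
    steps, crossed = classify(a, b)
    print(a, b, steps, crossed)
```

Under this coding, "within one scale step" means a step distance of 1, and "spanning two major category boundaries" would require the pair's majors to be non-adjacent, which the study did not observe.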
The results for question 6 demonstrate that when the first
and second raters disagree, the third rater tends to
“break the tie” more often in favor of the second rater. This
finding held across languages and characteristics in this study.
The findings are quite robust on this point, as is apparent in
the absolute agreement indices found in Table 4. As noted ear-
lier, second and third raters always rate from the audiotape
without having telephonic contact with the ratee, whereas the
first rater conducts the interview and then rates from the
audiotape at a later time. This difference in role could explain
the results for question 6, although several other factors could
be driving the effect. The
important question is whether conducting the interview as
well as rating it affects the psychometric characteristics of the
assessment, especially the validity of the assessment. However,
given the high initial agreement between the first and second
raters, the differential roles of testers (i.e., interviewing and
rating as opposed to rating only) may have no impact on
overall validity. The findings from question 6 could be a
function of the specific raters involved in the disagreements,
not a function of the role differences. Therefore, research
should be conducted to determine if the validity is affected
and why.
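The tie-break pattern behind question 6 amounts to a simple tally: among cases where the first and second raters disagree, count how often the third rater's rating matches each of them. The sketch below uses invented ordinal rating triples, not the study's data.

```python
# Illustrative sketch: when raters 1 and 2 disagree, tally which of them the
# third ("tie-breaking") rater sides with. Ratings are ordinal codes; the
# triples below are invented for illustration.

def tally_tiebreaks(triples):
    """triples: iterable of (rater1, rater2, rater3) ratings. Returns counts
    of how often rater3 matches rater1, rater2, or neither, over the
    triples where raters 1 and 2 disagree."""
    counts = {"rater1": 0, "rater2": 0, "neither": 0}
    for r1, r2, r3 in triples:
        if r1 == r2:
            continue  # no tie to break
        if r3 == r1:
            counts["rater1"] += 1
        elif r3 == r2:
            counts["rater2"] += 1
        else:
            counts["neither"] += 1
    return counts

triples = [(5, 4, 4), (7, 6, 6), (3, 4, 3), (6, 6, 6), (8, 7, 7)]
print(tally_tiebreaks(triples))  # here rater3 sides with rater2 most often
```

A pattern like the study's would show the "rater2" count exceeding the "rater1" count across languages; whether that reflects the raters' differing roles or the particular raters involved is the open question the text raises.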
In general, the ratings generated by any of the OPI
interview procedures should validly and reliably measure the
construct of interest (language speaking proficiency) and
describe proficiency regardless of the testing mode (e.g., in-
person, telephonic, or video conferencing) and whether or not
the rater conducted the interview. In other words, mode and
rater role (i.e., rater only versus interviewer and rater) should
not affect the ratings assigned to the interviewee’s proficiency
by the raters. When assessment procedures depend on human