Reliability and the ACTFL Oral
Proficiency Interview: Reporting
Indices of Interrater Consistency
and Agreement for 19 Languages
Eric A. Surface
Surface, Ward & Associates
Erich C. Dierdorff
DePaul University
Abstract:
The reliability of the ACTFL Oral Proficiency Interview (OPI) has not been reported since
ACTFL revised its speaking proficiency guidelines in 1999. Reliability data for assessments should be
reported periodically to provide users with enough information to evaluate the psychometric characteris-
tics of the assessment. This study provided the most comprehensive analysis of ACTFL OPI reliability to
date, reporting interrater consistency and agreement data for 19 different languages. Overall, the interrater
reliability of the ACTFL OPI was found to be very high. These results demonstrate the importance of using
an OPI assessment program that has a well-designed interview process, a well-articulated set of criteria for
proficiency determination, a solid rater training program, and an experienced cadre of testers. Based on the
data reported, educators and employers who use the ACTFL OPI can expect reliable results and use the
scores generated from the testing process with increased confidence. Recommendations for future research
are discussed.
Introduction
In 1999, the ACTFL revised the ACTFL Proficiency Guidelines—Speaking (Breiner-Sanders et al.,
2000) “to make the document more accessible to those who have not received recent training in
ACTFL oral proficiency testing, to clarify the issues that have divided testers and teachers, and to pro-
vide a corrective to what the committee perceived to have been possible misinterpretations of the
descriptions provided in earlier versions of the Guidelines” (p. 14). One of the most significant
changes to the 1986 Guidelines from a measurement perspective was “the division of the Advanced
level into the High, Mid, and Low sublevels” (p. 14). Previously, there were two categories at the
Advanced level (Advanced and Advanced-High). The 1999 Guidelines thus created
three rating options to describe the Advanced level of proficiency—Advanced-High, Advanced-Mid,
Eric A. Surface (PhD, North Carolina State University) is a principal and researcher with
Surface, Ward & Associates, an organizational consulting and research firm based in Raleigh,
North Carolina, and serves as the Director of Training Research for the Special Operations Forces
Language Office, Ft. Bragg, North Carolina, as part of his fellowship with the Army Research
Institute’s Consortium Research Fellows Program.
Erich C. Dierdorff (PhD, North Carolina State University) is a Visiting Professor in the College
of Commerce at DePaul University and a Consortium Research Fellow with the Army Research
Institute, Ft. Bragg, North Carolina.
and Advanced-Low—by dividing the previous Advanced pro-
ficiency category into two and retaining the Advanced-High
level to form the current conceptualization of the Advanced
major level.
Although adapting the scale did not change the Oral
Proficiency Interview (OPI) process, modifying the rating
scale used by ACTFL certified testers to describe speaking pro-
ficiency might have an impact on the measurement properties
of the assessment. Therefore, it is important to determine
whether the change in the Guidelines affected the psychome-
tric characteristics (i.e., the validity and reliability) and quality
of the ratings generated during the ACTFL OPI process.
The Standards for Educational and Psychological Testing,
published by the American Educational Research Association
(AERA) (1999), provide evaluative guidelines for the users,
developers, and publishers of tests, referring to any “evaluative
device or procedure in which a sample of an examinee’s behav-
ior in a specified domain [test content area] is obtained and
subsequently evaluated and scored using a standardized
process” (p. 3), not simply restricted to paper-and-pencil
assessments. Validity refers to “the degree to which evidence
and theory support the interpretations of the test scores
entailed by proposed uses of tests” (p. 9), whereas reliability
indicates the ability of the testing procedure to provide a con-
sistent measure of the specified domain when repeated.
The validity and reliability of a testing procedure should
be demonstrated periodically, especially if the procedure has
been modified or the specified domain has been redefined in
some meaningful way. Thus, our study assessed the reliability
of the ACTFL OPI across and within 19 languages. As with
previous reliability studies (e.g., Thompson, 1995), we did not
address the validity of the ACTFL OPI as a measure of speak-
ing proficiency. Establishing validity requires multiple studies
that provide evidence supporting that a test or assessment
effectively measures the construct it purports to measure and
can be used for a specific purpose. Validity evidence can take
many forms, depending on the use of the test and the purpose
of the validation study (e.g., criterion-related validation if the
assessment is used to predict future job performance). This
type of research is beyond the scope of the current study. The
information presented herein specifically addresses the
requirements of the Standards related to presenting reliability
data and will help users of the ACTFL OPI evaluate the ratings
produced by the procedure.
Research Background and
Literature Review
ACTFL Proficiency Guidelines—Speaking, Revised
The main impetus for reassessing the reliability of the ACTFL
OPI comes from the revision of the proficiency guidelines
(Breiner-Sanders et al., 2000). The Revised Guidelines made
several modifications to the previous version of the ACTFL
Proficiency Guidelines (ACTFL, 1986). Although the com-
mittee made several changes related to the presentation of the
Guidelines, the primary modification was to divide what was
previously defined as the Advanced level of proficiency into
the Advanced-Mid and Advanced-Low sublevels and aggre-
gate the two new categories, along with the existing
Advanced-High level, into the current conceptualization of the
Advanced major level. Refinement of the measurement scale
in this way could substantially affect the psychometric proper-
ties of the ACTFL OPI, thus making it necessary to examine
the reliability of ratings produced from the revised criteria. Of
course, this reliability research could not be conducted until
enough data were available for analysis.
The Guidelines provide an a priori set of criteria against
which interviewers measure and evaluate an individual’s
functional competency in speaking a language, as
demonstrated by the test taker’s ability to accomplish linguistic
tasks at the various proficiency levels. The ACTFL guidelines
are based on the Interagency Language Roundtable (ILR)
descriptions of language proficiency for use in governmental
and military organizations and have been modified for
use in academia and industry. The ACTFL rating scale
describes four major levels of language proficiency—Superior,
Advanced, Intermediate, and Novice—that are delineated
according to a hierarchy of global tasks related to functional
language ability (e.g., can narrate and describe in all major
time frames).
Three of the major levels (Advanced, Intermediate, and
Novice) are further divided into three sublevels—High, Mid,
and Low. Superior is the only major category that is not divid-
ed into sublevels. Combining the ACTFL major levels and
sublevels yields a total of 10 separate proficiency categories.
A complete description of the proficiency categories is provid-
ed in the published Guidelines (Breiner-Sanders et al., 2000)
and can also be obtained through the ACTFL Web site
(www.actfl.org).
Reliability and Interrater Consistency
Consistency, defined as the extent to which separate measurements retain their relative position, is the essential notion of classical relia-
bility (Anastasi, 1988; Cattell, 1988; Feldt & Brennan, 1989;
Flanagan, 1951; Stanley, 1971; Thorndike, 1951). Simply put,
reliability is the extent to which an item, scale, procedure, or
instrument will yield the same value when administered across
different times, locations, or populations. In the specific case
of rating data, the focus of reliability estimation turns to the
homogeneity of judgments given by the sample raters. One of
the most commonly used forms of rater reliability estimation
is interrater reliability, which portrays the overall level of con-
sistency among the sample of raters involved in a particular
judgment process. When interrater reliability estimates are high, the interpretation is that there is a large degree of consistency across the sampled raters.
Another common approach to examining interrater
consistency is to use measures of agreement. Whereas
interrater reliability estimates are parametric and correla-
tional in nature, measures of agreement are nonparametric
and assess the extent to which raters give concordant or
discordant ratings to the same objects (e.g., interviewees).
Technically speaking, measures of agreement are not
indices of reliability per se, but are nevertheless quite use-
ful in depicting levels of rater agreement and consistency
of specific judgments, particularly when data can be
considered ordinal or nominal.
Items, tests, raters, or procedures generating judg-
ments must yield reliable measurements to be useful and
have psychometric merit. Data that are unreliable are, by
definition, unduly affected by error, and decisions based
upon such data are likely to be quite tenuous at best and
completely erroneous at worst. Although validity is con-
sidered the most important psychometric measurement
property (AERA, 1999), the validity of an assessment is
negated if the construct or content domain cannot be mea-
sured consistently. In this sense, reliability can be seen as
creating a ceiling for validity.
The Standards for Educational and Psychological Testing
(AERA, 1999) provide a number of guidelines designed to
help test users evaluate the reliability data provided by test
publishers. According to the Standards, a test developer or
distributor has the primary responsibility for obtaining and
disseminating information about an assessment proce-
dure’s reliability. However, under some circumstances, the
user must accept responsibility for documenting the relia-
bility and validity in the local population. The level of reliability evidence that must be assessed and reported depends on the purpose of the test or assessment pro-
cedure. For example, if the assessment is used to make
decisions that are “not easily reversed” or “high stakes”
(e.g., employee selection or professional school admis-
sion), then “the need for a high degree of precision [in the
reliability data reported] is much greater” (p. 30).
Given the nature of the ACTFL OPI and our study, the
following Standards (AERA, 1999) are particularly note-
worthy: (1) reliability estimates should be reported for
each test score, subscore, or combination of scores
(Standard 2.1); (2) reliability coefficients from similar
assessments (e.g., Defense Language Institute’s [DLI] OPI)
are not interchangeable unless their implicit definitions of
measurement error are equivalent (Standard 2.5); (3) evi-
dence of both interrater consistency and within examinee
consistency over repeated measurements should be pro-
vided for assessments when subjective judgment enters
into the scoring process (Standard 2.10); (4) test develop-
ers should document the process for the selection and
training of raters as well as scorer reliability and drift over
time (Standard 3.23); and (5) test developers and publish-
ers are responsible for amending, revising, or withdrawing
a test as new research data becomes available (Standard
3.25). Taken together, providers of OPIs or other
test/assessment procedures have the responsibility to
report and periodically update the reliability data for their
procedures. Thus, the Standards provide a strong justifica-
tion for the research in this study.
Previous OPI Reliability Research
Although several studies have investigated and reported
reliability data for the ACTFL OPI (e.g., Magnan, 1987;
Thompson, 1995), all available reliability evidence pre-
dates the Revised Guidelines and uses data collected under
the 1986 criteria (ACTFL, 1986).
Magnan (1986) found that interrater agreement for a
sample of 40 students of French rated by two ACTFL-certified
testers from the Educational Testing Service was .72 (Cohen’s
kappa). All rater disagreements were one sublevel apart (e.g.,
Mid versus High) within the same major proficiency level
(e.g., Intermediate). Magnan (1987) found that interrater reli-
ability between trainer and trainee ratings of French profi-
ciency in a two-phase study was significant (the coefficients were r = .94, .94; τ = .83, .86; κ = .53, .55; and Γ = .94, .95 for the two phases, respectively), and rater disagreements were
again within only one sublevel in the majority of the
instances.
In a construct validity study, Dandonoli and Henning
(1990) reported interrater reliability for ratings of speaking
proficiency in English (r = .98) and in French (r = .97).
Thompson (1995) presented the most comprehensive evalu-
ation of interrater reliability for the ACTFL OPI under the
1986 Guidelines. Thompson (1995) provided coefficients
(Pearson’s correlations) for five languages: .87 for French, .85
for Spanish, .90 for Russian, .84 for English, and .86 for
German. Modified Cohen’s kappa coefficients and the per-
centages of absolute and partial agreement were reported as
well. The current study builds upon and extends this body of
research with the ACTFL OPI.
Although other organizations (e.g., DLI) provide assess-
ments of speaking proficiency, no comprehensive reports of
interrater reliability data were found in a review of the past
decade’s research. However, some studies (e.g., Jackson, 1999)
have reported reliability data within the confines of their
defined research scopes. Jackson (1999), who investigated
the impact of test modality (e.g., telephone or face-to-face)
on oral proficiency testing at DLI, reported Kendall’s
tau-b coefficients for Russian and Arabic OPI ratings across
several modalities for a small sample of participants. The
coefficients ranged from .90 to 1.00 for the original
ratings within the same testing mode (see Jackson, 1999
for details).
Although previous research supports the reliability of
ILR-based OPI ratings (e.g., Adams, 1978; Bachman &
Palmer, 1981; Carroll, 1967; Clark, 1986), the lack of
current and comprehensive reliability data from large
samples makes comparisons inappropriate. Additionally,
differences in the rating process between ACTFL and
other OPI processes (e.g., DLI) limit the comparability of the
results as well. Therefore, our study will only discuss
the current findings in the context of previous research
specifically related to the ACTFL OPI.
Research Questions
With the previously discussed considerations in mind, the
present study sought to investigate the levels of interrater
consistency derived from experienced ACTFL-certified
testers using the Revised Guidelines. The following six
specific research questions were examined:
1. What is the overall interrater consistency and agree-
ment for all languages tested with the same ACTFL
Revised Guidelines and rating protocol? Overall inter-
rater consistency and agreement refers to calculating
the coefficients across all pairs of raters in all lan-
guages.
2. Do interrater consistency and rater agreement levels
vary across languages that are more commonly tested
compared to those that are less commonly tested?
3. Do interrater consistency and rater agreement levels
vary according to language difficulty (i.e., the level of
difficulty for learning a given language)?
4. Do ratings of particular languages show more consis-
tency or greater agreement than others?
5. Does rater agreement vary across proficiency cate-
gories? If so, what is the nature of disagreement (i.e.,
within a major proficiency level or between two major
proficiency levels)?
6. When the first and second raters disagree and a third
rater must be utilized, is the third rater significantly
more likely to resolve the disagreement in favor of one
rater more often than the other (i.e., are interrater reli-
ability and agreement higher between the first and
third raters or the second and third raters)?
Methods
Participants and Rating Methodology
A total of 5881 interviews, conducted and rated by experienced ACTFL-certified testers using the ACTFL assessment procedure, were included in this study. The ACTFL
OPI assessment procedure, as described in the ACTFL Oral
Proficiency Interview Tester Training Manual (Swender,
1999), consists of four phases (Warm Up, Level Checks,
Probes, and Wind Down) that are designed to efficiently
elicit a ratable sample.
This study used data from oral proficiency interviews in
19 different languages: English, Mandarin, French, German,
Italian, Japanese, Russian, Spanish, Hebrew, Czech, Arabic,
Vietnamese, Portuguese, Polish, Albanian, Hindi, Tagalog,
Cantonese, and Korean. Table 1 provides the number of inter-
views included in the study by language. The data were made
available by Language Testing International (LTI), the ACTFL
testing affiliate.
Two characteristics of the tested languages were used to
code each case (each interviewee’s data represents a case) into
groups used for subsequent analyses. The first language char-
acteristic used in this study was testing density, which repre-
sented LTI’s frequency of assessing a particular language. Cases
in languages with high testing volumes were coded as More
Commonly Tested (MCT) languages, while languages with
lower volumes were coded as Less Commonly Tested (LCT).
LTI’s own internal categorization was used to code testing den-
sity. All cases in English, Mandarin, French, German, Italian,
Japanese, Russian, and Spanish were considered MCT lan-
guages. All cases in Hebrew, Czech, Arabic, Vietnamese,
Portuguese, Polish, Albanian, Hindi, Tagalog, Cantonese, and
Korean were considered LCT languages.
The second language characteristic was language
difficulty, which was derived by applying the language
difficulty categories used by the American Council on
Education (ACE) in its recommendations for granting college
credit for official ACTFL OPI ratings to each of the cases.
These categories were labeled Category I through Category IV
and represent the relative difficulty for learning the language
from the perspective of a native English speaker. Higher cate-
gories represent more difficult languages to learn. For its pur-
poses, ACE considers English a Category I language. However,
the language category assignment should differ depending on
the speaker’s first language. Although coding English as a
Category I language in our analyses could be potentially prob-
lematic, we chose to mirror the operationalization of the
Categories in the ACE recommendations for college credit to
maintain alignment with their use. The language difficulty
categories are equivalent to the ones used by military and
governmental organizations.
As stipulated by the standard procedure for all ACTFL
OPI assessments, each case was rated by a pair of testers. Some
cases required a third tester to serve as a “tie-breaker” in situ-
ations of discrepancy between the pair’s proficiency ratings. In
all cases, the first rater conducted and audiotaped the inter-
views. This rater then judged the interviewee's speaking proficiency from the tape at a later time.
Next, the taped interviews were independently rated by a
second rater. All raters used the ACTFL rating scale described
in the ACTFL Proficiency Guidelines—Speaking, Revised
(Breiner-Sanders et al., 2000) to describe the proficiency levels
of the interviewees. If the independent ratings provided by the
rating pair disagreed, a third rater was assigned as an arbitra-
tor to rate the interview tape. This rater did not know the pre-
viously assigned scores, nor that he or she was the third rater.
No fourth raters were needed to reach a final rating (i.e., the
rating of the third rater always agreed with the rating of either
the first or second rater).
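The adjudication logic described above can be summarized in a brief sketch. This is an illustration only; the function name and the string labels for the proficiency categories are hypothetical and are not part of ACTFL's operational procedure.

```python
def final_rating(rating_1, rating_2, rating_3=None):
    """Illustrative adjudication of a pair of independent ratings, with a blind
    third rating used only when the original pair disagrees."""
    if rating_1 == rating_2:
        return rating_1  # the paired ratings agree; no arbitration is needed
    if rating_3 is None:
        raise ValueError("a third (arbitrating) rating is required for a disagreement")
    if rating_3 in (rating_1, rating_2):
        return rating_3  # the third rating sides with one of the original raters
    raise ValueError("arbitration matched neither original rating")

# Example: the original raters disagree, and the third rater settles the case.
print(final_rating("Advanced-Mid", "Advanced-Low", "Advanced-Low"))  # Advanced-Low
```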
Throughout this article, the “first” rater always corre-
sponds to the tester who conducted the interview, whereas the
“second” and “third” raters represent those who rated inter-
viewees from the audiotapes. All raters were ACTFL-certified,
meaning that they had completed the ACTFL OPI tester cer-
tification process as described in the ACTFL OPI Tester
Certification Information Application Packet (ACTFL, 2002).
These testers are required to keep current through ongoing
training, testing, and norming procedures. Testing experience
varied across raters. Both native and nonnative speakers
served as raters. The total number of certified testers also
varied across languages.
Analytic Procedure
In order to more accurately assess the extent of interrater
consistency, we used a multimethod approach. Interrater
consistency can be conceptualized from several perspec-
tives (e.g., interrater reliability, interrater agreement, and so
forth) and, thus, a multimethod approach allows for a
more complete picture of the level of rating consistency.
We also sought to include similar statistics to those previ-
ously employed in prior research examining interrater con-
sistency of the ACTFL OPI. The overall rationale was to
expand the breadth of rater consistency assessment, as well
as to yield estimates comparable to past assessments.
Pearson correlation. Sometimes called a product–moment
correlation, Pearson correlation (r) is one of the most widely
used methods of assessing interrater reliability. This correla-
tion assesses the degree to which ratings covary. In this sense,
reliability can be depicted in the classical framework as the
ratio of true score variance to total variance (i.e., variance in
ratings attributable to true speaking proficiency divided by
total variance of ratings).
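As an illustration, the sketch below computes Pearson's r for a pair of raters in Python. The numeric coding of the 10 proficiency categories (0 = Novice-Low through 9 = Superior) and the rating values themselves are hypothetical and are used only to show the calculation.

```python
from scipy.stats import pearsonr

# Hypothetical paired ratings coded 0 (Novice-Low) through 9 (Superior)
rater_1 = [3, 5, 7, 9, 6, 4, 8, 9, 2, 7]
rater_2 = [3, 5, 8, 9, 6, 4, 8, 9, 2, 6]

r, p_value = pearsonr(rater_1, rater_2)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")
```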
Spearman rank–order correlation. This is another com-
monly used correlation for assessing interrater reliability, par-
ticularly in situations involving ordinal variables. Spearman
rank–order correlation (R) has an interpretation similar to
Pearson’s r; the primary difference between the two correla-
tions is computational, as R is calculated from ranks and r is
based on interval data. This statistic is appropriate for the OPI
data in that the proficiency categories are ordinal in nature.
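Continuing with the same kind of hypothetical 0 to 9 coding, Spearman's R can be computed as follows; the ranking of the data is handled internally.

```python
from scipy.stats import spearmanr

rater_1 = [3, 5, 7, 9, 6, 4, 8, 9, 2, 7]  # hypothetical 0-9 proficiency codes
rater_2 = [3, 5, 8, 9, 6, 4, 8, 9, 2, 6]

rho, p_value = spearmanr(rater_1, rater_2)  # ranks are computed internally
print(f"Spearman R = {rho:.3f}, p = {p_value:.4f}")
```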
Kendall’s tau. Tau (τ) is equivalent to Spearman’s R with
regard to the underlying assumptions. However, tau and R
carry different interpretations. R is a correlation and thus rep-
resents a proportion of variability accounted for, whereas tau
is a measure of agreement and represents the difference
between two probabilities. Tau is the difference between the
probability that the cases are rated in the same order by the
two raters and the probability that the cases are rated in
different orders by the two raters.
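A minimal sketch for tau follows, again with hypothetical codings. Note that SciPy's kendalltau returns tau-b by default, a variant that adjusts for tied ratings, so its value may differ slightly from an unadjusted tau.

```python
from scipy.stats import kendalltau

rater_1 = [3, 5, 7, 9, 6, 4, 8, 9, 2, 7]  # hypothetical 0-9 proficiency codes
rater_2 = [3, 5, 8, 9, 6, 4, 8, 9, 2, 6]

tau, p_value = kendalltau(rater_1, rater_2)  # tau-b (tie-adjusted) by default
print(f"Kendall tau = {tau:.3f}, p = {p_value:.4f}")
```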
Goodman and Kruskal’s gamma. Similar to tau, gamma (Γ)
is a probability-based measure of agreement. However, unlike
tau, gamma does not penalize for ties in that they are compu-
tationally ignored. As it is desirable to have high interrater
consistency (i.e., a large number of tied ratings), gamma can
provide useful information beyond that given by tau in terms
of interrater consistency. As tied ratings are computationally
ignored, the result is that gamma is typically higher in
magnitude than tau.
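Gamma can be computed directly from the counts of concordant and discordant pairs, with tied pairs skipped as the definition requires. The sketch below does this by brute force over all pairs of cases; the rating data are again hypothetical.

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), where C and D are the numbers of concordant
    and discordant pairs of cases; tied pairs are ignored."""
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

rater_1 = [3, 5, 7, 9, 6, 4, 8, 9, 2, 7]  # hypothetical 0-9 proficiency codes
rater_2 = [3, 5, 8, 9, 6, 4, 8, 9, 2, 6]
print(f"Goodman-Kruskal gamma = {goodman_kruskal_gamma(rater_1, rater_2):.3f}")
```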
Cohen’s kappa. Cohen’s kappa (κ) is another commonly
used measure of agreement, which compares the observed
agreement to the agreement expected by chance. Kappa values
range from 1.00, when agreement is perfect, to 0.00, when
agreement is at the chance level. Kappa does not take into
account the degree of disagreement between raters as all dis-
agreements are considered to contribute equally to the total
level of disagreement. Therefore, if rating categories are
ordered, it is preferable to use a weighted version of kappa,
which assigns different weights to ratees for whom the raters
differ by i categories. Thus, different levels of disagreement
can contribute proportionally to the overall value of kappa.
Weighted kappa was used in this study.
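A weighted kappa can be obtained, for example, with scikit-learn. Because the specific weighting scheme used for the analyses reported here is not stated, the linear weighting in the sketch below is an assumption made only for illustration.

```python
from sklearn.metrics import cohen_kappa_score

rater_1 = [3, 5, 7, 9, 6, 4, 8, 9, 2, 7]  # hypothetical 0-9 proficiency codes
rater_2 = [3, 5, 8, 9, 6, 4, 8, 9, 2, 6]

kappa = cohen_kappa_score(rater_1, rater_2)                      # unweighted
kappa_w = cohen_kappa_score(rater_1, rater_2, weights="linear")  # assumed weighting
print(f"kappa = {kappa:.3f}, weighted kappa = {kappa_w:.3f}")
```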
Raw percentages of agreement. This agreement method
assesses the extent to which raters display perfect agreement.
It serves as an absolute agreement estimate of interrater con-
sistency and is calculated as the number of identical ratings
divided by the number of total rating opportunities. As some
disagreements can be expected, it is important to assess per-
centages of partial agreement as well. Thus, we estimated three
separate partial agreement percentages: (1) interrater agree-
ment within plus or minus one proficiency category (e.g.,
Novice-Low versus Novice-Mid); (2) interrater agreement
within plus or minus two proficiency categories (e.g.,
Intermediate-Low versus Intermediate-High); and, (3) inter-
rater agreement within plus or minus three proficiency cate-
gories (e.g., Advanced-Low versus Superior). In addition,
some disagreements can be viewed as more severe in terms of
language proficiency determination. For example, a partial
interrater agreement that spans a major proficiency category
boundary (e.g., first rater judges an Intermediate-High, while
second rater judges an Advanced-Low) could be a more prob-
lematic discrepancy than a partial agreement within a major
proficiency category, such as one spanning a minor proficien-
cy boundary (e.g., Intermediate-Low versus Intermediate-
Mid). To account for the specific nature of rater disagree-
ments, we calculated the overall frequencies of rater disagree-
ments that spanned one or more major proficiency categories.
Also, we examined the specific locations of these major
boundary-crossing disagreements (e.g., disagreements cross-
ing Intermediate and Advanced versus those crossing
Advanced and Superior).
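These agreement indices reduce to simple arithmetic on the absolute differences between paired ratings. The sketch below assumes the 10 categories are coded 0 through 9 (Novice-Low through Superior) and that codes 0-2, 3-5, 6-8, and 9 correspond to the Novice, Intermediate, Advanced, and Superior major levels; both codings are assumptions made for illustration.

```python
import numpy as np

# Assumed coding: 0-2 Novice, 3-5 Intermediate, 6-8 Advanced, 9 Superior
MAJOR_LEVEL = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3])

def agreement_summary(rater_1, rater_2):
    """Absolute agreement, partial agreement within k categories, and the share
    of disagreements that cross a major proficiency boundary."""
    r1, r2 = np.asarray(rater_1), np.asarray(rater_2)
    diff = np.abs(r1 - r2)
    crossed = MAJOR_LEVEL[r1] != MAJOR_LEVEL[r2]
    disagreed = diff > 0
    return {
        "absolute %": 100 * np.mean(diff == 0),
        "within 1 %": 100 * np.mean(diff <= 1),
        "within 2 %": 100 * np.mean(diff <= 2),
        "within 3 %": 100 * np.mean(diff <= 3),
        "boundary crossed % (of disagreements)":
            100 * np.mean(crossed[disagreed]) if disagreed.any() else 0.0,
    }

print(agreement_summary([3, 5, 7, 9, 6, 4], [3, 6, 8, 9, 6, 4]))
```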
Results
Research question 1: What is the overall interrater consistency
and agreement for all languages tested with the same ACTFL
Revised Guidelines and rating protocol? As shown in Table 1,
the overall interrater consistency across all rater pairs in all
included languages was significant (p < .05) for each of the test
statistics. As expected, gamma had the highest value and all
consistency measures had values greater than .90.
Table 2 displays the raw agreement percentages for the
overall language data. Eighty percent of the ratings across all
19 languages showed perfect agreement, whereas about 19%
of the ratings disagreed by one proficiency category. That is,
four-fifths of all rater pairs gave identical proficiency ratings
and nearly all rater pairs (99%) were within one proficiency
category (e.g., Novice-Low versus Novice-Mid).
Research question 2: Do interrater consistency and rater
agreement levels vary across languages that are more com-
monly tested compared to those that are less commonly test-
ed? The results in Tables 1 and 2 show that there were very
small differences in both rater consistency and rater agree-
ment levels between languages that are more commonly
tested and those that are less frequently tested.
Research question 3: Do interrater consistency and rater
agreement levels vary according to language difficulty? The
results of the consistency measures (Table 1) demonstrate that
the language difficulty classifications had practically no mod-
erating effects on the magnitude of rater consistency. Similarly,
the raw agreement percentages (Table 2) did not show any
substantial discrepancies across the four language difficulty
groups. Category III languages produced slightly higher levels
of agreement, but this difference was quite small relative to the
other three categories.
Table 1
INTERRATER CONSISTENCY ANALYSES

Data Type             N       r       R       τ       Γ       K_wt
Overall               5881    .978    .976    .940    .990    .920
Language Density
  MCT                 5389    .978    .975    .940    .991    .918
  LCT                 492     .979    .978    .941    .981    .929
Language Difficulty
  Category I          4458    .975    .971    .934    .991    .912
  Category II         216     .981    .976    .945    .994    .929
  Category III        441     .985    .983    .954    .990    .941
  Category IV         766     .978    .977    .938    .981    .920
Language
  English             725     .960    .957    .912    .984    .883
  Mandarin            241     .989    .989    .966    .997    .951
  French              626     .977    .979    .949    .949    .927
  German              216     .981    .976    .945    .994    .929
  Italian             219     .944    .938    .886    .978    .844
  Japanese            307     .981    .971    .933    .984    .924
  Russian             278     .974    .966    .922    .980    .902
  Spanish             2777    .978    .970    .934    .991    .917
  Hebrew              19      .996    .999    .993    1.00    .980
  Czech               15      .999    .999    1.00    1.00    1.00
  Arabic              140     .946    .943    .864    .940    .822
  Vietnamese          42      .999    .999    1.00    1.00    1.00
  Portuguese          111     .982    .976    .947    .994    .930
  Polish              25      .999    .999    1.00    1.00    1.00
  Albanian            8       .959    .992    .972    1.00    .889
  Hindi               47      .999    .999    1.00    1.00    1.00
  Tagalog             7       .978    .971    .946    1.00    .920
  Cantonese           13      .981    .981    .953    1.00    .904
  Korean              65      .999    .999    1.00    1.00    1.00

Note. r = Pearson correlation; R = Spearman rank-order correlation; τ = Kendall's tau; Γ = Goodman-Kruskal gamma; K_wt = Cohen's weighted kappa coefficient; MCT = more commonly tested; LCT = less commonly tested. All statistics are significant (p < .05).
Table 2
PERCENTAGES OF INTERRATER AGREEMENT

                         Agreement        Disagreement Distance
Data Type                Absolute         1 Step           2 Steps        3 Steps
Overall                  80.79 (4751)     18.59 (1093)     .58 (34)       .05 (3)
Language Density
  MCT                    80.63 (4345)     19.00 (1024)     .37 (20)       –
  LCT                    82.52 (406)      14.02 (69)       2.85 (14)      .61 (3)
Language Difficulty
  Category I             80.64 (3595)     19.09 (851)      .27 (12)       –
  Category II            82.87 (179)      16.67 (36)       .46 (1)        –
  Category III           83.90 (370)      14.97 (66)       1.13 (5)       –
  Category IV            79.24 (607)      18.28 (140)      2.09 (16)      .39 (3)
Proficiency Category
  Novice
    Low                  94.44 (51)       5.56 (3)         –              –
    Mid                  76.40 (68)       21.35 (19)       2.25 (2)       –
    High                 81.63 (120)      15.65 (23)       2.72 (4)       –
  Intermediate
    Low                  72.76 (219)      26.25 (79)       1.00 (3)       –
    Mid                  79.73 (586)      19.46 (143)      .82 (6)        –
    High                 78.27 (616)      21.35 (168)      .38 (3)        –
  Advanced
    Low                  76.49 (563)      22.83 (168)      .41 (3)        .27 (2)
    Mid                  75.93 (694)      23.41 (214)      .55 (5)        .11 (1)
    High                 75.47 (563)      23.73 (177)      .80 (6)        –
  Superior               92.71 (1271)     7.15 (98)        .15 (2)        –

Note. Sample sizes shown in parentheses; – indicates no rater pairs at that disagreement distance. (Table continues by language.)
Research question 4: Do ratings of particular languages show more consistency or greater agreement than others?
When taken collectively, the results of the consistency analyses
showed no substantially large differences across the 19 tested
languages. For instance, values of r ranged from .94 to .99. The
largest spread of any specific consistency statistic across lan-
guages was for the weighted kappa statistic (.82 to 1.0). Some
small language effects were apparent for Italian and Arabic
data, which both had slightly lower levels of rater consistency.
Importantly, caution should be used when interpreting these
results, in that several languages (e.g., Albanian) had very
small sample sizes and were presented for the sake of illustra-
tion and completeness.
Table 2 (continued)

                         Agreement        Disagreement Distance
Data Type                Absolute         1 Step           2 Steps        3 Steps
Language
  English                77.93 (565)      21.52 (156)      .55 (4)        –
  Mandarin               86.31 (208)      13.69 (33)       –              –
  French                 84.66 (530)      15.34 (96)       –              –
  German                 82.87 (179)      16.67 (36)       .46 (1)        –
  Italian                73.97 (162)      25.57 (56)       .46 (1)        –
  Japanese               79.80 (245)      19.22 (59)       .98 (3)        –
  Russian                75.54 (210)      22.66 (63)       1.80 (5)       –
  Spanish                80.88 (2246)     19.91 (525)      .22 (6)        –
  Hebrew                 94.74 (18)       5.26 (1)         –              –
  Czech                  100.0 (15)       .00 (0)          –              –
  Arabic                 56.43 (79)       32.14 (45)       9.29 (13)      2.14 (3)
  Vietnamese             100.0 (42)       –                –              –
  Portuguese             82.88 (92)       16.22 (18)       .90 (1)        –
  Polish                 100.0 (25)       –                –              –
  Albanian               87.50 (7)        12.50 (1)        –              –
  Hindi                  100.0 (47)       –                –              –
  Tagalog                85.71 (6)        14.29 (1)        –              –
  Cantonese              76.92 (10)       23.08 (3)        –              –
  Korean                 100.0 (65)       –                –              –

Note. Sample sizes shown in parentheses; Steps refer to distances between the 10 specific proficiency rating values; – indicates no rater pairs at that disagreement distance; MCT = more commonly tested; LCT = less commonly tested; proficiency categories derived from the ACTFL Proficiency Guidelines—Speaking, Revised.
The bottom half of Table 2 provides the results of the raw
agreement percentage analysis by language. The raw percent-
age for absolute agreement ranged from a low of 56% (Arabic)
to a high of 100% (Czech, Vietnamese, Polish, Hindi, and
Korean) across the 19 languages. The vast majority of lan-
guages had greater than 80% perfect rater agreement. For
those languages with less than 100% rater agreement, the
majority displayed differences of only one proficiency catego-
ry between raters. Again, some of these results should be
interpreted with caution due to small sample sizes.
Research question 5: Does rater agreement vary across pro-
ficiency categories? If so, what is the nature of disagreement?
From the results shown in Table 2, ratings of Novice-Low pro-
ficiency tended to have the highest level of absolute agree-
ment, followed by ratings of Superior proficiency (93%).
Overall, the level of absolute agreement in the Novice profi-
ciency level tended to be fairly high (94%, 76%, and 81%
across the specific Novice sublevels). Intermediate and
Advanced proficiency ratings showed very similar overall
absolute and partial agreement percentages, as well as across
their respective proficiency sublevels.
With agreement differences evident across the 10 profi-
ciency categories, the nature of the rating disagreement
became an important issue. Table 3 shows the results perti-
nent to this line of inquiry. After completing the initial analy-
ses presented in Table 2, a more specific agreement percentage
analysis was undertaken to examine the nature of disagree-
ments between raters. Out of the 5881 total rater pairs, there
were 1130 rater pairs that “disagreed.” These pairs were fur-
ther analyzed to capture the location along the ACTFL lan-
guage proficiency categories where the disagreements were
most prevalent (i.e., whether the disagreements were across
major or minor boundaries). One focus of this analysis was on
pairs of ratings where the disagreement was between ratings
from different major categories or levels, that is, rater dis-
agreements leading to incongruous categorical assignments of
the interviewees (e.g., one rater giving a “Novice” and the
other giving an “Intermediate” assignment). This is also
referred to as “crossing a major boundary.”
As shown in Table 3, approximately 41% of disagreement
cases crossed a single major proficiency level boundary. No
disagreements were associated with crossing two major profi-
ciency categories. Thus, close to three-fifths of all rater dis-
agreements were within a given major proficiency category
(i.e., the disagreements were within a single major level),
which is also referred to as “crossing a minor boundary.” Of
the 41% of disagreement cases that crossed a major level bound-
ary, the majority (48%) were between the Advanced and
Superior proficiency categories, followed by the Intermediate
and Advanced (39%) and the Novice and Intermediate (13%)
categories. These percentages matched the proportions of total
test takers that fell within these categories. That is to say, a vast
majority of interviewees were judged to be Advanced (n =
2396) or Superior (n = 1371), thus paralleling the larger
percentage of disagreements spanning these two major
proficiency categories.
Research question 6: When the first and second raters dis-
agree and a third rater must be utilized, is the third rater sig-
nificantly more likely to resolve the disagreement in favor of
one rater more often than the other? In accordance with the
ACTFL guidelines, disagreement between two raters necessi-
tates a third rater to serve as a “tie-breaker.” Table 4 shows the
results of interrater reliability analysis, using Pearson’s r, com-
paring consistency between each of the two original raters and the third (arbitrating) rater. The previous moderator variables
(e.g., language) were also included in these reliability analyses.
The results were quite consistent in showing that interrater
reliability was clearly higher for second raters paired with
third raters than it was for the first and third rater pairs. This
relationship held across the testing density, language difficulty
classifications, and specific languages as well.
To test whether or not the interrater reliability differences
across rater combinations were statistically significant, we used a modified t-test (Steiger, 1980). A modified t-test was chosen
over a traditional t-test (i.e., Hotelling’s “exact” t-test) because
it is more appropriate for “correlated correlations” (Meng,
Rosenthal, & Rubin, 1992), that is, correlations derived from
samples that are not independent (from the same population).
Across testing density and language difficulty, all interrater
reliabilities were significantly different (p < .05, one-tailed).
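The two correlations being compared share the third rater as a common variable, and one widely used formulation for that situation is Williams' t, which Steiger (1980) endorses. The sketch below illustrates the calculation using the overall values from Table 4; because the correlation between the first and second raters' scores within the disagreement cases is not reported, the value of r12 is an assumed placeholder, chosen only to keep the three correlations mutually consistent.

```python
import numpy as np
from scipy import stats

def williams_t(r13, r23, r12, n):
    """Williams' t (recommended by Steiger, 1980) for comparing two dependent
    correlations, r13 and r23, that share variable 3 (here, the third rater)."""
    det = 1 - r13**2 - r23**2 - r12**2 + 2 * r13 * r23 * r12  # |R| of the 3x3 matrix
    r_bar = (r13 + r23) / 2
    t = (r23 - r13) * np.sqrt(
        (n - 1) * (1 + r12)
        / (2 * det * (n - 1) / (n - 3) + r_bar**2 * (1 - r12) ** 3)
    )
    return t, stats.t.sf(abs(t), df=n - 3)  # one-tailed p value

# Overall values from Table 4: n = 1130, r(1,3) = .907, r(2,3) = .960.
# r12 = .85 is an assumed placeholder for the unreported rater 1-rater 2 correlation.
t, p = williams_t(r13=0.907, r23=0.960, r12=0.85, n=1130)
print(f"t({1130 - 3}) = {t:.2f}, one-tailed p = {p:.4g}")
```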
Table 3
PERCENTAGE OF DISAGREEMENT ACROSS MAJOR PROFICIENCY LEVELS

Data Type                        %
Overall
  1 Boundary                     41.50 (469)
  2 Boundaries                   0.00
Specific
  Novice–Intermediate            12.58 (59)
  Intermediate–Advanced          39.45 (185)
  Advanced–Superior              47.97 (225)

Note. Sample sizes in parentheses; overall disagreement cases = 1130 of 5881; total number of ratees in each proficiency category: Novice (290), Intermediate (1823), Advanced (2396), Superior (1371); proficiency categories derived from the ACTFL Proficiency Guidelines—Speaking, Revised 1999. The disagreements in this table refer only to those crossing a major proficiency boundary.
Within specific languages, only two interrater reliabilities were not significantly different (French and Arabic). Also shown in
Table 4 are the raw percentages of absolute agreement for the
two rater combinations. These agreement indices further
emphasize the much larger levels of interrater consistency
between the second and third rater combination as compared
to the first and third rater combination.
Discussion
To place the consistency and agreement estimates found in the
present study in perspective, two types of comparisons can be
made: (1) general comparisons to “acceptable” levels of relia-
bility derived from the educational testing and psychometric
literature; and (2) specific comparisons to reliability levels
found in previous research examining OPI raters. Regarding
the first type of comparison, Nunnally (1978) suggested that
an acceptable reliability for preliminary research is .70. Kaplan
and Saccuzzo (1997) and Nunnally and Bernstein (1994) rec-
ommended a reliability benchmark of .80 for purposes of basic
research and .90 to .95 for any applied research ventures. The
estimates of interrater consistency found in this study were all
above the recommendation for applied projects (.90).
Moreover, none of the interrater reliabilities fell at or below the
.70 level, which Murphy and Davidshofer (1994) call a “low
level” of reliability.
As for the second type of comparison, the consistency
estimates found within the present study were similar to, but
generally higher than, estimates found in previous OPI
research. For instance, using Pearson’s r, Magnan (1987)
found an interrater reliability of .94 for trainer-trainee ratings
of French speaking proficiency. Dandonoli and Henning
(1990) reported interrater reliabilities of .98 and .97 for
English and French, respectively. Finally, in a study of profi-
ciency ratings of English, French, German, and Spanish,
Thompson (1995) found interrater reliabilities that ranged
from .83 to .89. The reliability results presented herein for
these languages ranged from .96 to .98. The improved reliabil-
ity most likely results from rater training and higher levels of
experience in the cadre of raters.
Table 4
COMPARING INTERRATER RELIABILITY BY RATER COMBINATION

                                    r                             % of Absolute Agreement
Data Type              N       Raters 1 & 3   Raters 2 & 3      Raters 1 & 3   Raters 2 & 3
Overall                1130    .907           .960**            28.23          68.38
Language Density
  MCT                  1044    .902           .961**            27.97          70.69
  LCT                  86      .935           .948*             31.40          53.49
Language Difficulty
  Category I           864     .884           .954**            28.59          70.95
  Category II          36      .918           .966**            27.78          69.44
  Category III         71      .872           .939**            30.99          60.56
  Category IV          159     .926           .957**            25.16          65.78
Language
  English              160     .827           .921**            33.13          64.38
  Mandarin             33      .900           .963**            18.18          78.79
  French               96      .920           .925              46.88          53.13
  German               36      .918           .966**            27.78          69.44
  Italian              57      .832           .909**            35.09          64.91
  Japanese             62      .913           .969**            24.19          72.58
  Russian              68      .861           .935**            30.88          60.29
  Spanish              532     .892           .968**            22.93          77.07
  Arabic               61      .946           .948              31.15          47.54
  Portuguese           19      .861           .932**            36.84          63.16
  Cantonese            3       .999           .999**            0.00           100.0

Note. Raters 1 and 2 are required, while rater 3 serves as a "tie-breaker" for disagreements; r = Pearson correlation; * denotes a significant difference between correlations (p < .05, one-tailed); ** denotes a significant difference (p < .01, one-tailed); MCT = more commonly tested; LCT = less commonly tested.
Combining the present study's results with similar findings from previous research provides evidence that bolsters confidence in, and the generalizability of, the relatively high
level of interrater reliability and consistency demonstrated by
experienced ACTFL OPI interviewers. Moreover, the present
study included more languages, a larger sample of raters, and
a more comprehensive approach than previous studies.
Important as well is that this study directly addresses the
Standards for Educational and Psychological Testing (AERA,
1999) related to reporting reliability evidence to users of an
assessment. Overall, the results provide good news for those
who use the ACTFL OPI to make decisions about speaking
proficiency in the 19 languages examined in our study. We
recommend that ACTFL conduct and publish the results of
interrater consistency and agreement analyses every three to
five years to continue to meet the guidelines established by the
Standards. This becomes particularly salient for those lan-
guages, such as Tagalog and Albanian, that contained small
sample sizes. Furthermore, we openly encourage other
providers of OPI assessment to follow this suggestion as well.
Research questions 2 through 5 were included to provide
additional information related to the functioning of the
ACTFL interview protocol under the Revised Guidelines for
speaking proficiency. If interrater reliability and agreement are
not affected by these other characteristics (e.g., language diffi-
culty), then these findings provide additional evidence that
the rating scale and protocols are functioning as intended and
with reasonable precision. Question 2 addresses whether
interrater consistency remains similar when testing frequency
in a language is considered. Since the results indicate that the
levels of interrater consistency by testing density are virtually
identical across the categories (MCT and LCT), we can con-
clude that the protocol is not unduly affected by the density of
testing. If it had been, the reasons for this difference would be
a point for future investigation.
Research question 3 addresses the issue of whether the
difficulty level of the language tested (in terms of language
learning) has an impact on the reliability of the OPI ratings.
Again, the results indicate that language difficulty has no sig-
nificant impact on interrater consistency and agreement,
suggesting that this is not an issue for further investigation.
Research question 4 investigates interrater reliability
within each language. For the 19 languages in this study, the
interrater consistency and agreement results were above the
acceptable range and very consistent across the different
indices. The results for two languages, Italian and Arabic, were
slightly lower than for the majority of the languages. With the
data available to us, we were unable to empirically determine
why this might be the case. A number of different factors
could be affecting the consistency and agreement of the
ratings, including characteristics of the raters, characteristics
of the ratees, characteristics of the language or dialects, or the
interaction of these factors. Of course, it could be a case of
simply having aberrant raters who are in need of more train-
ing. We recommend that ACTFL investigate this issue by
examining the most likely factors. Investigating the function-
ing of individual raters and pairs of raters would be a good place
to start.
Question 5 addresses whether or not the interrater agree-
ment results vary across the major proficiency levels (Novice,
Intermediate, Advanced, and Superior) and the nature of the
disagreement (i.e., within or between major proficiency lev-
els). Unlike our expectations for questions 2 through 4, we
expected question 5 to demonstrate a difference between pro-
ficiency categories. Whenever subjective ratings are being
made across a continuum and rater agreement is calculated,
the highest agreement between raters should be expected for
the extreme scale points or values, as the extremities of per-
formance are generally the easiest to detect and consequently
rate. Our results demonstrate this pattern because Novice-
Low and Superior have the highest percentage of absolute
agreement. In terms of the nature of the disagreements, virtu-
ally all of the disagreements were within one scale point or
step (e.g., Novice-Low versus Novice-Mid or Novice-High
versus Intermediate-Low). The majority of the disagreements
(58.5%) were within the same major level, and the disagree-
ments that crossed a major level boundary were spread across
the three boundaries. Additionally, no disagreements spanned
two major proficiency category boundaries. Overall, the
results of research questions 2 through 5 provide additional
evidence that the ACTFL rating scale and protocols are func-
tioning as intended. This further bolsters our confidence in
the reliability of the ACTFL OPI procedure.
The results for question 6 demonstrate that when the first
and second raters disagree, the third rater has a tendency to
“break the tie” more often in the favor of the second rater. This
finding held across languages and characteristics in this study.
The findings are quite robust on this point, as is apparent in
the absolute agreement indices found in Table 4. As noted ear-
lier, second and third raters always rate from the audiotape
without having telephonic contact with the ratee, whereas the
first rater conducts the interview and then rates from the
audiotape at a later time. This could explain the results for
question 6. Several factors could be driving this effect. The
important question is whether conducting the interview as
well as rating it affects the psychometric characteristics of the
assessment, especially the validity of the assessment. However,
given the high initial agreement between the first and second
raters, there may be no impact of the differential roles of testers
(i.e., interviewing and rating as opposed to rating only) on the
overall validity at all. The findings from question 6 could be a
function of the specific raters involved in the disagreements,
not a function of the role differences. Therefore, research
should be conducted to determine if the validity is affected
and why.
In general, the ratings generated by any of the OPI
interview procedures should validly and reliably measure the
construct of interest (language speaking proficiency) and
describe proficiency regardless of the testing mode (e.g., in-
person, telephonic, or video conferencing) and whether or not
the rater conducted the interview. In other words, mode and
rater role (i.e., rater only versus interviewer and rater) should
not affect the ratings assigned to the interviewee’s proficiency
by the raters. When assessment procedures depend on human
judgments, every effort should be made to maintain rater inde-
pendence and reduce the interaction of rater–ratee character-
istics and of rater–rater characteristics that might bias or con-
taminate the ratings. “Criterion contamination” refers to the
condition in which an assessment produces scores or ratings that
measure other constructs or factors beyond the one of interest
and constitutes a major threat to validity. Additionally, criteri-
on contamination does not necessarily have an impact on reli-
ability—in other words, a process that produces a biased or
contaminated score can be reliable. Although not an issue for
the ACTFL OPI, the interaction of rater–rater characteristics
can be pertinent for procedures whereby both raters are simul-
taneously present and participate in the interview together.
This could potentially undermine the independence of the rat-
ings even when the protocol makes an effort to have the raters
separately judge the proficiency prior to discussing the inter-
view. In light of this information, research into this issue is
justified and needed.
The results of question 6 suggest a potential OPI process
change to be studied. We recommend that providers of OPI
assessments, regardless of testing modality, research and eval-
uate moving to a protocol with differentiated roles in which
there are interview specialists and rating specialists.
Differentiation of the interview and rating roles would defi-
nitely eliminate any bias that may be introduced into the mea-
surement system by the first tester conducting and rating the
interview. The interviewer would elicit the best sample of
speaking performance, and two independent raters would rate
the audio or video record of the interview. To ensure a ratable
sample, the training, evaluation, and compensation of the
interview specialists would need to be aligned with the new
process. However, the downside to this suggestion is that sep-
arating interviewer and rater roles will likely increase the cost
of the OPI. Therefore, research should be conducted to deter-
mine and evaluate the impact of the modification. We suspect
that validity and reliability would be improved because the
process of conducting the interview likely interacts with cer-
tain rater characteristics to influence (or bias) the ratings.
However, before the process change is executed, research
should determine if construct-related validity and reliability
are significantly improved through role differentiation. If they
are not significantly improved, then the increased cost is clear-
ly not justified. Additionally, if significant improvements are
found from role differentiation and the underlying mecha-
nisms affecting the ratings are discovered, the same improve-
ments might be achieved by modifying rater training without
the need for role differentiation. After data are available,
ACTFL should be able to weigh the cost effectiveness of the
options. Finally, given the high interrater reliability and agree-
ment between the first and second raters, this suggestion
should be viewed as an interesting research question and a
potential improvement, not as a necessity.
Before making our recommendations for future research,
we should acknowledge some limitations of this study. First,
some languages in our study had a small number of cases, and
results for these languages should be viewed with caution.
Second, the data did not include characteristics of raters and
ratees; therefore, we could not test for the influences of indi-
vidual difference characteristics such as race, gender, age, educa-
tion, and length and breadth of testing experience. Third, the
data did not include the testing context (i.e., employment,
academia, etc.), which would have allowed us to assess consistency and agreement within specific testing contexts. Finally, we
did not assess the functioning of individual raters and rating
combinations (i.e., pairs of raters) because it was beyond the
scope of this article.
In addition to the recommendations made throughout
this section, we recommend that ACTFL and language
researchers consider the following future research studies:
1. A meta-analysis of reliability and validity across all
types of speaking proficiency assessments (e.g., ACTFL
OPI, DLI OPI, and so forth).
2. A reliability assessment of individual raters (including
intrarater reliability) and rating combinations across all
languages tested with the ACTFL OPI.
3. A study where relevant ratee and rater characteristics
are collected to determine if their interaction affects
rating validity and reliability.
4. A study of the consistency of OPI ratings over repeat-
ed measures (with the same ratees) within a time frame
in which learning might not be expected to be a factor.
5. A study to determine whether testing mode and rater
role affect the ratings (see above).
6. A validity study capturing the policy and mental
models of raters.
7. A series of construct-related validity studies with the
Revised Guidelines for different test uses and testing
contexts.
8. A series of criterion-related validity studies (both pre-
dictive and concurrent) to determine the validity of the
ACTFL OPI in relation to relevant criteria such as job per-
formance.
9. A longitudinal study of language proficiency in acade-
mic and work contexts using methods such as latent
growth modeling.
These are only a few research recommendations. In gen-
eral, we suggest that ACTFL and other language researchers
collaborate to address issues related to proficiency measure-
ment in language learning and job-related language perfor-
mance. It is important to note that claims about any
assessment cannot be substantiated without a robust body of
empirical evidence. This evidence can only come from
well-designed research.
To conclude, we strongly reiterate that the results of our
study are very positive for users of the ACTFL OPI and sup-
port its reliability as an assessment of speaking proficiency.
Our study provides the most comprehensive investigation to
date of interrater consistency and agreement for the ACTFL
OPI, thereby extending the reliability evidence available to
test users and researchers. These results demonstrate the
importance of having an OPI assessment program that has a
well-designed interview process, well-articulated criteria for
rating, a solid rater training program, and an experienced
cadre of testers.
Based on the data reported, educators and employers who
use the ACTFL OPI can expect reliable results and use the
scores generated from the process with increased confidence.
In terms of future research, we encourage ACTFL and lan-
guage researchers to conduct validity studies to ensure that
revision of the Guidelines did not adversely affect assessment
validity and to satisfy the recommendations of the Standards
for Educational and Psychological Testing (AERA, 1999). This
is one of the most pressing research needs in measuring
language speaking proficiency.
Acknowledgments
The authors thank Ray Clifford, Helen Hamlyn, Ward
Keesling, and Elvira Swender for their assistance with and
comments related to this article.
References
Adams, M. (1978). Measuring foreign language speaking pro-
ficiency: A study of agreement among raters. In J. L. D. Clark
(ed.), Direct testing of speaking proficiency: Theory and applica-
tion (pp. 131–49). Princeton, NJ: Educational Testing Service.
American Council on the Teaching of Foreign Languages
(1986). ACTFL proficiency guidelines. Yonkers, NY: Author.
American Council on the Teaching of Foreign Languages
(2002). ACTFL oral proficiency interview tester certification
information application packet. Yonkers, NY: Author.
American Educational Research Association (1999). Standards
for educational and psychological testing. Washington, DC:
Author.
Anastasi, A. (1988). Psychological testing (6th ed.). New York:
Macmillan.
Bachman, L. F., & Palmer, A. S. (1981). The construct validi-
ty of the FSI oral interview. Language Learning, 31, 67–86.
Breiner-Sanders, K. E., Lowe, P., Miles, J., & Swender, E.
(2000). ACTFL proficiency guidelines—Speaking, revised
1999. Foreign Language Annals, 33, 13–18.
Carroll, J. B. (1967). Foreign language proficiency levels
attained by language majors near graduation from college.
Foreign Language Annals, 1, 131–51.
Cattell, R. B. (1988). The meaning and strategic use of factor
analysis. In R. B. Cattell & J. R. Nesselroade (eds.), Handbook
of multivariate experimental psychology: Perspectives on individ-
ual differences, 2nd ed. (pp. 131–203). New York: Plenum
Press.
Clark, J. D. L. (1986). A study of the comparability of speaking
proficiency interview ratings across three government language
training agencies. Washington, DC: Center for Applied
Linguistics.
Dandonoli, P., & Henning, G. (1990). An investigation of the
construct validity of the ACTFL Oral Proficiency Guidelines
and Oral Interview Procedure. Foreign Language Annals, 23,
11–22.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn
(ed.), Educational measurement, 3rd ed. (pp. 105–46).
Washington, DC: American Council on Education.
Flanagan, J. C. (1951). Units, scores, and norms. In E. F.
Lindquist (ed.), Educational measurement (pp. 695–763).
Washington, DC: American Council on Education.
Jackson, G. L. (1999). Oral proficiency testing modality study
(DLIFLC Research Report No. 99-01). Monterey, CA: Defense
Language Institute Foreign Language Center.
Kaplan, R. W., & Saccuzzo, D. P. (1997). Psychological testing:
Principles, applications, and issues. 4th ed. Belmont, CA:
Brooks and Cole.
Magnan, S. S. (1986). Assessing speaking proficiency in the
undergraduate curriculum: Data from French. Foreign
Language Annals, 19, 429–38.
Magnan, S. S. (1987). Rater reliability of the ACTFL Oral
Proficiency Interview. The Canadian Modern Language Review,
43, 267–76.
Meng, X. L., Rosenthal, R., & Rubin, D. B. (1992). Comparing
correlated correlation coefficients. Psychological Bulletin, 111,
172–75.
Murphy, K. R., & Davidshofer, C. O. (1994). Psychological
testing: Principles and applications. 3rd ed. Englewood Cliffs,
NJ: Prentice-Hall.
Nunnally, J. C. (1978). Psychometric theory, 2nd ed. New York,
NY: McGraw Hill Book Company.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory,
3rd ed. New York, NY: McGraw Hill Book Company.
Stanley, J. C. (1971). Reliability. In R. L. Thorndike (ed.),
Educational measurement, 2nd ed. (pp. 356–442). Washington,
DC: American Council on Education.
Stansfield, C. W., & Kenyon, D. M. (1992). Research on the
comparability of the Oral Proficiency Interview and the
Simulated Oral Proficiency Interview. System, 20, 347–64.
Steiger, J. H. (1980). Tests for comparing elements of a correla-
tion matrix. Psychological Bulletin, 87, 245–51.
Swender, E. (ed.) (1999). ACTFL oral proficiency interview
tester training manual. Yonkers, NY: ACTFL.
Thompson, I. (1995). A study of interrater reliability of the
ACTFL Oral Proficiency Interview in five European lan-
guages: Data from English, French, German, Russian, and
Spanish. Foreign Language Annals, 28, 407–22.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (ed.),
Educational measurement (pp. 560–620). Washington, DC:
American Council on Education.