Two-sample Testing Using Deep Learning
Matthias Kirchler^{1,2}, Shahryar Khorasani^{1}, Marius Kloft^{2,3}, Christoph Lippert^{1,4}

^1 Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany
^2 Technical University of Kaiserslautern, Germany
^3 University of Southern California, Los Angeles, United States
^4 Hasso Plattner Institute for Digital Health at Mount Sinai, New York, United States
Abstract
We propose a two-sample testing procedure based on learned deep neural network representations. To this end, we define two test statistics that perform an asymptotic location test on data samples mapped onto a hidden layer. The tests are consistent and asymptotically control the type-1 error rate. Their test statistics can be evaluated in linear time (in the sample size). Suitable data representations are obtained in a data-driven way, by solving a supervised or unsupervised transfer-learning task on an auxiliary (potentially distinct) data set. If no auxiliary data is available, we split the data into two chunks: one for learning representations and one for computing the test statistic. In experiments on audio samples, natural images and three-dimensional neuroimaging data our tests yield significant decreases in type-2 error rate (up to 35 percentage points) compared to state-of-the-art two-sample tests such as kernel methods and classifier two-sample tests.
1 INTRODUCTION
For almost a century, statistical hypothesis testing
has been one of the main methodologies in statistical
inference (Neyman and Pearson, 1933). A classic prob-
lem is to validate whether two sets of observations are
drawn from the same distribution (null hypothesis) or
not (alternative hypothesis). This procedure is called a two-sample test.
We provide code at https://github.com/mkirchler/deep-2-sample-test.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).
Two-sample tests are a pillar of applied statistics and
a standard method for analyzing empirical data in the
sciences, e.g., medicine, biology, psychology, and so-
cial sciences. In machine learning, two-sample tests
have been used to evaluate generative adversarial net-
works (Bińkowski et al., 2018), to test for covariate
shift in data (Zhou et al., 2016), and to infer causal
relationships (Lopez-Paz and Oquab, 2016).
There are two main types of two-sample tests: paramet-
ric and non-parametric ones. Parametric two-sample
tests, such as the Student's t-test, make strong assumptions on the distribution of the data (e.g., Gaussian).
This allows us to compute p-values in closed form. How-
ever, parametric tests may fail when their assumptions
on the data distribution are invalid. Non-parametric
tests, on the other hand, make no distributional as-
sumptions and thus could potentially be applied in a
wider range of application scenarios. Computing non-
parametric test statistics, however, can be costly as it
may require applying re-sampling schemes or comput-
ing higher-order statistics.
A non-parametric test that gained a lot of attention
in the machine-learning community is the kernel two-
sample test and its test statistic: the maximum mean
discrepancy (MMD). MMD computes the average dis-
tance of the two samples mapped into the reproducing
kernel Hilbert space (RKHS) of a universal kernel (e.g.,
Gaussian kernel). MMD critically relies on the choice
of the feature representation (i.e., the kernel function)
and thus might fail for complex, structured data such
as sequences or images, and other data where deep
learning excels.
Another non-parametric two-sample test is the classifier
two-sample test (C2ST). C2ST splits the data into two
chunks, training a classifier on one part and evaluating
it on the remaining data. If the classifier predicts
significantly better than chance, the test rejects the
null hypothesis. Since part of the data set needs to be set aside for training, only the remainder is available for computing the test statistic, which limits the power of the method. Furthermore, the performance of the method depends on the choice of the train-test split.
In this work, we propose a two-sample testing proce-
dure that uses deep learning to obtain a suitable data
representation. It first maps the data onto a hidden-
layer of a deep neural network that was trained (in an
unsupervised or supervised fashion) on an independent,
auxiliary data set, and then it performs a location test.
Thus we are able to work on any kind of data that neu-
ral networks can work on, such as audio, images, videos,
time-series, graphs, and natural language. We propose
two test statistics that can be evaluated in linear time
(in the number of observations), based on MMD and
Fisher discriminant analysis, respectively. We derive
asymptotic distributions of both test statistics. Our
theoretical analysis proves that the two-sample test
procedure asymptotically controls the type-1 error rate,
has asymptotically vanishing type-2 error rate and is
robust both with respect to transfer learning and ap-
proximate training.
We empirically evaluate the proposed methodology in
a variety of applications from the domains of computa-
tional musicology, computer vision, and neuroimaging.
In these experiments, the proposed deep two-sample
tests consistently outperform the closest competing
method (including deep kernel methods and C2STs) by
up to 35 percentage points in terms of the type-2 error
rate, while properly controlling the type-1 error rate.
2 PROBLEM STATEMENT &
NOTATION
We consider non-parametric two-sample statistical test-
ing, that is, to answer the question whether two samples
are drawn from the same (unknown) distribution or not.
We distinguish between the case that the two samples are drawn from the same distribution (the null hypothesis, denoted by $H_0$) and the case that the samples are drawn from different distributions (the alternative hypothesis $H_1$).
We differentiate between type-1 errors (i.e., rejecting the null hypothesis although it holds) and type-2 errors (i.e., not rejecting $H_0$ although it does not hold). We strive both for the type-1 error rate to be upper bounded by some significance level $\alpha$ and for the type-2 error rate to converge to 0 for unlimited data. The latter property is called consistency and means that with sufficient data, the test can reliably distinguish between any pair of probability distributions.
Let $p$, $q$, $p'$, and $q'$ be probability distributions on $\mathbb{R}^d$ with common dominating Borel measure $\mu$. We abuse notation somewhat and denote the densities with respect to $\mu$ also by $p$, $q$, $p'$, and $q'$. We want to perform a two-sample test on data drawn from $p$ and $q$, i.e., we test the null hypothesis $H_0: p = q$ against the alternative $H_1: p \neq q$. The distributions $p'$ and $q'$ are assumed to be in some sense similar to $p$ and $q$, respectively, and act as an auxiliary task for tuning the test (the case of $p = p'$ and $q = q'$ is perfectly valid, in which case this is equivalent to a data splitting technique).
We have access to four (independent) sets $X_n$, $Y_m$, $X'_{n'}$, and $Y'_{m'}$ of observations drawn from $p$, $q$, $p'$, and $q'$, respectively. Here $X_n = \{X_1, \ldots, X_n\} \subseteq \mathbb{R}^d$ and $X_i \sim p$ for all $i$ (analogous definitions hold for $Y_m$, $X'_{n'}$, and $Y'_{m'}$). Empirical averages with respect to a function $f$ are denoted by $f(X_n) := \frac{1}{n}\sum_{i=1}^{n} f(X_i)$.
We investigate function classes of deep ReLU networks with a final tanh activation function:
$$\mathcal{TF}_N := \Big\{ \tanh \circ\, W_{D-1} \circ \sigma \circ \ldots \circ \sigma \circ W_1 : \mathbb{R}^d \to \mathbb{R}^H \;\Big|\; W_1 \in \mathbb{R}^{H \times d},\; W_j \in \mathbb{R}^{H \times H} \text{ for } j = 2, \ldots, D-1,\; \prod_{j=1}^{D-1} \|W_j\|_{Fro} \leq \beta_N,\; D \leq D_N \Big\}.$$
Here, the activation functions $\tanh$ and $\sigma(z) := \mathrm{ReLU}(z) = \max(0, z)$ are applied elementwise, $\|\cdot\|_{Fro}$ is the Frobenius norm, $H = d + 1$ is the width, and $D_N$ and $\beta_N$ are depth and weight restrictions on the networks. This can be understood as the mapping onto the last hidden layer of a neural network concatenated with a tanh activation.
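For concreteness, the following is a minimal sketch of such a feature map as a PyTorch module; the class name and the Frobenius-norm helper are our own illustration of the definition above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TFNetwork(nn.Module):
    """Deep ReLU network mapping R^d -> R^H with a final tanh,
    i.e. a feature map phi in the class TF_N defined above."""

    def __init__(self, d, depth):
        super().__init__()
        H = d + 1                      # width H = d + 1 as in the definition
        dims = [d] + [H] * (depth - 1)
        # D - 1 weight matrices W_1, ..., W_{D-1}, no bias terms
        self.linears = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1], bias=False) for i in range(depth - 1)]
        )

    def forward(self, x):
        for layer in self.linears[:-1]:
            x = torch.relu(layer(x))   # sigma = ReLU between layers
        return torch.tanh(self.linears[-1](x))

    def frobenius_product(self):
        # the quantity that beta_N constrains in the definition of TF_N
        norms = [layer.weight.norm(p='fro') for layer in self.linears]
        return torch.stack(norms).prod()
```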
3 DEEP TWO-SAMPLE TESTING
In this section, we propose two-sample testing based on
two novel test statistics, the
Deep Maximum Mean
Discrepancy (DMMD)
and the
Deep Fisher Dis-
criminant Analysis (DFDA)
. The test asymptoti-
cally controls the type-1 error rate, and it is consistent
(i.e., the type-2 error rate converges to 0). Further-
more, we will show that consistency is preserved both under transfer learning on a related task and under only approximately solving the training step.
3.1 Proposed Two-sample Test
Our proposed test consists of the following two steps.
1. We train a neural network over an auxiliary training
data set. 2. We then evaluate the maximum mean
discrepancy test statistic (Gretton et al., 2012a) (or
a variant of it) using as kernel the mapping from the
input domain onto the network’s last hidden layer.
3.1.1 Training Step
Let the training data be $X'_{n'}$ and $Y'_{m'}$, and denote $N = n' + m'$. We run a (potentially inexact) training algorithm to find $\phi_N \in \mathcal{TF}_N$ with
$$\frac{1}{N}\left\| \sum_{i=1}^{n'} \phi_N(X'_i) - \sum_{i=1}^{m'} \phi_N(Y'_i) \right\| + \eta \;\geq\; \max_{\phi \in \mathcal{TF}_N} \frac{1}{N}\left\| \sum_{i=1}^{n'} \phi(X'_i) - \sum_{i=1}^{m'} \phi(Y'_i) \right\|.$$
Here, $\eta \geq 0$ is a fixed leniency parameter (independent of $N$); finding true global optima in neural networks is a hard problem, and an $\eta > 0$ allows us to settle for good-enough, local solutions. This procedure is also related to the early-stopping regularization technique, which is commonly used in training deep neural networks (Prechelt, 1998).
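A schematic sketch of this (inexact) training step, assuming the TFNetwork module from the sketch above and plain gradient ascent on the mean-discrepancy objective, is given below; the optimizer, learning rate, and epoch budget are illustrative, and any procedure whose final objective is within $\eta$ of the maximum satisfies the requirement (in the experiments of Section 5 the feature maps are instead obtained from classification or autoencoding transfer tasks).

```python
import torch

def train_feature_map(phi, X_train, Y_train, epochs=100, lr=1e-3):
    """Inexactly maximize (1/N) * || sum_i phi(X'_i) - sum_i phi(Y'_i) ||.

    X_train, Y_train: float tensors of shape (n', d) and (m', d) drawn
    from the transfer distributions p' and q'.
    """
    N = len(X_train) + len(Y_train)
    opt = torch.optim.Adam(phi.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        diff = phi(X_train).sum(dim=0) - phi(Y_train).sum(dim=0)
        objective = diff.norm() / N
        (-objective).backward()   # gradient ascent on the objective
        opt.step()
    # stopping after a fixed budget plays the role of the leniency
    # parameter eta: we settle for a good-enough, local solution
    return phi
```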
3.1.2 Test Statistic
We define the mean distance of the two test populations $X_n$, $Y_m$, measured on the hidden layer of a network $\phi$, as
$$D_{n,m}(\phi) := \phi(X_n) - \phi(Y_m).$$
Using $\phi_N$ from the training step, we define the Deep Maximum Mean Discrepancy (DMMD) test statistic as
$$S_{n,m}(\phi_N, X_n, Y_m) := \frac{nm}{n+m}\, \|D_{n,m}(\phi_N)\|^2.$$
We can normalize this test statistic by the (inverse) empirical covariance matrix:
$$T_{n,m}(\phi_N, X_n, Y_m) := \frac{nm}{n+m}\, D_{n,m}(\phi_N)^\top \hat\Sigma_{n,m}^{-1} D_{n,m}(\phi_N).$$
This leads to a test statistic (which we call Deep Fisher Discriminant Analysis, DFDA) with an asymptotic distribution that is easier to evaluate. Note that the empirical covariance matrix is defined as
$$\hat\Sigma_{n,m} := \hat\Sigma_{n,m}(\phi_N) := \frac{1}{n+m-1}\sum_{i=1}^{m+n} \big(\phi_N(Z_i) - \phi_N(Z)\big)\big(\phi_N(Z_i) - \phi_N(Z)\big)^\top + \rho_{n,m} I,$$
where $\rho_{n,m} > 0$ is a factor guaranteeing numerical stability and invertibility of the covariance matrix, and $Z = \{Z_1, \ldots, Z_{m+n}\} = \{X_1, \ldots, X_n, Y_1, \ldots, Y_m\}$.
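Once the features have been extracted, both statistics are cheap to compute. A NumPy sketch following the definitions above (function and argument names are ours) could look as follows.

```python
import numpy as np

def dmmd_dfda_statistics(feat_x, feat_y, rho=1e-4):
    """Compute S_{n,m} (DMMD) and T_{n,m} (DFDA) from hidden-layer features.

    feat_x: array of shape (n, H), rows phi_N(X_i)
    feat_y: array of shape (m, H), rows phi_N(Y_j)
    rho:    regularizer rho_{n,m} added to the covariance matrix
    """
    n, m = len(feat_x), len(feat_y)
    d_nm = feat_x.mean(axis=0) - feat_y.mean(axis=0)           # D_{n,m}(phi_N)
    scale = n * m / (n + m)
    s_stat = scale * np.dot(d_nm, d_nm)                        # S_{n,m}

    z = np.concatenate([feat_x, feat_y], axis=0)
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n + m - 1) + rho * np.eye(z.shape[1])   # Sigma_hat + rho*I
    t_stat = scale * d_nm @ np.linalg.solve(cov, d_nm)         # T_{n,m}
    return s_stat, t_stat
```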
3.1.3 Discussion
Intuitively, we map the data onto the last hidden layer
of the neural network and perform a multivariate loca-
tion test on whether both map to the same location.
If the distance $D_{n,m}$ between the two means is too large, we reject the hypothesis that both samples are
drawn from the same distribution. Consistency of this
procedure is guaranteed by the training step.
Interpretation as Empirical Risk Minimization
If we identify $X'_i$ with $(Z'_i, 1)$ and $Y'_i$ with $(Z'_{n'+i}, -1)$ in a regression setting, this is equivalent to an (inexact) empirical risk minimization with loss function $L(t, \hat t) = 1 - t\hat t$:
$$\max_{\phi} \frac{1}{N}\left\| \sum_{i=1}^{N} t'_i\, \phi(Z'_i) \right\| = \max_{\phi} \max_{\|w\|\leq 1} \frac{1}{N} \sum_{i=1}^{N} t'_i\, w^\top \phi(Z'_i),$$
where the inner maximum over $w$ is attained at the unit vector in the direction of $\frac{1}{N}\sum_{i=1}^{N} t'_i\, \phi(Z'_i)$. This is equivalent to
$$\min_{\phi} \min_{\|w\|\leq 1} R'_N(w^\top \phi) := \frac{1}{N}\sum_{i=1}^{N} L\big(t'_i, w^\top \phi(Z'_i)\big), \qquad (1)$$
where we denote by $R'_N$ the empirical risk; the corresponding expected risk is $R'(f) = \mathbb{E}[1 - t' f(Z')]$.
Assuming that $\Pr(t' = 1) = \Pr(t' = -1) = \frac{1}{2}$, we have for the Bayes risk $R'^{*} = \inf_{f:\mathbb{R}^d \to [-1,1]} R'(f) = 1 - \epsilon'$ with $\epsilon' > 0$ if and only if $p' \neq q'$. As long as $p'$ and $q'$ are selected close enough to $p$ and $q$, respectively, the corresponding test will be able to distinguish between the two distributions.
Since we discard
w
after optimization and use the
norm of the hidden layer on the test set again, this
implies some fine-tuning on the test data, without
compromising the test statistic (see Theorem 3.1 below).
This property is especially helpful in neural networks,
since in practical transfer learning, fine-tuning only the last layer can be extremely efficient, even if the transfer and actual tasks are relatively different (Lu
et al., 2015).
Relation to kernel-based tests
The test statistic $S_{n,m}$ is a special case of the standard squared Maximum Mean Discrepancy (Gretton et al., 2012b) with the kernel $k(z_1, z_2) := \langle \phi(z_1), \phi(z_2) \rangle$ (analogously for $T_{n,m}$ and the Kernel FDA test (Harchaoui et al., 2008)). For a fixed feature map $\phi$ this kernel is not characteristic, and hence the resulting test is not necessarily consistent for arbitrary distributions $p, q$. However, by first choosing $\phi$ in a data-dependent way, we can still achieve consistency.
3.2 Control of Type-1 Error
Due to our choice of $\phi_N$, there need not be a unique, well-defined limiting distribution for the test statistics as $n, m \to \infty$. Instead, we will show that for each fixed $\phi$, the test statistic $S_{n,m}$ has a well-defined limiting distribution that can be evaluated efficiently. If in addition the covariance matrix is invertible, then the same holds for $T_{n,m}$.

In particular, the following theorem will show that $D_{n,m}(\phi)$ converges towards a multivariate normal distribution as $n, m \to \infty$. $S_{n,m}$ is then asymptotically distributed like a weighted sum of $\chi^2$ variables, and $T_{n,m}$ like a $\chi^2_H$ variable (again, if well-defined).
Theorem 3.1. Let $p = q$, $\phi \in \mathcal{TF}$, and $\Sigma := \mathrm{Cov}(\phi(X_1))$, and assume that $\frac{n}{n+m} \to r \in (0, 1)$ as $n, m \to \infty$.

(i) As $n, m \to \infty$, it holds that
$$\sqrt{\frac{mn}{m+n}}\, D_{n,m}(\phi) \xrightarrow{d} \mathcal{N}(0, \Sigma).$$

(ii) As $n, m \to \infty$,
$$S_{n,m}(\phi, X_n, Y_m) \xrightarrow{d} \sum_{i=1}^{H} \lambda_i \xi_i^2,$$
where $\xi_i \overset{iid}{\sim} \mathcal{N}(0, 1)$ and the $\lambda_i$ are the eigenvalues of $\Sigma$.

(iii) If additionally $\Sigma$ is invertible and $\rho_{n,m} \to 0$, then as $n, m \to \infty$
$$T_{n,m}(\phi, X_n, Y_m) \xrightarrow{d} \chi^2_H.$$
Sketch of proof (full proof in Appendix A.1). (i) As under $H_0$ the $\phi(X_i)$ and $\phi(Y_j)$ are identically distributed, $D_{n,m}(\phi)$ is centered, and one can show the result using a Central Limit Theorem. (ii) and (iii) then follow from the continuous mapping theorem and properties of the multivariate normal distribution.
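As an illustration of part (ii), the asymptotic null distribution of $S_{n,m}$ can be approximated by plugging the eigenvalues of an empirical covariance estimate into the weighted sum of squared standard normals; the following NumPy sketch (ours, not part of the paper's implementation) does exactly that.

```python
import numpy as np

def simulate_dmmd_null(feat_z, num_draws=10000, seed=0):
    """Monte Carlo approximation of sum_i lambda_i * xi_i^2 from Theorem 3.1 (ii).

    feat_z: array of shape (n + m, H) with rows phi(Z_i); under H_0 all rows
    are identically distributed, so their sample covariance estimates Sigma.
    """
    rng = np.random.default_rng(seed)
    cov = np.cov(feat_z, rowvar=False)
    lam = np.linalg.eigvalsh(cov)                  # eigenvalues lambda_i of Sigma
    xi = rng.standard_normal((num_draws, len(lam)))
    return (lam * xi ** 2).sum(axis=1)             # draws from the limiting law

# p-value: fraction of simulated null draws exceeding the observed statistic,
# e.g. p = (simulate_dmmd_null(features) >= s_stat).mean()
```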
Under some additional assumptions we can also use a Berry-Esseen type of result to quantify the quality of the normal approximation of $D_{n,m}(\phi_N)$ conditioned on the training. In particular, if we assume that $n = m$ and that $\Sigma = \mathrm{Cov}_{p,q}(\phi_N(X_1)) \,|\, X'_n, Y'_n$ is invertible, then Bentkus (2005) shows that the normal approximation on convex sets is $O\big(H^{1/4}/\sqrt{n}\big)$. Computing p-values for both $S_{n,n}$ and $T_{n,n}$ only requires computation over convex sets, so the result is directly applicable.
3.2.1 Computational Aspects
Testing with $S_{n,m}$
As shown in Theorem 3.1, the null distribution of $S_{n,m}$ can be approximated as a weighted sum of independent $\chi^2$ variables. There are several approaches to computing the cumulative distribution function of this distribution; see Bausch (2013) for an overview and Zhou and Guan (2018) for an implementation. However, computing p-values with this method can be rather costly.

Alternatively, note that the test statistic $S_{n,m}$ is linear in the number of observations and dimensions. Hence, estimating the null distribution via Monte Carlo permutation sampling (Ernst et al., 2004) is feasible. Note also that it suffices to evaluate the feature map $\phi$ on each data point only once and then permute the class labels, which saves additional time.
In practice we found that the resampling-based test
performed considerably faster. Hence, in the remainder
of this work, we will evaluate the null hypothesis of the
DMMD via the resampling method.
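A sketch of this resampling procedure (reusing the hypothetical dmmd_dfda_statistics helper from Section 3.1.2) makes explicit that the feature map is evaluated only once per data point and only the labels are permuted.

```python
import numpy as np

def dmmd_permutation_pvalue(feat_x, feat_y, num_perms=1000, seed=0):
    """Permutation p-value for S_{n,m} over pre-computed features."""
    rng = np.random.default_rng(seed)
    n = len(feat_x)
    z = np.concatenate([feat_x, feat_y], axis=0)
    s_obs, _ = dmmd_dfda_statistics(feat_x, feat_y)
    count = 0
    for _ in range(num_perms):
        perm = rng.permutation(len(z))             # permute the class labels only
        s_perm, _ = dmmd_dfda_statistics(z[perm[:n]], z[perm[n:]])
        count += s_perm >= s_obs
    return (count + 1) / (num_perms + 1)           # finite-sample corrected p-value
```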
Testing with $T_{n,m}$
Since in many practical situations one wants to use standard neural network architectures (such as ResNets), the number of neurons in the last hidden layer $H$ may be rather large compared to $n, m$. Therefore, using the full, high-dimensional hidden layer representation might lead to suboptimal normal approximations. Instead, we propose to use a principal component analysis on the feature representation $(\phi(Z_i))_{i=1}^{n+m}$ to reduce the dimensionality to $\hat H \ll m + n$. In fact, this does not break the asymptotic theory derived in Theorem 3.1, even though the PCA is both trained and evaluated on the test data; details can be found in Appendix C. Unfortunately, the $O\big(H^{1/4}/\sqrt{n}\big)$ rate of convergence is not valid anymore, due to the observations not being independent. We still need to grow $\hat H$ towards $H$ with $n, m$ in order for the consistency results in the next section to hold, however. Empirically we found $\hat H = \min\big(\sqrt{\tfrac{n+m}{2}}, H\big)$ to perform well.
The cumulative distribution function of the $\chi^2_H$ distri-
bution can be evaluated very efficiently. Although for
the DFDA it is also possible to estimate the null hy-
pothesis via a Monte Carlo permutation scheme, doing
so is more costly than for the DMMD, since it involves
either a matrix inversion once or solving a linear system
for each permutation draw. Hence, in this work we
focus on using the asymptotic distribution.
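Putting the pieces together for the DFDA, the following sketch (ours; NumPy/SciPy) fits the PCA on the pooled test features, computes $T_{n,m}$ in the reduced space, and reads the p-value off the $\chi^2$ distribution with $\hat H$ degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def dfda_test(feat_x, feat_y, rho=1e-4):
    """DFDA p-value: PCA to H_hat = min(sqrt((n+m)/2), H) dims, then chi^2."""
    n, m = len(feat_x), len(feat_y)
    z = np.concatenate([feat_x, feat_y], axis=0)
    H_hat = int(min(np.sqrt((n + m) / 2), z.shape[1]))

    # PCA fitted on the pooled (test) features, as described above
    zc = z - z.mean(axis=0)
    _, _, vt = np.linalg.svd(zc, full_matrices=False)
    proj = vt[:H_hat].T                            # top H_hat principal axes
    px, py = feat_x @ proj, feat_y @ proj

    d_nm = px.mean(axis=0) - py.mean(axis=0)
    pz = np.concatenate([px, py], axis=0)
    pzc = pz - pz.mean(axis=0)
    cov = pzc.T @ pzc / (n + m - 1) + rho * np.eye(H_hat)
    t_stat = (n * m / (n + m)) * d_nm @ np.linalg.solve(cov, d_nm)
    return t_stat, chi2.sf(t_stat, df=H_hat)       # asymptotic p-value
```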
3.3 Consistency
In this section we show that if (a) the restrictions $\beta_N, D_N$ on weights and depth of networks in $\mathcal{TF}_N$ are carefully chosen, (b) the transfer task is not too far from the original task, and (c) the leniency parameter $\eta$ in the training step is small enough, then our proposed test is consistent, meaning the type-2 error rate converges to 0.
Theorem 3.2. Let $p \neq q$, $n = n'$, $m = m'$ with $\frac{n}{m} \to 1$, $N = n + m$, and let $R'^{*} = 1 - \epsilon'$ be the Bayes error for the transfer task with $\epsilon' > 0$, and assume that the following holds:

(i) $\frac{\beta_N^2 D_N}{N} \to 0$, $\beta_N \to \infty$ and $D_N \to \infty$ for $N \to \infty$ for the parameters of the function classes $\mathcal{TF}_N$,

(ii) $\|p - p'\|_{L^1(\mu)} + \|q - q'\|_{L^1(\mu)} \leq 2\delta$,

(iii) $\delta + \eta < \epsilon'$, where $\eta \geq 0$ is the leniency parameter in training the network, and

(iv) $p'$ and $q'$ have bounded support on $\mathbb{R}^d$.

Then, as $N \to \infty$, both test statistics $S_{n,m}(\phi_N, X_n, Y_m)$ and $T_{n,m}(\phi_N, X_n, Y_m)$ diverge in probability towards infinity, i.e., for any $r > 0$,
$$\Pr\big(S(\phi_N, X_n, Y_m) > r\big) \to 1 \quad \text{and} \quad \Pr\big(T(\phi_N, X_n, Y_m) > r\big) \to 1.$$
Sketch of proof (full proof in Appendix A.2). The test statistic $S_{n,m}$ is lower-bounded by a rescaled version of $N(1 - R_{n,m}(\psi_N))$, where $\psi_N = w_N^\top \phi_N$ with $w_N$ selected as in (1). Then, if $1 - R_{n,m}(\psi_N) \geq c > 0$, the test statistic diverges.

The finite-sample error $R_{n,m}(\psi_N)$ approaches its population version $R(\psi_N)$ for large $n, m$, and the difference between $R(\psi_N)$ and $R'(\psi_N)$ can be controlled via $\delta$. The rest of the proof is akin to standard consistency proofs in regression and classification. Namely, we can split $R'_N(\psi_N) - R'^{*}$ into approximation and estimation error and control these via a Universal Approximation Theorem (Hanin, 2017) and Rademacher complexity bounds on the neural network function class (Golowich et al., 2017), respectively.
The main caveat of Theorem 3.2 is that it gives no explicit directions for choosing the transfer task $p'$ and $q'$. Whether the respective $\mu$-densities are $L^1$-close to the testing densities cannot be answered in general, and similarly the Bayes error rate $1 - \epsilon'$ is not known beforehand. If abundant data for the testing task is at hand, then splitting the data is the safe way to go; if data is scarce, Theorem 3.2 gives justification that a reasonably close transfer task will have good power as well.
The bounded support requirement (iv) on $p'$ and $q'$ can be circumvented as well: by choosing the support large enough, one can always truncate $(X'_i)$ and $(Y'_i)$ and still satisfy requirements (ii) and (iii), in particular also in the case of $p' = p$ and $q' = q$ with unbounded support. This procedure, however, requires knowledge of where to truncate the transfer distributions. Instead, one can also grow the support of $p'$ and $q'$ with $N$; a similar theorem then holds for the case of unbounded support (see Appendix B).
4 RELATED WORK
In this section, we give an overview over the state-
of-the-art in non-parametric two-sample testing for
high-dimensional data.
Kernel Methods
The methods most related to our
method are the kernelized maximum mean discrepancy
(MMD) (Gretton et al., 2012a) and the kernel Fisher
discriminant analysis (KFDA) (Harchaoui et al., 2008).
Both methods effectively metricize the space of prob-
ability distributions by mapping distribution features
onto mean embeddings in universal reproducing kernel
Hilbert spaces (RKHS, (Steinwart and Christmann,
2008)). Test statistics derived from these mean embed-
dings can be efficiently evaluated using the kernel trick
(in quadratic time in the number of observations, al-
though there are lower-powered linear-time variations).
Mean Embeddings (ME) and Smoothed Characteristic
Functions (SCF) (Chwialkowski et al., 2015; Jitkrittum
et al., 2016) are kernel-based linear-time test statistics
that are (almost surely) proper metrics on the space
of probability distributions. All four methods rely on
characteristic kernels to yield consistent tests and are
closely related.
Deep Kernel Methods
In the context of train-
ing and evaluating Generative Adversarial Networks
(GANs), several authors have investigated the use of
the MMD with kernels parametrized by deep neural
networks. In Bińkowski et al. (2018); Li et al. (2017);
Arbel et al. (2018), the authors feed features extracted
from deep neural networks into characteristic kernels.
Jitkrittum et al. (2018) use deep kernels in the con-
text of relative goodness-of-fit testing without directly
considering consistency aspects of this approach. Ex-
tensions from the GAN literature to two-sample testing
are not straightforward, since statistical consistency guar-
antees strongly depend on careful selection of the re-
spective function classes. To the best of our knowledge,
all previous works made simplifying assumptions on
injectivity or even invertibility of the involved networks.
In this work we show that a linear kernel on top of
transfer-learned neural network feature maps (as has
also been done by Xu et al. (2018) for GAN evaluation)
is not only sufficient for consistency of the test, but also
performs considerably better empirically in all settings
we analyzed. In addition to that, our test statistics can
be directly evaluated in linear instead of quadratic time
(in the sample size) and the corresponding asymptotic
null distributions can be exactly computed (in contrast
to the MMD & KFDA).
Classifier Two-Sample Tests (C2ST)
First pro-
posed by Friedman (2003) and then further analyzed by
Kim et al. (2016) and Lopez-Paz and Oquab (2016), the
idea of the C2ST is to utilize a generic classifier, such
as a neural network or a
k
-nearest neighbor approach
for the two-sample testing problem. In particular, they
split the available data into training and test set, train
a classifier on the training set and evaluate whether the
performance on the test set exceeds random variation.
The main drawback of this approach is that the data
has to be split in two chunks, creating a trade-off: if
the training set is too small, the classifier is unlikely
to find a statistically relevant signal in the data; if the
training set is large and thus the test set small, the
C2ST test loses power.
Our method circumvents the need to split the data into training and test sets: Theorem 3.2 shows that training on a reasonably close transfer data set is sufficient. Even more, as shown in Section 3.1.3, our method can be interpreted as empirical risk minimization with additional fine-tuning of the last layer on the testing data, guaranteed to be at least as good as an equivalent method with a fixed last layer.
5 EXPERIMENTS
In this section, we compare our proposed deep learning
two-sample tests with other state-of-the-art approaches.
5.1 Experimental setup
For the DFDA and DMMD tests we train a deep neural network on a related task; details will be deferred to the corresponding sections. We report both the performance of the deep MMD $S_{n,m}$, for which we estimate the null hypothesis via a Monte Carlo permutation sample (Ernst et al., 2004) (we fix $M = 1000$ resampling permutations except where otherwise noted), and the deep FDA statistic $T_{n,m}$, for which we use the asymptotic $\chi^2_H$ distribution. As explained in Section 3.2.1, for the DFDA we project the last hidden layer onto $\hat H < H$ dimensions using a PCA. We found the heuristic $\hat H := \sqrt{\tfrac{m+n}{2}}$ to perform well across a number of tasks (disjoint from the ones presented in this section). For the DMMD we do not need any dimensionality reduction. We calibrated parameters of both tests on data disjoint from the data for which we report results in the subsequent sections.
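As an illustration of how the deep tests are run with an off-the-shelf pretrained network, the sketch below (ours; the batch tensors are placeholders) extracts penultimate-layer ResNet-152 features with torchvision and feeds them into the permutation test sketched in Section 3.2.1.

```python
import torch
import torchvision.models as models

def extract_resnet_features(images):
    """Map preprocessed images of shape (B, 3, 224, 224) onto the last
    hidden layer of an ILSVRC-pretrained ResNet-152."""
    net = models.resnet152(pretrained=True)
    net.fc = torch.nn.Identity()        # drop the classification head
    net.eval()
    with torch.no_grad():
        return net(images).numpy()

# feat_x = extract_resnet_features(batch_from_p)   # e.g. images from class p
# feat_y = extract_resnet_features(batch_from_q)   # e.g. images from class q
# p_value = dmmd_permutation_pvalue(feat_x, feat_y, num_perms=1000)
```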
For the
C2ST
, we train a standard logistic regression
on top of the pretrained features extracted from the
same neural network as for our methods.
For the
kernel MMD
we report two kernel band-
width selection strategies for the Gaussian kernel. The
first variant is the “median distance” heuristic (Gretton et al., 2012a), which selects the median of the Euclidean distances of all data points (MMD-med). The second variant, reported by Gretton et al. (2012b), splits the data into two disjoint sets, selects the bandwidth that maximizes power on the first set, and evaluates the MMD on the second set (MMD-opt). We use the implementation provided by Jitkrittum et al. (2016), which estimates the null hypothesis via a Monte Carlo permutation scheme (we again use $M = 1000$ permutations).
For the
Smoothed Characteristic Functions
(SCF)
and
Mean Embeddings
(ME), we select the number
of test locations based on the task and sample size. The
locations are selected either randomly (as presented by
Chwialkowski et al. (2015)) or optimized on half of the
data via the procedure described by Jitkrittum et al.
(2016). The kernel was either selected using the me-
dian heuristic, or via a grid search as by Chwialkowski
et al. (2015); Jitkrittum et al. (2016). In each case we
report the kernel and location selection method that
performed best on the given task, with details given
in the corresponding paragraphs. Note that for very
small sample sizes, both SCF and ME oftentimes do
not control the type-1 error rate properly, since they
were designed for larger sample sizes. This results in
highly variable type-2 error rate for small
m
in the ex-
periments. Again, we use the implementation provided
by Jitkrittum et al. (2016).
In addition to these published methods, we also com-
pare our method against a
deep kernel MMD test
(k-DMMD), i.e. the MMD test where the output of a
pretrained neural network gets fed into a Gaussian ker-
nel (instead of a linear kernel as in our case). Jitkrittum
et al. (2018) used this method for relative goodness-
of-fit testing instead of two-sample testing. For image
data, we select the bandwidth parameter for the Gaus-
sian kernel via the median heuristic, and for audio data
via the power maximization technique (in each case
the other variant performs considerably worse); the
pretrained networks are the same as for our tests and
the C2ST.
All experiments were run over 1000 runs. Type-1 error
rates are estimated by drawing both samples (without
replacement) from the same class and computing the
rate of rejections. Similarly, type-2 error rates are esti-
mated as the rate of not rejecting the null hypothesis
when sampling from two distinct classes. All figures of
type-1 and type-2 error rates show the 95% confidence
interval based on a Wilson Score interval (and a “rule-of-
three“ approximation in the case of 0-values (Eypasch
et al., 1995)). In all settings we fixed the significance
level at $\alpha = 0.05$. In addition, we show in Ap-
pendix D.3 empirically that also for smaller significance
levels high power can be preserved. Preprocessing for
image data is explained in Appendix D.2.
[Figure 1 shows five panels plotting error rate against m (per population): (a) type-1 error rate on AM audio data; (b) type-2 error rate on AM audio data; (c) type-2 error rate on aircraft data; (d) type-2 error rate on KDEF data; (e) type-2 error rate on dogs data. Methods compared: DFDA-sup (ours), DMMD-sup (ours), C2ST-sup, k-DMMD-sup, MMD-med, MMD-opt, SCF, ME, DFDA-unsup (ours), DMMD-unsup (ours), C2ST-unsup, k-DMMD-unsup.]

Figure 1: Results on AM audio (top row) and natural image (bottom row) data sets. Suffixes “-sup” indicate supervised pretraining, “-unsup” indicates unsupervised pretraining.
5.2 Control of Type-1 Error Rate
Since the presented test procedures are not exact tests
it is important to verify that the type-1 error rate is
controlled at the proper level. Figure 1a shows that
the empirical type-1 error rate is well controlled for the
amplitude modulated audio data introduced in the next
section. For the other data sets, results are provided
in Appendix D.4.
5.3 Power Analysis
Amplitude Modulated Audio Data
Here we an-
alyze the proposed test on the amplitude modulated
audio example from (Gretton et al., 2012b). The task
in this setting is to distinguish snippets from two dif-
ferent songs after they have been amplitude modulated
(AM) and mixed with noise. We use the same pre-
processing and amplitude modulation as Gretton et al.
(2012b). We use the freely available music from Gra-
matik (2014); distribution $p$ is sampled from track four, distribution $q$ from track five, and the remaining tracks on the album were used for training the network in a multi-class classification setting. As our neural network
architecture we use a simple convolutional network, a
variant from Dai et al. (2017), called M5 therein; see
Appendix D.6 for details.
Figure 1b reports the results with varying numbers of observations under constant noise level $\sigma^2 = 1$. Our method shows high power, even at low sample sizes, whereas kernel methods need large amounts of data to deal with the task. Note that these results are consistent with the original results in Gretton et al. (2012b), where the authors fixed the sample size at $m = 10{,}000$ and consequently only used the (significantly less powerful) linear-time MMD test.
Aircraft
We investigate the Fine-Grained Visual
Classification of Aircraft data set (Maji et al., 2013).
We select two visually similar aircraft families, namely
Boeing 737 and Boeing 747, as populations $p$ and $q$,
respectively. The neural network embeddings are ex-
tracted from a ResNet-152 (He et al., 2016) trained on
ILSVRC (Russakovsky et al., 2015). Figure 1c shows
that all neural network architectures perform consid-
erably better than the kernel methods. Furthermore,
our proposed tests can also outperform both the C2ST
and the deep kernel MMD.
Facial Expressions
The Karolinska Directed Emo-
tional Faces (KDEF) data set (Lundqvist et al., 1998)
Table 1: Results on neuroimaging data, comparing subjects who are cognitively normal (CN), have mild cognitive impairment (MCI), or have Alzheimer's disease (AD). APOE has neutral variant ε3 and risk-factor variant ε4. Numbers in parentheses denote sample size.

X (# obs)        Y (# obs)        p-value
CN (490)         AD (314)         9.49 · 10^{-5}
CN (490)         MCI (287)        2.44 · 10^{-4}
MCI (287)        AD (314)         1.45 · 10^{-3}
APOE ε3 (811)    APOE ε4 (152)    1.40 · 10^{-2}
has been previously used by Jitkrittum et al. (2016);
Lopez-Paz and Oquab (2016). The task is to distin-
guish between faces showing positive (happy, neutral,
surprised) and negative (afraid, angry, disgusted) emo-
tions. The feature embeddings are again obtained from
a ResNet-152 trained on ILSVRC. Results can be found
in Figure 1d. Even though the images in ImageNet
and KDEF are very different, the neural network tests
again outperform the kernel methods. Also note that
the apparent advantage of the mean embedding test
for low sample sizes is due to an unreasonably high
type-1 error rate ($> 0.11$ and $> 0.085$ at $m = 10$ and $15$, respectively).
Stanford Dogs
Lastly, we evaluate our tests on the
Stanford Dogs data set (Khosla et al., 2011), consisting
of 120 classes of different dog breeds. As test classes
we select the dog breeds ‘Irish wolfhound’ and ‘Scottish deerhound’, two breeds that are visually extremely
similar. Since the data set is a subset of the ILSVRC
data, we cannot train the networks on the whole Im-
ageNet data again. Instead, we train a small 6-layer
convolutional neural network on the remaining 118
classes in a multi-class classification setting and use
the embedding from the last hidden layer. To show
that our tests can also work with unsupervised transfer-
learning, we also train a convolutional autoencoder on
this data; the encoder part is identical to the super-
vised CNN, see Appendix D.7 for details. Note that
for this setting, the theoretical consistency guarantees
from Theorem 3.2 do not hold, although the type-1
error rate is still asymptotically controlled. Figure 1e
reports the results, with *-sup denoting the supervised,
and *-unsup the unsupervised transfer-learning task.
As expected, tests based on the supervised embedding
approach outperform other tests by a large margin.
However, the unsupervised DMMD and DFDA still
outperform kernel-based tests. Interestingly, both the
C2ST and the k-DMMD method seem to suffer more
severely from the mediocre feature embedding than our
tests. One potential explanation for this phenomenon
is the ability of DMMD and DFDA to fine-tune on the
Figure 2: Slices of 3D-MRI scans of an Alzheimer’s
disease patient (A) and a cognitively normal individ-
ual (B). Note the enlargement of the lateral ventricles
(indicated by red arrows) in the Alzheimer’s disease
patient.
test data without the need to perform a data split.
Three-dimensional Neuroimaging Data
In this
section, we apply the DFDA test procedure to 3D
Magnetic Resonance Imaging (MRI) scans and genetic
information from the Alzheimer’s Disease Neuroimag-
ing Initiative (ADNI) (Mueller et al., 2005). To this
end, we transfer a 3D convolutional autoencoder that
has been trained on MRI scans from the Brain Ge-
nomics Superstruct Project (Holmes et al., 2015) to
perform statistical testing on the ADNI data. Details
on preprocessing and network architecture are provided
in Appendix D.10.
The ADNI dataset consists of individuals diagnosed
with Alzheimer’s Disease (AD), with Mild Cognitive
Impairment (MCI), or as cognitively normal (CN);
Figure 2 shows exemplary images of an AD and a
CN subject. Table 1 shows that our test can detect
statistically significant differences between MRI scans
of individuals with a different diagnosis. Additionally,
we evaluate whether our test can detect differences
between individuals who have a known genetic risk
factor for neurodegenerative diseases and individuals
without that risk factor. In particular, we compare the
two variants ε3 (the “normal” variant) and ε4 (the risk-factor variant) in the Apolipoprotein E (APOE) gene,
which is related to AD and other diseases (Corder et al.,
1993). By grouping subjects according to which variant
they exhibit we test for statistical dependence between
a (binary) genetic mutation and (continuous) variation
in 3D MRI scans. Table 1 shows that individuals with
ε4 and ε3 APOE variants are significantly different,
suggesting a statistical dependence between genetic
variation and structural brain features.
Acknowledgements
The authors thank Stefan Konigorski and Jesper Lund
for helpful discussions and comments. Marius Kloft
acknowledges support by the German Research Foun-
dation (DFG) award KL 2698/2-1 and by the Federal
Ministry of Science and Education (BMBF) awards
031L0023A, 01IS18051A, and 031B0770E. Part of the
work was done while Marius Kloft was a sabbatical
visitor of the DASH Center at the University of South-
ern California. This work has been funded by the
Federal Ministry of Education and Research (BMBF,
Germany) in the project KI-LAB-ITSE (project num-
ber 01|S19066).
Data used in the preparation of this article were
obtained from the Alzheimer’s Disease Neuroimaging
Initiative (ADNI) database (adni.loni.usc.edu).
As such, the investigators within the ADNI con-
tributed to the design and implementation of ADNI
and/or provided data but did not participate in
analysis or writing of this report. A complete
listing of ADNI investigators can be found at:
http://adni.loni.usc.edu/wp-content/uploads/
how_to_apply/ADNI_Acknowledgement_List.pdf
.
Data collection and sharing of ADNI was funded by the
Alzheimer’s Disease Neuroimaging Initiative (ADNI)
(National Institutes of Health Grant U01 AG024904)
and DOD ADNI (Department of Defense award
number W81XWH-12-2-0012). ADNI is funded by the
National Institute on Aging, the National Institute of
Biomedical Imaging and Bioengineering, and through
generous contributions from the following: Alzheimer’s
Association; Alzheimer’s Drug Discovery Foundation;
BioClinica Inc; Biogen Idec Inc; Bristol-Myers Squibb
Company; Eisai Inc; Elan Pharmaceuticals Inc; Eli
Lilly and Company; F. Hoffmann-La Roche Ltd and
its affiliated company Genentech Inc; GE Healthcare;
Innogenetics N.V.; IXICO Ltd; Janssen Alzheimer
Immunotherapy Research & Development LLC;
Johnson & Johnson Pharmaceutical Research &
Development LLC; Medpace Inc; Merck & Co Inc;
Meso Scale Diagnostics LLC; NeuroRx Research;
Novartis Pharmaceuticals Corporation; Pfizer Inc;
Piramal Imaging; Servier; Synarc Inc; and Takeda
Pharmaceutical Company. The Canadian Institutes of
Health Research is providing funds to support ADNI
clinical sites in Canada. Private sector contributions
are facilitated by the Foundation for the National
Institutes of Health (http://www.fnih.org). The
grantee organization is the Northern California
Institute for Research and Education, and the study is
coordinated by the Alzheimer’s Disease Cooperative
Study at the University of California, San Diego.
ADNI data are disseminated by the Laboratory for
Neuro Imaging at the University of Southern California.
Samples from the National Cell Repository for AD
(NCRAD), which receives government support under a
cooperative agreement grant (U24 AG21886) awarded
by the National Institute on Aging (AIG), were used
in this study. Funding for the WGS was provided by
the Alzheimer’s Association and the Brin Wojcicki
Foundation.
References
Michael Arbel, Dougal Sutherland, Mikołaj Bińkowski,
and Arthur Gretton. On gradient regularizers for
mmd gans. In Advances in Neural Information Pro-
cessing Systems, pages 6700–6710, 2018.
Johannes Bausch. On the efficient calculation of a
linear combination of chi-square random variables
with an application in counting string vacua. Journal
of Physics A: Mathematical and Theoretical, 46(50):
505202, 2013.
Vidmantas Bentkus. A Lyapunov-type bound in R^d.
Theory of Probability & Its Applications, 49(2):311–
323, 2005.
Mikołaj Bińkowski, Dougal J Sutherland, Michael Ar-
bel, and Arthur Gretton. Demystifying mmd gans.
arXiv preprint arXiv:1801.01401, 2018.
Kacper P Chwialkowski, Aaditya Ramdas, Dino Sejdi-
novic, and Arthur Gretton. Fast two-sample testing
with analytic representations of probability measures.
In Advances in Neural Information Processing Sys-
tems, pages 1981–1989, 2015.
EH Corder, AM Saunders, WJ Strittmatter,
DE Schmechel, PC Gaskell, GW Small, AD Roses,
JL Haines, and MA Pericak-Vance. Gene dose
of apolipoprotein e type 4 allele and the risk of
alzheimer’s disease in late onset families. Science,
261(5):921–923, 1993.
Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samar-
jit Das. Very deep convolutional neural networks for
raw waveforms. In 2017 IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP), pages 421–425. IEEE, 2017.
Luc Devroye, László Györfi, and Gábor Lugosi. A
probabilistic theory of pattern recognition, volume 31.
Springer Science & Business Media, 2013.
Rick Durrett. Probability: theory and examples, vol-
ume 49. Cambridge university press, 2019.
Michael D Ernst et al. Permutation methods: a basis
for exact inference. Statistical Science, 19(4):676–685,
2004.
Ernst Eypasch, Rolf Lefering, CK Kum, and Hans
Troidl. Probability of adverse events that have not
yet occurred: a statistical reminder. Bmj, 311(7005):
619–620, 1995.
Jerome Friedman. On multivariate goodness-of-fit and
two-sample testing. In Statistical Problems in Parti-
cle Physics, Astrophysics, and Cosmology, page 311,
2003.
Noah Golowich, Alexander Rakhlin, and Ohad Shamir.
Size-independent sample complexity of neural net-
works. arXiv preprint arXiv:1712.06541, 2017.
Gramatik. The age of reason.
http://
dl.lowtempmusic.com/Gramatik-TAOR.zip
, 2014.
[Online; accessed May/23/2019].
Arthur Gretton, Karsten M Borgwardt, Malte J Rasch,
Bernhard Schölkopf, and Alexander Smola. A ker-
nel two-sample test. Journal of Machine Learning
Research, 13(Mar):723–773, 2012a.
Arthur Gretton, Dino Sejdinovic, Heiko Strathmann,
Sivaraman Balakrishnan, Massimiliano Pontil, Kenji
Fukumizu, and Bharath K Sriperumbudur. Optimal
kernel choice for large-scale two-sample tests. In
Advances in neural information processing systems,
pages 1205–1213, 2012b.
Boris Hanin. Universal function approximation by deep
neural nets with bounded width and relu activations.
arXiv preprint arXiv:1708.02691, 2017.
Zaïd Harchaoui, Francis R Bach, and Èric Moulines.
Testing for homogeneity with kernel fisher discrim-
inant analysis. In Advances in Neural Information
Processing Systems, pages 609–616, 2008.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 770–778, 2016.
Avram J Holmes, Marisa O Hollinshead, Timothy M
O’Keefe, Victor I Petrov, Gabriele R Fariello,
Lawrence L Wald, Bruce Fischl, Bruce R Rosen,
Ross W Mair, Joshua L Roffman, et al. Brain ge-
nomics superstruct project initial data release with
structural, functional, and behavioral measures. Sci-
entific data, 2:150031, 2015.
Sergey Ioffe and Christian Szegedy. Batch normal-
ization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint
arXiv:1502.03167, 2015.
Wittawat Jitkrittum, Zoltán Szabó, Kacper P
Chwialkowski, and Arthur Gretton. Interpretable dis-
tribution features with maximum testing power. In
Advances in Neural Information Processing Systems,
pages 181–189, 2016.
Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn
Sangkloy, James Hays, Bernhard Schölkopf, and
Arthur Gretton. Informative features for model com-
parison. In Advances in Neural Information Process-
ing Systems, pages 808–819, 2018.
Aditya Khosla, Nityananda Jayadevaprakash, Bang-
peng Yao, and Li Fei-Fei. Novel dataset for fine-
grained image categorization. In First Workshop on
Fine-Grained Visual Categorization, IEEE Confer-
ence on Computer Vision and Pattern Recognition,
Colorado Springs, CO, June 2011.
Ilmun Kim, Aaditya Ramdas, Aarti Singh, and Larry
Wasserman. Classification accuracy as a proxy for
two sample testing. arXiv preprint arXiv:1602.02210,
2016.
Diederik P Kingma and Jimmy Ba. Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Erich L Lehmann and Joseph P Romano. Testing
statistical hypotheses. Springer Science & Business
Media, 2006.
Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming
Yang, and Barnabás Póczos. Mmd gan: Towards
deeper understanding of moment matching network.
In Advances in Neural Information Processing Sys-
tems, pages 2203–2213, 2017.
David Lopez-Paz and Maxime Oquab. Revisit-
ing classifier two-sample tests. arXiv preprint
arXiv:1610.06545, 2016.
Jie Lu, Vahid Behbood, Peng Hao, Hua Zuo, Shan
Xue, and Guangquan Zhang. Transfer learning using
computational intelligence: a survey. Knowledge-
Based Systems, 80:14–23, 2015.
Daniel Lundqvist, Anders Flykt, and Arne Öhman. The
karolinska directed emotional faces (kdef). CD ROM
from Department of Clinical Neuroscience, Psychol-
ogy section, Karolinska Institutet, 91:630, 1998.
S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and
A. Vedaldi. Fine-grained visual classification of air-
craft. Technical report, 2013.
Bettina Mieth, Marius Kloft, Juan Antonio Rodríguez,
Sören Sonnenburg, Robin Vobruba, Carlos Morcillo-
Suárez, Xavier Farré, Urko M Marigorta, Ernst Fehr,
Thorsten Dickhaus, et al. Combining multiple hy-
pothesis testing with machine learning increases the
statistical power of genome-wide association studies.
Scientific reports, 6:36671, 2016.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Tal-
walkar. Foundations of machine learning. MIT press,
2018.
Susanne G Mueller, Michael W Weiner, Leon J Thal,
Ronald C Petersen, Clifford Jack, William Jagust,
John Q Trojanowski, Arthur W Toga, and Laurel
Beckett. The alzheimer’s disease neuroimaging ini-
tiative. Neuroimaging Clinics, 15(4):869–877, 2005.
J Neyman and ES Pearson. On the problem of the most
efficient tests of statistical hypotheses. Philosophical
Transactions of the Royal Society of London. Series
A, Containing Papers of a Mathematical or Physical
Character, 231:289–337, 1933.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory
Chanan, Edward Yang, Zachary DeVito, Zeming Lin,
Alban Desmaison, Luca Antiga, and Adam Lerer.
Automatic differentiation in pytorch. In NIPS-W,
2017.
Lutz Prechelt. Early stopping-but when? In Neural
Networks: Tricks of the trade, pages 55–69. Springer,
1998.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause,
Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej
Karpathy, Aditya Khosla, Michael Bernstein, Alexan-
der C. Berg, and Li Fei-Fei. ImageNet Large Scale
Visual Recognition Challenge. International Journal
of Computer Vision (IJCV), 115(3):211–252, 2015.
doi: 10.1007/s11263-015-0816-y.
Ingo Steinwart and Andreas Christmann. Support vec-
tor machines. Springer Science & Business Media,
2008.
C. Wah, S. Branson, P. Welinder, P. Perona, and S. Be-
longie. The Caltech-UCSD Birds-200-2011 Dataset.
Technical Report CNS-TR-2011-001, California In-
stitute of Technology, 2011.
Qiantong Xu, Gao Huang, Yang Yuan, Chuan Guo,
Yu Sun, Felix Wu, and Kilian Weinberger. An em-
pirical study on evaluation metrics of generative ad-
versarial networks. arXiv preprint arXiv:1806.07755,
2018.
Hao Zhou, Vamsi K Ithapu, Sathya Narayanan Ravi,
Vikas Singh, Grace Wahba, and Sterling C Johnson.
Hypothesis testing in unsupervised domain adap-
tation with applications in alzheimer’s disease. In
Advances in neural information processing systems,
pages 2496–2504, 2016.
Quan Zhou and Yongtao Guan. On the null distribu-
tion of bayes factors in linear regression. Journal
of the American Statistical Association, 113(523):
1362–1371, 2018.