Large scale testing



Multiple testing

When \(m\) null hypotheses are tested simultaneously, the possible outcomes can be summarized in the following table:

|                                              | Null hypothesis \(H_0\) true | Alternative hypothesis \(H_A\) true | Total    |
|----------------------------------------------|------------------------------|-------------------------------------|----------|
| Rejected (test declared significant)         | \(V\)                        | \(U\)                               | \(R\)    |
| Not rejected (test declared not significant) | \(m_0-V\)                    | \(m_1-U\)                           | \(m-R\)  |
| Total                                        | \(m_0\)                      | \(m_1\)                             | \(m\)    |


where:

  • \(m\) is the total number of hypotheses tested
  • \(m_0\) is the number of true null hypotheses
  • \(m_1=m-m_0\) is the number of true alternative hypotheses
  • \(V\) is the number of false positives (type I error) (also called "false discoveries")
  • \(U\) is the number of true positives (also called "true discoveries")
  • \(m_1-U\) is the number of false negatives (type II error)
  • \(m_0-V\) is the number of true negatives
  • \(R = V + U\) is the number of rejected null hypotheses (also called "discoveries", true or false)

Type I and type II errors are in direct competition and a compromise between the two is necessary. If we reject more hypotheses, we usually have more type I errors but fewer type II errors and vice versa.
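This trade-off can be made concrete with a small simulation (a sketch: \(N(0,1)\) statistics for the true nulls, a hypothetical mean shift of 3 for the alternatives), counting both error types at several thresholds:

```r
set.seed(123)
m <- 100; m0 <- 80; effect <- 3
setT <- sample(1:m, size = m0, replace = FALSE) # indices of the true nulls
stats <- rnorm(m)
stats[-setT] <- stats[-setT] + effect           # shift under the alternative
pvals <- pnorm(stats, lower.tail = FALSE)

alphas <- c(0.20, 0.05, 0.01)
errors <- sapply(alphas, function(a) {
  setR <- which(pvals <= a)
  c(typeI  = sum(setR %in% setT),             # V: true nulls rejected
    typeII = sum(!((1:m)[-setT] %in% setR)))  # m1 - U: alternatives missed
})
colnames(errors) <- alphas
errors # lowering alpha reduces type I errors but inflates type II errors
```

As the threshold shrinks, \(V\) can only decrease while the number of missed alternatives can only grow.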

EXAMPLE: consider \(m\) one-sided z-tests with statistics \(Z_i \sim N(\mu_i,1)\) for \(H_i: \mu_i = 0\) against \(\mu_i > 0\). Each test rejects \(H_i\) when its p-value \(p_i = 1 - \Phi(Z_i)\) satisfies \(p_i \leq \alpha\), or equivalently when \(Z_i \geq z_{1-\alpha}\),

where \(z_{1-\alpha}\) is the quantile of order \(1-\alpha\) of a \(N(0,1)\) and \(\Phi\) is the cdf (cumulative distribution function) of a \(N(0,1)\).

set.seed(123)
alpha <- 0.05
m <- 100    # total number of hypotheses
m0 <- 80    # number of true null hypotheses
effect <- 3 # mean shift under the alternative
setT <- sample(1:m, size = m0, replace = FALSE) # indices of the true nulls
stats <- rnorm(m)                     # under H0: Z_i ~ N(0,1)
stats[-setT] <- stats[-setT] + effect # under HA: Z_i ~ N(effect, 1)
pvals <- pnorm(stats, lower.tail = FALSE) # one-sided p-values
setR <- which(pvals <= alpha)             # rejected hypotheses
# setR <- which(stats >= qnorm(1 - alpha)) # equivalently
table(rejected = 1:m %in% setR,
      true     = 1:m %in% setT)
##         true
## rejected FALSE TRUE
##    FALSE     3   75
##    TRUE     17    5

The rejected hypotheses number \(17 + 5 = 22\), while \(3 + 75 = 78\) hypotheses are not rejected. There are 5 type I errors (5 true nulls wrongly rejected out of 22 rejections) and 3 type II errors (3 alternatives missed).

Familywise error rate

Multiple testing methods seek to reject as many hypotheses as possible while controlling some measure of type I error. The most classical such measure is the familywise error rate (FWER), the probability that the rejected set contains at least one error:

\[\mathrm{FWER} = \mathrm{P}(V > 0)\]

The FWER is controlled at level \(\alpha\) when the rejection set \(\mathcal{R}\) (i.e., the rejection threshold) is chosen such that:

\[\mathrm{FWER} \leq \alpha\]
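To see the problem FWER control addresses, a quick Monte Carlo sketch (a hypothetical setting where all \(m\) nulls are true) estimates \(\mathrm{P}(V>0)\) when each test is naively run at level \(\alpha\); under independence this probability is \(1-(1-\alpha)^m\), close to 1:

```r
set.seed(123)
alpha <- 0.05
m <- 100   # number of tests, all nulls true
B <- 2000  # Monte Carlo replications
anyError <- replicate(B, {
  pvals <- pnorm(rnorm(m), lower.tail = FALSE) # p-values under the global null
  any(pvals <= alpha)                          # did we reject anything (V > 0)?
})
mean(anyError)      # estimated FWER of unadjusted testing
1 - (1 - alpha)^m   # theoretical value under independence, about 0.994
```

With 100 independent tests at level 0.05, some false discovery is almost guaranteed, which is what the adjustments below are designed to prevent.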

Bonferroni

The Bonferroni method controls FWER at level \(\alpha\) by rejecting hypotheses only if they have a p-value less than \(\alpha / m\):

\[\mathcal{R} = \left\{H_i: p_{i}\leq \frac{\alpha}{m}\right\}\]

This single-step adjustment of the significance threshold is the simplest multiple testing method: it requires no assumptions on the dependence between the \(p\)-values.

The adjusted p-value according to the Bonferroni procedure corresponds to:

\[\tilde{p}_i = \min(mp_i, 1)\]

and the Bonferroni method rejects null hypotheses that have an adjusted p-value smaller than \(\alpha\):

\[\mathcal{R} = \{H_i: \tilde{p}_{i}\leq \alpha \}\]
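The adjusted p-values above correspond to R's built-in p.adjust with method "bonferroni"; a minimal sketch (the raw p-values here are hypothetical uniform draws standing in for \(p_1,\ldots,p_m\)):

```r
set.seed(123)
alpha <- 0.05
m <- 100
pvals <- runif(m)  # hypothetical raw p-values
padj <- p.adjust(pvals, method = "bonferroni") # computes pmin(m * pvals, 1)
# rejecting on adjusted p-values reproduces the alpha/m threshold rule
identical(which(padj <= alpha), which(pvals <= alpha / m)) # TRUE
```

Working with adjusted p-values is convenient because the same rejection rule \(\tilde{p}_i \leq \alpha\) applies whatever the adjustment method.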

Bonferroni proof
The assumptions on the \(p\)-values are needed only for the true null hypotheses, whose \(p\)-values we call \(q_1, \ldots, q_{m_0}\). If the corresponding hypotheses are true, these \(p\)-values are uniformly distributed between 0 and 1, or stochastically greater than uniform: \[\Pr(q_i \leq u) \leq u \quad \text{for all } u \in [0,1]\]

To properly motivate the Bonferroni method, we can view it as a corollary of Boole's inequality, which states that for any collection of events \(E_1, \ldots, E_k\) we have:

\[\mathrm{P}\big(\bigcup_{i=1}^k E_i\big) \leq \sum_{i=1}^k \mathrm{P}(E_i)\]

By Boole's inequality, the probability that there exists some \(i\) for which \(q_i \leq \alpha/m\) satisfies \[\Pr\big(\bigcup_{i=1}^{m_0} \{ q_i \leq \alpha/m \}\big) \leq \sum_{i=1}^{m_0} \Pr( q_i \leq \alpha/m ) \leq m_0\frac{\alpha}{m} \leq \alpha\]

The Bonferroni method commits a type I error only if \(q_i \leq \alpha/m\) for some true null \(i\), so \(\mathrm{P}(V > 0) \leq \alpha\): this proves FWER control at level \(\alpha\) for the Bonferroni method.
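The bound can also be checked numerically; a Monte Carlo sketch (hypothetical worst case with \(m_0 = m\), i.e., all nulls true) estimates the Bonferroni FWER:

```r
set.seed(123)
alpha <- 0.05
m <- 100
B <- 2000  # Monte Carlo replications
fwerBonf <- mean(replicate(B, {
  pvals <- pnorm(rnorm(m), lower.tail = FALSE) # all nulls true
  any(pvals <= alpha / m)                      # Bonferroni error (V > 0)?
}))
fwerBonf # estimated FWER, at or below alpha = 0.05
```

Under independence the exact value is \(1-(1-\alpha/m)^m \approx 1 - e^{-\alpha} \approx 0.049\), just under \(\alpha\), so the Bonferroni bound is nearly tight in this setting.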

set.seed(123)
alpha <- 0.05
m <- 100    # total number of hypotheses
m0 <- 80    # number of true null hypotheses
effect <- 3 # mean shift under the alternative
setT <- sample(1:m, size = m0, replace = FALSE) # indices of the true nulls
stats <- rnorm(m)                     # under H0: Z_i ~ N(0,1)
stats[-setT] <- stats[-setT] + effect # under HA: Z_i ~ N(effect, 1)
pvals <- pnorm(stats, lower.tail = FALSE)  # one-sided p-values
setR <- which(pvals <= alpha / m)          # Bonferroni rejections
# setR <- which(stats >= qnorm(1 - alpha/m)) # equivalently
table(rejected = 1:m %in% setR,
      true     = 1:m %in% setT)
##         true
## rejected FALSE TRUE
##    FALSE    16   80
##    TRUE      4    0

Now we have 0 type I errors but 16 type II errors: guarding against false discoveries has increased the number of missed alternatives.

Magnitude of Bonferroni's threshold

\(Z_i \sim N(\mu_i,1)\) is the test statistic

\(H_i: \mu_i = 0\) vs \(\bar{H}_i: \mu_i > 0\)

Bonferroni rejects \(H_i\) if \(Z_i \geq z_{1-\alpha/m}\).

  • For fixed \(\alpha\), one can show that for large \(m\): \[z_{1-\alpha/m} \approx \sqrt{2\log m}\]

  • There is asymptotically no dependence on \(\alpha\)
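The approximation in the first bullet can be justified in a few lines (a sketch, using the standard Gaussian tail approximation \(1-\Phi(z)\approx \phi(z)/z\) for large \(z\)):

```latex
% Bonferroni threshold: solve 1 - \Phi(z) = \alpha/m for z.
% For large z, 1 - \Phi(z) \approx \phi(z)/z, hence
\frac{e^{-z^{2}/2}}{z\sqrt{2\pi}} \approx \frac{\alpha}{m}
\quad\Longrightarrow\quad
\frac{z^{2}}{2} \approx \log m - \log\alpha - \log\!\big(z\sqrt{2\pi}\big).
```

As \(m \to \infty\), the term \(\log m\) dominates the right-hand side, so \(z_{1-\alpha/m} \approx \sqrt{2\log m}\), whatever the fixed value of \(\alpha\).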


alpha <- 0.05
m <- 10^(2:12)                                # number of tests, log-spaced
threshold <- qnorm(alpha/m, lower.tail = FALSE) # exact threshold z_{1-alpha/m}
approx <- sqrt(2 * log(m))                    # asymptotic approximation
plot(m, threshold, type = "l", log = "x", ylab = "threshold")
lines(m, approx, col = 2)                     # approximation in red