Problem:
Let's assume a collection of null hypotheses \(\mathcal{H}=\{H_1, \ldots, H_m\}\) that we want to test. An unknown number \(m_0\) of these are true, while the remaining \(m_1 = m-m_0\) are false. We call the set of true hypotheses \(\mathcal{T} \subseteq \mathcal{H}\) and the remaining collection of false hypotheses \(\mathcal{F} = \mathcal{H} \setminus \mathcal{T}\); the proportion of true hypotheses is therefore \(\pi_0 = m_0/m\).
Objective:
The objective of a multiple testing procedure is to choose a set \(\mathcal{R} \subseteq \mathcal{H}\) of hypotheses to reject.
Theoretical solution:
If we have \(p\)-values \(p_1,\ldots,p_m\) for the hypotheses \(H_1,\ldots,H_m\), we would intuitively reject all hypotheses whose \(p\)-value falls below some threshold \(T\): \[\mathcal{R} = \{H_i: p_{i}\leq T\}\] Ideally, the set of rejected hypotheses \(\mathcal{R}\) should agree as closely as possible with \(\mathcal{F}\).
Two types of errors:
False positives, or type I errors, are rejected hypotheses that are not false, i.e. those in \(\mathcal{R} \cap \mathcal{T}\); false negatives, or type II errors, are false hypotheses that we failed to reject, i.e. those in \(\mathcal{F} \setminus \mathcal{R}\). Rejected hypotheses are sometimes called discoveries, so the terms true discovery and false discovery are used for correct and incorrect rejections, respectively.
We can summarize the number of errors that occur in a hypothesis testing procedure in a contingency table. We can observe \(m\) and \(R\), but all quantities in the first two columns of the table are unobservable.
| | true (Null hypothesis \(H_0\) is true) | false (Alternative hypothesis \(H_A\) is true) | Total |
|---|---|---|---|
| rejected (Test is declared significant) | \(V\) | \(U\) | \(R\) |
| not rejected (Test is declared not significant) | \(m_0-V\) | \(m_1-U\) | \(m-R\) |
| Total | \(m_0\) | \(m_1\) | \(m\) |
where \(V\) is the number of false discoveries (type I errors), \(U\) the number of true discoveries, and \(R = V + U\) the total number of rejections; the number of type II errors is \(m_1 - U\).
Type I and type II errors are in direct competition and a compromise between the two is necessary. If we reject more hypotheses, we usually have more type I errors but fewer type II errors and vice versa.
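As a minimal illustration of this trade-off (the constants below are hypothetical, chosen to mirror the example that follows), we can sweep the threshold \(T\) and count both error types:

set.seed(1)
m <- 100; m0 <- 80                      # 80 true nulls, 20 false nulls
mu <- c(rep(0, m0), rep(3, m - m0))     # mean 3 under the alternative
p <- pnorm(rnorm(m, mean = mu), lower.tail = FALSE)
for (thr in c(0.1, 0.05, 0.01, 0.001)) {  # lowering the threshold T
  cat("T =", thr, " type I:", sum(p[1:m0] <= thr),
      " type II:", sum(p[-(1:m0)] > thr), "\n")
}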
EXAMPLE:
\(Z_i \sim N(\mu_i,1)\) is the test statistic
\(H_i: \mu_i = 0\) vs \(\bar{H}_i: \mu_i > 0\)
We reject \(H_i\) if \(Z_i \geq z_{1-\alpha}\) or equivalently when \(p_i = 1-\Phi(Z_i) \leq \alpha\)
where \(z_{1-\alpha}\) is the quantile of order \(1-\alpha\) of a \(N(0,1)\) and \(\Phi\) is the cdf (cumulative distribution function) of a \(N(0,1)\)
set.seed(123)
alpha <- 0.05
m <- 100       # total number of hypotheses
m0 <- 80       # number of true null hypotheses
effect <- 3    # mean shift under the alternative
setT <- sample(1:m, size = m0, replace = FALSE)  # indices of the true nulls
stats <- rnorm(m)
stats[-setT] <- stats[-setT] + effect            # add the effect to the false nulls
pvals <- pnorm(stats, lower.tail = FALSE)
setR <- which(pvals <= alpha)                    # reject at the unadjusted level alpha
# setR <- which(stats >= qnorm(1 - alpha))       # equivalently
table(rejected = 1:m %in% setR,
      true     = 1:m %in% setT)
##         true
## rejected FALSE TRUE
##    FALSE     3   75
##    TRUE     17    5
17 + 5 = 22 hypotheses are rejected, while 3 + 75 = 78 are not rejected. There are 5 type I errors (5 of the 22 rejections are wrong) and 3 type II errors (3 false hypotheses were not rejected).
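The same counts can be read off directly from setR and setT (a minimal sketch reusing the objects defined above):

V  <- length(intersect(setR, setT))              # type I errors: rejected true nulls (5)
t2 <- length(setdiff(setdiff(1:m, setT), setR))  # type II errors: false nulls not rejected (3)
c(typeI = V, typeII = t2, rejected = length(setR))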
Numerous testing methods seek to reject as many hypotheses as possible while controlling some measure of type I error. The most classical criterion for multiple testing is the familywise error rate (FWER), the probability that the rejected set contains at least one error:
\[\mathrm{FWER} = \mathrm{P}(V > 0)\]
The FWER is controlled at level \(\alpha\) when the set \(\mathcal{R}\) (i.e., the threshold \(T\)) is chosen such that: \[\mathrm{FWER} \leq \alpha\]
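To see why explicit control is needed, here is a minimal simulation (assuming independent uniform \(p\)-values for the true nulls; the constants mirror the example above) estimating \(\mathrm{P}(V > 0)\) when each test is simply run at level \(\alpha\):

set.seed(123)
alpha <- 0.05; m0 <- 80; nsim <- 10000
# a run commits at least one type I error if any of the m0 uniform p-values is below alpha
fwer_hat <- mean(replicate(nsim, any(runif(m0) <= alpha)))
fwer_hat  # close to 1 - (1 - alpha)^m0, about 0.98: errors are almost certain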
The Bonferroni method controls FWER at level \(\alpha\) by rejecting hypotheses only if they have a p-value less than \(\alpha / m\):
\[\mathcal{R} = \left\{H_i: p_{i}\leq \frac{\alpha}{m}\right\}\]
This single-step adjustment of the significance threshold is the simplest multiple testing method; it requires no assumptions about the dependence between the \(p\)-values.
The adjusted p-value according to the Bonferroni procedure corresponds to:
\[\tilde{p}_i = \min(mp_i, 1)\]
and the Bonferroni method rejects null hypotheses that have an adjusted p-value smaller than \(\alpha\):
\[\mathcal{R} = \{H_i: \tilde{p}_{i}\leq \alpha \}\]
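In R, these adjusted \(p\)-values are available through the base function p.adjust (a sketch reusing pvals and alpha from the example above):

p_adj <- p.adjust(pvals, method = "bonferroni")  # equals pmin(m * pvals, 1)
setR_adj <- which(p_adj <= alpha)                # same rejections as the alpha/m threshold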
Bonferroni proof:
The assumptions on the \(p\)-values need to hold only for the true hypotheses; we call these \(p\)-values \(q_1, \ldots, q_{m_0}\). Since their corresponding hypotheses are true, these \(p\)-values are uniformly distributed between 0 and 1, or they can be stochastically greater than uniform: \[
\mathrm{P}(q_i \leq u ) \leq u
\]
To properly motivate the Bonferroni method, we should consider it as a corollary of Boole's inequality, which states that for any collection of events \(E_1, \ldots, E_k\), we have:
\[\mathrm{P}\big(\bigcup_{i=1}^k E_i\big) \leq \sum_{i=1}^k \mathrm{P}(E_i)\]
From Boole's inequality, the probability that there exists some \(i\) for which \(q_i \leq \alpha/m\) is bounded by \[\mathrm{P}\big(\bigcup_{i=1}^{m_0} \{ q_i \leq \alpha/m \}\big) \leq \sum_{i=1}^{m_0} \mathrm{P}( q_i \leq \alpha/m ) \leq m_0\frac{\alpha}{m} \leq \alpha\]
The Bonferroni method commits at least one type I error exactly when \(q_i \leq \alpha/m\) for some \(i\), so the probability above is precisely the FWER; this proves FWER control at level \(\alpha\) for the Bonferroni method.
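A minimal Monte Carlo check of this bound (again assuming independent uniform \(p\)-values for the \(m_0\) true nulls):

set.seed(123)
alpha <- 0.05; m <- 100; m0 <- 80; nsim <- 10000
# fraction of runs in which Bonferroni commits at least one type I error
fwer_bonf <- mean(replicate(nsim, any(runif(m0) <= alpha/m)))
fwer_bonf  # about m0 * alpha / m = 0.04, below the nominal alpha = 0.05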
set.seed(123)
alpha <- 0.05
m <- 100
m0 <- 80
effect <- 3
setT <- sample(1:m, size = m0, replace = FALSE)  # same simulated data as before
stats <- rnorm(m)
stats[-setT] <- stats[-setT] + effect
pvals <- pnorm(stats, lower.tail = FALSE)
setR <- which(pvals <= alpha/m)                  # Bonferroni: reject at level alpha/m
# setR <- which(stats >= qnorm(1 - alpha/m))     # equivalently
table(rejected = 1:m %in% setR,
      true     = 1:m %in% setT)
##         true
## rejected FALSE TRUE
##    FALSE    16   80
##    TRUE      4    0
Now we have 0 type I errors but 16 type II errors; the latter have increased (from 3 with the unadjusted threshold).
Magnitude of Bonferroni's threshold:
\(Z_i \sim N(\mu_i,1)\) is the test statistic
\(H_i: \mu_i = 0\) vs \(\bar{H}_i: \mu_i > 0\)
Bonferroni rejects \(H_i\) if \(Z_i \geq z_{1-\alpha/m}\).
For fixed \(\alpha\), one can show that for \(m \gg 1\) \[z_{1-\alpha/m} \approx \sqrt{2\log m}\]
Asymptotically, the threshold therefore does not depend on \(\alpha\).
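A short sketch of why this holds, using the standard Gaussian tail approximation \(1-\Phi(z) \approx \phi(z)/z\) for large \(z\): setting the tail probability equal to \(\alpha/m\),
\[\frac{\alpha}{m} \approx \frac{e^{-z^2/2}}{z\sqrt{2\pi}} \quad\Longrightarrow\quad \frac{z^2}{2} \approx \log m - \log \alpha - \log\big(z\sqrt{2\pi}\big),\]
and since \(\log m\) dominates the remaining terms as \(m \to \infty\), we get \(z \approx \sqrt{2\log m}\).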
alpha <- 0.05
m <- 10^(2:12)                   # m ranging from 1e2 to 1e12
threshold <- qnorm(1 - alpha/m)  # exact Bonferroni z-threshold
approx <- sqrt(2*log(m))         # asymptotic approximation
plot(m, threshold, type = "l", log = "x")  # exact threshold (black), log x-axis
lines(m, approx, col = 2)                  # approximation sqrt(2 log m) (red)
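A quick numeric comparison (reusing m, threshold, and approx from above) shows the two values converging as \(m\) grows:

round(cbind(m = m, exact = threshold, approx = approx, ratio = approx/threshold), 3)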