Fail fearlessly. Learning things is about trying not to fail. Discovering things is about failing all the time and getting back up and trying again. Research is about failing over and over and keeping your spirits up.
Jeffrey Leek
Johns Hopkins Bloomberg School of Public Health
The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?
The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple: Is treatment A better than treatment B?
The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. Which variables matter among the thousands measured? How do you relate unrelated information?
http://www-stat.stanford.edu/~ckirby/brad/papers/2010LSIexcerpt.pdf
Suppose you are testing a hypothesis that a parameter \(\beta\) equals zero versus the alternative that it does not equal zero. These are the possible outcomes.
| | \(\beta=0\) | \(\beta\neq0\) | Hypotheses |
|---|---|---|---|
| Claim \(\beta=0\) | \(U\) | \(T\) | \(m-R\) |
| Claim \(\beta\neq 0\) | \(V\) | \(S\) | \(R\) |
| Claims | \(m_0\) | \(m-m_0\) | \(m\) |
Type I error or false positive (\(V\)) - Say that the parameter does not equal zero when it does
Type II error or false negative (\(T\)) - Say that the parameter equals zero when it doesn't
False positive rate - The rate at which truly null results (\(\beta = 0\)) are called significant: \(E\left[\frac{V}{m_0}\right]\)
Family wise error rate (FWER) - The probability of at least one false positive \({\rm Pr}(V \geq 1)\)
False discovery rate (FDR) - The rate at which claims of significance are false \(E\left[\frac{V}{R}\right]\)
If P-values are correctly calculated, calling all \(P < \alpha\) significant will control the false positive rate at level \(\alpha\) on average.
Suppose that you perform 10,000 tests, \(\beta = 0\) for all of them, and you call all \(P < 0.05\) significant.
The expected number of false positives is: \(10,000 \times 0.05 = 500\) false positives.
How do we avoid so many false positives?
The Bonferroni correction is the oldest multiple testing correction.
Basic idea: suppose you do \(m\) tests; to control the FWER at level \(\alpha\), call a test significant only if its p-value is less than \(\alpha/m\).
Pros: easy to calculate, conservative. Cons: may be very conservative.
Instead of defining a per-test error rate, we can define an error rate over all of the tests, e.g.: \[{\rm Family\; wise \; error\; rate} = P(\{ {\rm \# \; of \; false \; positives} \geq 1\})\]
The most common (and first) method for controlling the FWER is the Bonferroni correction, if the rejection region for a single test is:
\[S_\alpha = \{p : p \leq \alpha\}\]
then if \(m\) tests are performed the rejection region is:
\[S^{bon}_\alpha = \{p_i : p_i \leq \alpha/m\}\]
Suppose there are \(m\) tests and the data for the first \(m_0\) tests follows the null distribution then: \[ P(\{ {\rm \# \; of \; false \; positives} \geq 1\}) = P\left(\sum_{i=1}^{m_0} I(p_i \leq \alpha/m) > 0\right)\] \[ = P\left(\bigcup_{i=1}^{m_0} \{p_i \leq \alpha/m\}\right)\] \[ \leq \sum_{i=1}^{m_0} P(p_i \leq \alpha/m)\] \[ \leq \frac{m_0}{m} \alpha \leq \alpha \]
\[ p^{bon}_i = \inf\{\alpha : p_i \in S_{\alpha}^{bon}\}\] \[ = \inf\{\alpha : p_i \leq \alpha/m\}\] \[ = \min\{m p_i,1\}\]
The adjusted p-value is no longer uniform under the null, but it is attractive because of the interpretation that \(p_i^{bon} \leq \alpha\) implies FWER \(\leq \alpha\). See p.adjust in R.
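As a quick check (with simulated p-values; this snippet is illustrative, not from the original slides), the Bonferroni adjustment from p.adjust matches the \(\min\{m p_i, 1\}\) formula above:

set.seed(1)
m <- 20
p <- runif(m)                       # simulated p-values, for illustration only
all.equal(p.adjust(p, method = "bonferroni"), pmin(m * p, 1))
[1] TRUE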
For independent test statistics we can be smarter:
\[ P( {\rm any\; null \; } p_i < \alpha/m) = 1 - P({\rm all \; null\;} p_i \geq \alpha/m)\] \[ = 1 - \left(\prod_{i=1}^{m_0} P(p_i \geq \alpha/m)\right)\] \[ = 1 - (1-\alpha/m)^{m_0}\] \[ \approx 1 - (1-\alpha/m)^{m}\]
The last approximation holds when \(m \approx m_0\). We could use this to get a smarter threshold if we believe the tests are independent (they never are). But it's not worth it, because \(1-(1-\alpha/m)^m \approx 1-e^{-\alpha} \approx 1- (1 - \alpha) = \alpha\).
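A quick numeric check of these approximations (the values of \(\alpha\) and \(m\) are arbitrary choices for illustration):

alpha <- 0.05
m <- 10000
alpha / m                    # Bonferroni per-test threshold
1 - (1 - alpha/m)^m          # chance of any null p-value below alpha/m; about 0.0488
1 - exp(-alpha)              # limiting value, also about 0.0488, i.e. essentially alpha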
In the extreme case, all tests have almost the same \(p_j\); if one is small, they're all small, so:
\[ P ({\rm any\; null\;} p_i < \alpha/m) \approx m_0/m P(p_1 < \alpha/m)\] \[ = (m_0/m) (\alpha/m)\] \[ \approx \alpha/m\]
But using \(p_i < \alpha\) would have been better. For positively dependent test statistics, increasing correlation \(\Rightarrow\) more conservative results on average. But we can get catastrophic errors.
Suppose \(p_i\) are all identical for the null cases and by chance \(p_i < \alpha/m\). How many errors?
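A small simulation of this extreme case (all null p-values come from a single shared statistic; a sketch for illustration only):

set.seed(2)
m <- 100
alpha <- 0.05
z <- rnorm(100000)                  # one shared null statistic per simulated study
p <- 2 * pnorm(-abs(z))             # the common p-value shared by all m null tests
mean(p < alpha/m)                   # P(at least one Bonferroni false positive), about alpha/m = 5e-04
mean(p < alpha)                     # per-test threshold would give about alpha = 0.05
# and whenever the shared p-value does fall below the threshold, all m tests are errors at once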
"A genome-wide association study identifies three loci associated with susceptibility to uterine fibroids"
For each of \(\sim 1\times10^7\) SNPs with genotype data \(X_{ij}\) fit the model: \[ {\rm logit}(P(Y_j = 1 | X_{ij})) = \beta_{0i} + \beta_{1i} X_{ij}\] or, for a quantitative outcome, the linear model \[ Y_j = \beta_{0i} + \beta_{1i}X_{ij} + \epsilon_{ij}\]
Calculate a P-value for each SNP, giving \(\{p_1,\ldots,p_m\}\).
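A toy version of this analysis (simulated genotypes and case/control labels, far fewer SNPs than the real study, and made-up variable names for illustration):

set.seed(3)
n <- 200                                         # subjects
m <- 50                                          # SNPs (a real GWAS would have millions)
Y <- rbinom(n, 1, 0.5)                           # case/control status
X <- matrix(rbinom(n * m, 2, 0.3), n, m)         # genotypes coded 0/1/2
pValues <- sapply(1:m, function(i) {
  fit <- glm(Y ~ X[, i], family = binomial)      # one logistic regression per SNP
  summary(fit)$coefficients[2, 4]                # Wald p-value for the SNP coefficient
})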
| | \(\beta=0\) | \(\beta\neq0\) | Hypotheses |
|---|---|---|---|
| Claim \(\beta=0\) | \(U\) | \(T\) | \(m-R\) |
| Claim \(\beta\neq 0\) | \(V\) | \(S\) | \(R\) |
| Claims | \(m_0\) | \(m-m_0\) | \(m\) |
"Classic" Bonferroni limits \(P(V \geq 1 \mid m)\); any \(V \geq 1\) is "equally bad"
A less conservative measure of (hypothetical) embarrassment \[\frac{V}{R\vee1} = \frac{\#{\rm false \; positives}}{\#{\rm declared \; positives}}\]
This is the most popular correction when performing lots of tests, say in genomics, imaging, astronomy, or other signal-processing disciplines.
Benjamini and Hochberg (1995) defined a set of rules which control the \(FDR\) for independent tests: order the p-values \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\), find \(\hat{k} = \max\{k : p_{(k)} \leq \frac{k}{m}\alpha\}\), and call the hypotheses corresponding to \(p_{(1)},\ldots,p_{(\hat{k})}\) significant.
This set of decisions will have \(FDR =E \left[\frac{V}{R \vee 1}\right] \leq (m_0/m) \alpha\), for any \(m_0, m\).
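The BH rule takes only a few lines of R; here is a sketch on simulated p-values, checked against p.adjust (the particular mix of null and non-null p-values is arbitrary):

set.seed(4)
alpha <- 0.05
p <- c(runif(900), rbeta(100, 1, 50))                      # 900 null p-values, 100 stochastically small ones
m <- length(p)
ps <- sort(p)
khat <- max(c(0, which(ps <= (1:m) / m * alpha)))          # largest k with p_(k) <= (k/m) alpha
cutoff <- if (khat > 0) ps[khat] else 0
rejectByHand <- p <= cutoff
identical(rejectByHand, p.adjust(p, method = "BH") <= alpha)   # TRUE: the two give the same rejections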
The original proof that the BH algorithm controls the FDR was based on an induction argument. Storey, Taylor and Siegmund (2004) gave an elegant and generalizable alternative proof based on martingales that we will study. The basic steps are: (1) show that the BH procedure is equivalent to thresholding the p-values at a random cutoff defined through an estimated FDR; (2) rewrite the false discovery proportion at that cutoff in terms of the process \(V(t)/t\); (3) show that \(V(t)/t\) is a martingale with time running backward and that the cutoff is a stopping time; (4) apply the optional stopping theorem.
Definition (Billingsley, adapted): Let \(X_t\) be a stochastic process on a probability space \((\Omega, \mathcal{F}, P)\) and let \(\{\mathcal{F}_t\}\) be an increasing set of \(\sigma\)-algebras such that \(\mathcal{F}_s \subset \mathcal{F}_t\) when \(s < t\). Then \(X_t\) is a martingale with respect to the filtration \(\{\mathcal{F}_t\}\) if: \(X_t\) is measurable with respect to \(\mathcal{F}_t\), \(E|X_t| < \infty\) for all \(t\), and \(E[X_t \mid \mathcal{F}_s] = X_s\) for all \(s < t\).
Put simply, if \(X_t\) is a stochastic process such that the conditions hold, then \(E[X_t | X_s] = X_s\) for \(s < t\).
Definition: A stopping time with respect to \(\{X_t\}\) is a random variable \(\tau\) such that the event \(\{\tau = t\}\) is measurable with respect to \(\mathcal{F}_t\) and \(\tau < \infty\) almost surely.
Suppose a gambler wins 1 dollar every time a flipped coin lands heads and loses 1 dollar every time it comes up tails. After the \(t\)th flip he has \(X_t\) dollars. Then his expected winnings after the next flip are: \[ E[X_{t+1} | X_{t}] = (X_t + 1)\times \frac{1}{2} + (X_t - 1) \times \frac{1}{2} = X_t \]
So \(X_t\) is a martingale. Some examples of stopping rules are: the first time the gambler's winnings reach 100 dollars, or the first time his winnings fall to \(-50\) dollars.
Examples of things that aren't stopping rules: the time at which the gambler's winnings are at their maximum, or the flip just before his winnings reach 100 dollars; both depend on the future of the process.
If \(\{X_t\}\) is a martingale with respect to the filtration \(\{\mathcal{F}_t\}\) and \(\tau\) is a stopping time for the martingale, then if \(E[\tau] < \infty\) and \(X_t\) is an integrable random variable then:
\[E[X_\tau ] = E[X_0]\]
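A quick Monte Carlo check of the theorem using the gambler example (the particular stopping rule, stop at +10 dollars or after 100 flips, is just an illustration):

set.seed(5)
stoppedValues <- replicate(10000, {
  x <- cumsum(sample(c(-1, 1), 100, replace = TRUE))   # winnings after each of 100 flips
  tau <- which(x >= 10)[1]                             # first time the winnings reach +10
  if (is.na(tau)) tau <- 100                           # never happened: stop at the last flip
  x[tau]
})
mean(stoppedValues)    # close to E[X_0] = 0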
Returning to the proof, we start off with an estimate of the false discovery rate:
\[\widehat{FDR}(t) = \frac{\pi_0 t}{(R(t) \vee 1)/m}\]
To be conservative we can let \(\pi_0 = 1\) (but we could use a conservative estimate of \(\pi_0\) and the proof would still hold). Then define the random cutoff:
\[T_{\alpha}(\widehat{FDR}(t)) = \sup\{0\leq t \leq 1: \widehat{FDR}(t) \leq \alpha\}\]
Calling all \(p_i \leq T_{\alpha}(\widehat{FDR}(t))\) significant is equivalent to the Benjamini-Hochberg procedure.
We need to show that \[p_{(\hat{k})} \leq T_{\alpha}(\widehat{FDR}(t)) < p_{(\hat{k} + 1)}\] where \(\hat{k}\) is the BH cutoff. But \[\widehat{FDR}(p_{(k)}) = \frac{p_{(k)}}{k/m}\] so the BH cutoff is \[\hat{k} = \max\{k: p_{(k)} \leq \frac{k}{m}\alpha \} = \max\{k : \widehat{FDR}(p_{(k)}) \leq \alpha\}\] For \(k > \hat{k}\) we have \(\widehat{FDR}(p_{(k)}) > \alpha\) and for \(k \leq \hat{k}\) we have \(\widehat{FDR}(p_{(k)}) \leq \alpha\). So \(p_{(\hat{k})} \leq T_{\alpha}(\widehat{FDR}(t)) < p_{(\hat{k} + 1)}\).
The false discovery proportion at cutoff \(t\) is \[\frac{V(t)}{R(t)}\]
But \(T_{\alpha}(\widehat{FDR}(t)) = \sup\{0\leq t \leq 1: \frac{mt}{R(t) \vee 1} \leq \alpha\}\) and since \((m\times t)/R(t)\) has only positive jumps and a final value of 1, we have that \[\alpha = \frac{T_{\alpha}(\widehat{FDR}(t))\times m}{R[T_{\alpha}(\widehat{FDR}(t))]} \implies R[T_{\alpha}(\widehat{FDR}(t))] = T_{\alpha}(\widehat{FDR}(t)) \times m/\alpha\]
Therefore \[\frac{V[T_{\alpha}(\widehat{FDR}(t))]}{R[T_{\alpha}(\widehat{FDR}(t))]} = \frac{\alpha}{m} \frac{V[T_{\alpha}(\widehat{FDR}(t))]}{T_{\alpha}(\widehat{FDR}(t))}\]
Lemma 1: If the p-values of the \(m_0\) null hypotheses are independent then \(\frac{V(t)}{t} = \frac{\sum_{i=1}^{m_0} 1(p_i \leq t)}{t}\) for \(0 \leq t \leq 1\) is a martingale with time running backward with respect to the filtration \(\mathcal{F}_t = \sigma(1\{p_i \leq s\}, t \leq s \leq 1,i=1,\ldots,m)\), in other words for \(s \leq t\) we have \(E[V(s)/s | \mathcal{F}_t] = V(t)/t\).
Lemma 2: The random variable \(T_{\alpha}(\widehat{FDR}(t))\) is a stopping time with respect to \(\mathcal{F}_t = \sigma(1\{p_i \leq s\}, t \leq s \leq 1,i=1,\ldots,m)\).
So finally, since the process \(V(t)/t\) stopped at \(T_{\alpha}(\widehat{FDR}(t))\) is bounded by \(m/\alpha\), the optional stopping theorem gives us:
\[FDR[T_{\alpha}(\widehat{FDR}(t))] = \frac{\alpha}{m} E\left[\frac{V[T_{\alpha}(\widehat{FDR}(t))]}{T_{\alpha}(\widehat{FDR}(t))}\right] = \frac{\alpha}{m} E[V(1)] = \frac{m_0}{m}\alpha\]
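A simulation sketch of this result (the number of tests, the proportion of nulls, and the alternative distribution are all arbitrary choices for illustration):

set.seed(6)
alpha <- 0.2
m <- 1000
m0 <- 800                                    # number of true nulls
fdp <- replicate(1000, {
  # null p-values are uniform; alternatives are one-sided tests of N(3,1) statistics
  p <- c(runif(m0), pnorm(rnorm(m - m0, mean = 3), lower.tail = FALSE))
  reject <- p.adjust(p, method = "BH") <= alpha
  V <- sum(reject[1:m0])                     # false positives
  R <- sum(reject)                           # total rejections
  V / max(R, 1)
})
mean(fdp)    # approximately (m0/m) * alpha = 0.16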
The positive false discovery rate is: \[{\rm pFDR} = E\left[\frac{V}{R} | R > 0\right]\] which can be compared to the FDR: \[{\rm FDR} = E\left[\frac{V}{R} | R > 0\right] P(R > 0)\] The q-value is the pFDR analog of the p-value.
\[\hat{p} = \hat{p}(X) = \inf\{\alpha : X \in S_\alpha\}\] \[\hat{q} = \hat{q}(X) = \inf\{{\rm pFDR}(S) : X \in S\}\]
See the qvalue package in R.
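A minimal sketch of using it, assuming the Bioconductor qvalue package is installed (the input pValues is any vector of p-values, e.g. the one computed in the simulations below):

# install.packages("BiocManager"); BiocManager::install("qvalue")
library(qvalue)
qobj <- qvalue(pValues)
qobj$pi0                       # estimated proportion of true null hypotheses
sum(qobj$qvalues < 0.05)       # tests called significant at an estimated pFDR of 0.05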
Controlling all error rates at \(\alpha = 0.05\) in a simulation with no true positives (\(\beta = 0\) for every test):
set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  # x and y are independent normals, so the true slope is zero for every test
  y <- rnorm(20)
  x <- rnorm(20)
  # p-value for the slope coefficient of x
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
# Controls false positive rate
sum(pValues < 0.05)
[1] 51
# Controls FWER
sum(p.adjust(pValues,method="bonferroni") < 0.05)
[1] 0
# Controls FDR
sum(p.adjust(pValues,method="BH") < 0.05)
[1] 0
Now a simulation in which half of the effects are real:
set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  x <- rnorm(20)
  # First 500 beta=0, last 500 beta=2
  if(i <= 500){ y <- rnorm(20) } else { y <- rnorm(20, mean = 2*x) }
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
trueStatus <- rep(c("zero","not zero"),each=500)
table(pValues < 0.05, trueStatus)
        trueStatus
         not zero zero
  FALSE         0  476
  TRUE        500   24
# Controls FWER
table(p.adjust(pValues,method="bonferroni") < 0.05,trueStatus)
        trueStatus
         not zero zero
  FALSE        23  500
  TRUE        477    0
# Controls FDR
table(p.adjust(pValues,method="BH") < 0.05,trueStatus)
        trueStatus
         not zero zero
  FALSE         0  487
  TRUE        500   13
P-values versus adjusted P-values
par(mfrow=c(1,2))
plot(pValues,p.adjust(pValues,method="bonferroni"),pch=19)
plot(pValues,p.adjust(pValues,method="BH"),pch=19)
Notes:
Further resources: