Science is run by humans. Science is also not one thing, but a lot of things. Keeping this in mind can save you a lot of headache.
Jeffrey Leek
Johns Hopkins Bloomberg School of Public Health
Bootstrapping is a computational procedure for approximating the sampling distribution of an estimator, in particular for obtaining standard errors, bias estimates, and confidence intervals when no convenient closed form is available.
The basic idea behind bootstrapping is to treat the sample of data you observe (or some appropriate function of the data) as if it were the super-population you were sampling from. Then you resample from the observed set of values to approximate the sampling variation in the whole population.
The idea of the bootstrap was originally proposed by Efron in 1979
Related ideas are very old by the standards of statistics (Quenouille, 1956, Notes on bias in estimation; Tukey, 1958, Bias and confidence in not-quite large samples)


The plug-in principle states that if we have a parameter \(\theta = t(F)\), then we estimate the parameter by applying the same functional to an estimate of the distribution function, \(\hat{\theta} = t(\hat{F}_n)\) (although other estimates of \(F\) can also be used). The default \(\hat{F}_n = {\mathbb F}_n\) is the empirical distribution \[ {\mathbb F}_n(y) = \frac{1}{n} \sum_{i=1}^n 1(Y_i \leq y)\] A sample \(Y_i^*\) from \({\mathbb F}_n\) has the property that \(Y_i^* = Y_j\) with probability \(1/n\) for \(1 \leq j \leq n\). Why \({\mathbb F}_n\)?
Suppose \(Y_i > 0\) are i.i.d. from \(F\). We might be interested in: \[ \theta = {\rm E}_F \log(Y)\] The plug-in estimate is \(\hat{\theta} = {\rm E}_{{\mathbb F}_n}[\log(Y^*)] = \frac{1}{n}\sum_{i=1}^n \log(Y_i)\).
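To make the plug-in idea concrete, here is a minimal numpy sketch (the data values are made up for illustration): it computes the plug-in estimate of \({\rm E}_F \log(Y)\) by averaging \(\log(Y_i)\) under the empirical distribution, and shows that drawing from \({\mathbb F}_n\) is just resampling the observed values with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([2.1, 0.7, 5.3, 1.8, 3.2, 0.9, 4.4])  # hypothetical positive observations

# Plug-in estimate of theta = E_F[log(Y)]: take the expectation under F_n,
# i.e. the sample average of log(Y_i).
theta_hat = np.mean(np.log(y))

# Sampling from F_n: each draw equals Y_j with probability 1/n,
# which is exactly resampling the observed values with replacement.
y_star = rng.choice(y, size=len(y), replace=True)

print(theta_hat, y_star)
```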
Let \(Y\) be a binary variable and suppose \[\theta(F) = Pr(Y = 1) = {\rm E}_F[1(Y_i = 1)]\] We can find the estimate of \(\theta(F)\) using the plug-in principle: \[\hat{\theta} = {\rm E}_{{\mathbb F}_n}[1(Y_i^* = 1)] = \bar{Y}\] Suppose we wanted an estimate for the variance \[{\rm Var}(\hat{\theta}) = {\rm Var}(\bar{Y}) = {\rm Var}(Y_i)/n\]
We could use the plug-in estimator \[{\rm Var}_{{\mathbb F}_n}(Y_i)/n = {\rm E}_{{\mathbb F}_n}[(Y_i^* - \bar{Y})^2]/n = \frac{\bar{Y}(1-\bar{Y})}{n}\]
\[{\rm Var}_{{\mathbb F}_n}(Y_i) = {\rm E}_{{\mathbb F}_n}[(Y_i^* - \bar{Y})^2]\] \[= \sum_{j=1}^n \frac{1}{n}(Y_j - \bar{Y})^2\] \[=\frac{1}{n}\left[(1-\bar{Y})^2 \sum_{j=1}^n Y_j + \bar{Y}^2 \sum_{j=1}^n(1- Y_j)\right]\] \[=\frac{1}{n} \left[\sum_{j=1}^n Y_j - 2\bar{Y} \sum_{j=1}^n Y_j + n\bar{Y}^2\right]\] \[= \bar{Y} - 2\bar{Y}^2 + \bar{Y}^2 = \bar{Y}(1-\bar{Y})\] Dividing by \(n\) gives the plug-in estimate \({\rm Var}_{{\mathbb F}_n}(Y_i)/n = \bar{Y}(1-\bar{Y})/n\).
When evaluating \({\rm E}_{{\mathbb F}_n}\), \(\bar{Y}\) is a ``parameter'' and treated as fixed.
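A quick numerical check of this identity, with a made-up binary sample: the exact expectation under \({\mathbb F}_n\) is just an average over the observed values, and it matches \(\bar{Y}(1-\bar{Y})\).

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1])  # hypothetical binary sample
ybar = y.mean()

# Exact expectation under the empirical distribution: each Y_j has probability 1/n,
# so E_{F_n}[(Y* - Ybar)^2] is simply the average squared deviation.
exact = np.mean((y - ybar) ** 2)

print(exact, ybar * (1 - ybar))  # agree (up to floating point)
print(exact / len(y))            # plug-in estimate of Var(Ybar)
```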
Usually, no closed form evaluation will exist (we sort of "got lucky" in the previous examples). How did we get lucky? The plug-in estimate ended up being an expectation of a single random variable \(Y_i^'\).
What if we were unlucky and the plug-in estimate was a function of all of the values \(Y_1^{*},\ldots,Y_n^{*}\)?
For example, suppose we wanted to estimate the variance of the sample median \(\hat{\theta} = {\mathbb F}_n^{-1}(1/2)\)?
In this case, the variance \({\rm Var}_{{\mathbb F}_n}(\hat{\theta})\) is an expectation of a function of \(Y_1^*,\ldots,Y_n^*\).
It isn't clear there is a pretty formula for the variance of the median, but if we let \(X_j\) denote the number of times \(Y_j\) occurs in a bootstrap sample, then \((X_1,\ldots,X_n) \sim Mult(n; 1/n,\ldots, 1/n)\), so: \[{\rm Var}_{{\mathbb F}_n}(\hat{\theta}) = \sum_{Y^*\in \mathcal{S}} \left\{{\mathbb F}_n^{*-1}(1/2) - {\rm E}_{{\mathbb F}_n}[{\mathbb F}_n^{*-1}(1/2)]\right\}^2 \frac{n!}{\prod_{i=1}^n x_i!} (1/n)^n\] where \({\mathbb F}_n^{*-1}(1/2)\) is the sample median of each bootstrap sample, \(x_j\) is the number of times \(Y_j\) appears in that sample, and \(\mathcal{S}\) is the set of all unique bootstrap samples from \(Y_1,\ldots,Y_n\).
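For a very small \(n\) this ideal bootstrap variance can be computed exactly by brute force. The sketch below (made-up data, \(n = 5\)) enumerates all \(n^n\) equally likely index vectors rather than the unique samples weighted by multinomial probabilities; the two formulations are equivalent.

```python
import numpy as np
from itertools import product

y = np.array([3.1, 7.4, 2.2, 9.8, 5.0])   # hypothetical sample, kept tiny on purpose
n = len(y)

# Enumerate every possible bootstrap sample as an index vector in {0,...,n-1}^n.
# Each vector has probability (1/n)^n, which reproduces the multinomial weighting
# over unique samples.
medians = np.array([np.median(y[list(idx)]) for idx in product(range(n), repeat=n)])

ideal_var = medians.var()   # exact ("ideal bootstrap") variance of the sample median
print(ideal_var)
```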
Most of the time, you'll use the bootstrap for parameters where a closed form doesn't necessarily exist. Instead we use a Monte Carlo approach. For the variance of the sample median you would: (1) draw a bootstrap sample \(Y_1^*,\ldots,Y_n^*\) with replacement from \(Y_1,\ldots,Y_n\); (2) compute the median of the bootstrap sample; (3) repeat steps 1 and 2 \(b\) times; (4) take the sample variance of the \(b\) bootstrap medians (a sketch follows below).
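A minimal Monte Carlo version in numpy, again with made-up data (the function name bootstrap_var_median is just for illustration):

```python
import numpy as np

def bootstrap_var_median(y, b=10_000, rng=None):
    """Monte Carlo bootstrap estimate of Var(sample median)."""
    rng = rng or np.random.default_rng()
    n = len(y)
    # Steps 1-3: resample with replacement b times and compute each bootstrap median.
    medians = np.array([np.median(rng.choice(y, size=n, replace=True)) for _ in range(b)])
    # Step 4: the sample variance of the bootstrap medians approximates the ideal bootstrap variance.
    return medians.var(ddof=1)

y = np.array([3.1, 7.4, 2.2, 9.8, 5.0, 6.3, 4.7])  # hypothetical data
print(bootstrap_var_median(y, b=5000, rng=np.random.default_rng(1)))
```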
As \(b \rightarrow \infty\) you get closer and closer to the exact or ``ideal bootstrap''.
The general form for calculating bootstrap standard errors is similar: resample the data with replacement, recompute the statistic \(\hat{\theta}^*\) on each of \(b\) bootstrap samples, and use the standard deviation of the \(b\) values \(\hat{\theta}^*_1,\ldots,\hat{\theta}^*_b\) as the standard error estimate.
For a confidence interval \([a(Y), b(Y)]\) based on a sample of size \(n\), we may want to estimate the coverage \[{\rm E}_F 1_{[a < \theta < b]} = Pr(a < \theta < b \mid F)\] The bootstrap estimate is \[{\rm E}_{{\mathbb F}_n} 1_{[a < \hat{\theta} < b]} = Pr(a < \hat{\theta} < b \mid {\mathbb F}_n)\] The Monte Carlo version of this calculation is: draw \(B\) bootstrap samples \(Y^*\) from \({\mathbb F}_n\), compute the interval \([a(Y^*), b(Y^*)]\) for each, and record the proportion of intervals that contain \(\hat{\theta}\).
Note that this requires two levels of calculation: the interval \([a(Y^*), b(Y^*)]\) must be recomputed on every bootstrap sample, and coverage is then checked against \(\hat{\theta}\), which plays the role of the true parameter under \({\mathbb F}_n\) (see the sketch below).
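A hedged sketch of the coverage calculation, using a standard normal-theory interval for a mean as the hypothetical interval \([a(Y), b(Y)]\) and simulated data in place of a real sample:

```python
import numpy as np

def interval(y):
    """Hypothetical interval [a(Y), b(Y)]: normal-theory 95% CI for the mean."""
    n = len(y)
    se = y.std(ddof=1) / np.sqrt(n)
    return y.mean() - 1.96 * se, y.mean() + 1.96 * se

def bootstrap_coverage(y, b=2000, rng=None):
    rng = rng or np.random.default_rng()
    theta_hat = y.mean()               # plays the role of the "true" parameter under F_n
    hits = 0
    for _ in range(b):
        y_star = rng.choice(y, size=len(y), replace=True)
        a, bnd = interval(y_star)      # level 1: recompute the interval on the bootstrap sample
        hits += (a < theta_hat < bnd)  # level 2: check coverage of theta_hat
    return hits / b

y = np.random.default_rng(2).exponential(scale=3.0, size=30)  # skewed toy data
print(bootstrap_coverage(y))
```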
Think about these calculations in very frequentist terms: we are studying the long-run behavior of a procedure under repeated sampling, with \({\mathbb F}_n\) standing in for \(F\).
In regression problems we resampled \(Y\) and \(X\) together, i.e. whole rows of the dataset (case or pairs resampling); a sketch follows below.
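A minimal sketch of this row (pairs) resampling for a simple linear regression, with simulated data standing in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=n)   # simulated (x, y) pairs

def fit_slope(x, y):
    # Ordinary least squares slope for a one-covariate model with intercept.
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b = 2000
slopes = np.empty(b)
for r in range(b):
    idx = rng.integers(0, n, size=n)     # resample whole rows of the dataset
    slopes[r] = fit_slope(x[idx], y[idx])

print(slopes.std(ddof=1))                # pairs-bootstrap standard error of the slope
```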
The score contribution for the \(i^{th}\) subject for \(\beta_j\) is \(U_{ij} = \frac{\partial \ell_i}{\partial \beta_j}\), so \(\sum_{i} \frac{\partial \ell_i}{\partial \beta} = U^T_\beta \cdot 1\), where \(1\) is an \(n \times 1\) vector of ones. Note that \(\hat{\beta}\) satisfies \(U^T_{\hat{\beta}} \cdot 1 = 0\).
Let \(\frac{\partial^2 \ell}{\partial \beta \partial \beta^T} = -A\). The score evaluated at \(\beta = \hat{\beta}\) in the bootstrap sample is \[U^T_{\hat{\beta}} \cdot W \neq 0\] where \(W\) is a vector of random resampling weights (the bootstrap counts).
The one-step approximation to \(\hat{\beta}_W^*\), the ML estimate in the bootstrap sample, starting from \(\hat{\beta}\), is given by \[(\hat{\beta}_W^* - \hat{\beta}) \approx (\hat{\beta}^*_{one} - \hat{\beta}) \approx \hat{A}^{-1} U^T_{\hat{\beta}} W\]
Now \({\rm E}[W] = 1\) and \({\rm Var}(W) = I - \frac{1}{n} 1 \cdot 1^T\)
Thus the approximate bootstrap mean is \[ {\rm E}_W(\hat{\beta}^*_{one} - \hat{\beta}) \approx \hat{A}^{-1} U^T_{\hat{\beta}} {\rm E}(W) = 0\] and the approximate variance is:
\[{\rm E}_W\left[(\hat{\beta}^*_{one} - \hat{\beta})(\hat{\beta}^*_{one} - \hat{\beta})^T\right] \approx \hat{A}^{-1} U_{\hat{\beta}}^T\, {\rm Var}(W)\, U_{\hat{\beta}} \hat{A}^{-1} = \hat{A}^{-1} \hat{B} \hat{A}^{-1}\]
where \(\hat{B} = \sum_{i=1}^n \hat{U}_i \hat{U}_i^T\); the \(\frac{1}{n} 1 \cdot 1^T\) term in \({\rm Var}(W)\) drops out because \(U^T_{\hat{\beta}} \cdot 1 = 0\). This is the familiar sandwich (robust) variance form.
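As a concrete instance of the \(\hat{A}^{-1}\hat{B}\hat{A}^{-1}\) form (an illustration added here, not part of the derivation above): for ordinary least squares the sandwich reduces to the HC0 robust variance, and it closely tracks the pairs-bootstrap variance of the coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 5, size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 - 1.0 * x + rng.normal(scale=1 + 0.4 * x, size=n)   # heteroskedastic toy data

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat

# Sandwich A^{-1} B A^{-1}: for least squares, A is proportional to X'X and
# B to sum_i e_i^2 x_i x_i', giving the HC0 robust variance estimate.
A_inv = np.linalg.inv(X.T @ X)
B = X.T @ (X * e[:, None] ** 2)
sandwich = A_inv @ B @ A_inv

# Pairs (row-resampling) bootstrap variance for comparison.
boots = np.empty((2000, 2))
for r in range(2000):
    idx = rng.integers(0, n, size=n)
    boots[r] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

print(np.sqrt(np.diag(sandwich)))   # sandwich standard errors
print(boots.std(axis=0, ddof=1))    # bootstrap standard errors
```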
Suppose we believe the assumed model \(F(\theta)\). Usually the likelihood-based approaches give adequate intervals for the parameters, but we may be interested in analytically difficult functions of \(F\), e.g. medians or ranks, for which, for example, taking derivatives may be awkward.
The parametric bootstrap is a useful approach; we draw many replicates \(Y^*\) under \(\hat{F}_n = F(\hat{\theta})\) and then look at the long-run, frequentist behavior of \(g(Y^*)\). Typically, \(g(\cdot)\) is some form of estimator.
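A minimal parametric-bootstrap sketch under an assumed exponential model (the model and data here are hypothetical): fit \(\hat{\theta}\) by maximum likelihood, simulate replicate datasets from \(F(\hat{\theta})\), and study the frequentist behavior of \(g(Y^*)\), here the sample median.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.exponential(scale=3.0, size=40)        # pretend this is the observed data

# Fit the assumed parametric model F(theta): exponential with MLE scale = ybar.
scale_hat = y.mean()

b = 5000
g_star = np.empty(b)
for r in range(b):
    y_star = rng.exponential(scale=scale_hat, size=len(y))  # replicate under F(theta_hat)
    g_star[r] = np.median(y_star)                           # g(Y*): the sample median

# Long-run behavior of g(Y*) under the fitted model, e.g. its standard error.
print(np.median(y), g_star.std(ddof=1))
```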
For resampling residuals, \(\epsilon_i^*\) is drawn with replacement from the centered residuals \(e_i - \bar{e}\); the second term centers the \(e_i\) (the centering should be minor if \({\rm E}[\epsilon_i] = 0\)).
If we assumed that \(\epsilon_i \sim_{ind} N(0,\sigma^2)\), then a parametric resampling-residuals method samples from \[\epsilon_i^* \sim_{ind} N(0, {\rm Var}\{e_1,e_2,\ldots,e_n\})\]
Both methods respect the ``design'' of the \(x_i\). Both methods break any relationship between \(x_i\) and \(\epsilon_i\), so they assume homoskedasticity. This is a big assumption, often violated by real data; it is violated by assumption if you believe \(Y_i \sim Pois(\lambda_i)\), since then \({\rm Var}(Y_i) = \lambda_i\) varies with the mean.
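A sketch of residual resampling for a linear model (simulated data), showing both the nonparametric version, which resamples the centered residuals, and the parametric version, which draws normal errors with the residual variance; both keep the \(x_i\) fixed across replicates.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=n)   # simulated homoskedastic data

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta_hat
e = y - fitted

b = 2000
slopes_np, slopes_p = np.empty(b), np.empty(b)
for r in range(b):
    # Nonparametric residual resampling: draw from the centered residuals.
    eps_np = rng.choice(e - e.mean(), size=n, replace=True)
    # Parametric residual resampling: draw N(0, Var(e)) errors.
    eps_p = rng.normal(scale=e.std(ddof=1), size=n)
    slopes_np[r] = np.linalg.lstsq(X, fitted + eps_np, rcond=None)[0][1]
    slopes_p[r] = np.linalg.lstsq(X, fitted + eps_p, rcond=None)[0][1]

# Both keep the x_i (the "design") fixed across bootstrap replicates.
print(slopes_np.std(ddof=1), slopes_p.std(ddof=1))
```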
The jackknife, invented by Tukey and Quenouille, is closely related to the bootstrap. The jackknife is a leave-one-out procedure. Let \[Y_{(i)} = \{Y_1,Y_2,\ldots, Y_{i-1},Y_{i+1},\ldots, Y_n\}\]
Similarly, let \(\hat{\theta}_{(i)}\) be the estimate of the parameter based on \(Y_{(i)}\). Then the jackknife variance estimate is: \[\hat{{\rm Var}}_{jack} = \frac{n-1}{n} \sum_{i=1}^n \left(\hat{\theta}_{(i)} - \bar{\hat{\theta}}_{(\cdot)}\right)^2\] where \(\bar{\hat{\theta}}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n \hat{\theta}_{(i)}\).
The reason for the \(n-1\) factor is that the jackknife deviations are very close to each other (since we only leave out one observation at a time), so they need to be inflated (see the sketch below).
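A generic jackknife variance function in numpy (illustrative, with made-up data and the sample mean as the statistic; for the mean the jackknife reproduces the usual \(s^2/n\)):

```python
import numpy as np

def jackknife_var(y, stat):
    """Jackknife variance estimate: ((n-1)/n) * sum_i (theta_(i) - theta_bar)^2."""
    n = len(y)
    theta_loo = np.array([stat(np.delete(y, i)) for i in range(n)])  # leave-one-out estimates
    return (n - 1) / n * np.sum((theta_loo - theta_loo.mean()) ** 2)

y = np.array([4.2, 1.1, 3.7, 8.0, 2.5, 6.6, 5.3, 7.1])   # hypothetical data

# For the sample mean, the jackknife reproduces the usual estimate s^2 / n.
print(jackknife_var(y, np.mean), y.var(ddof=1) / len(y))
```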
The jackknife can be thought of as a linear approximation to the bootstrap (see Efron 1979 or Efron and Tibshirani). For linear statistics: \[\hat{\theta} = \mu + \frac{1}{n} \sum_{i=1}^n \alpha(x_i)\] they will agree up to a constant.
The median, however, is not a smooth (linear) function of the data. This lack of smoothness causes the jackknife estimate of the standard error to be inconsistent for the median. For the mouse data, the jackknife values for the median are: \[\{48,48,48,48,45,43,43,43\}\]
There are only 3 distinct values, a consequence of the lack of smoothness of the median and the fact that the jackknife data sets differ by only one observation. The estimated jackknife variance is 44.62. A bootstrap estimate of the standard error based on \(B=100\) bootstrap samples is 91.72, considerably larger.
This result, and the validity of bootstrap methods, also holds for the analysis of vector outcomes; by studying an empirical, data-based estimate of \(F\), we learn about, e.g., \({\rm Var}_F[\hat{\beta}_n]\).
Two competing strategies for estimating \(F\) are:
Assume \[Y_{ij} = b_i + Z_{ij}, \quad 1 \leq i \leq n,\; 1 \leq j \leq n_i\] where \(n_i\) is constant for all \(i\), and all \(b_i\) and \(Z_{ij}\) are independent with \({\rm E}[b_i] = 0\), \({\rm E}[Z_{ij}] = 0\), \({\rm Var}[b_i] = \sigma^2_b\), \({\rm Var}[Z_{ij}] = \sigma^2_Z\). It follows that:
\[ {\rm Var}[Y_{ij}] = \sigma_b^2 + \sigma^2_Z\] and, for \(j \neq j'\), \[{\rm Cov}[Y_{ij},Y_{ij'}] = {\rm Cov}[b_i + Z_{ij}, b_i + Z_{ij'}] = {\rm Var}[b_i] + {\rm Cov}[Z_{ij},Z_{ij'}] = \sigma^2_b\]
Note the intraclass correlation is \(\frac{\sigma^2_b}{\sigma^2_b + \sigma^2_Z}\); it can be interpreted as the correlation between two outcomes in the same cluster, or the proportion of total variance that is between clusters.
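A quick simulation check of these moment formulas, with hypothetical values \(\sigma_b^2 = 4\) and \(\sigma_Z^2 = 1\):

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_i = 500, 4                      # n clusters, n_i observations per cluster
sigma_b, sigma_z = 2.0, 1.0          # so sigma_b^2 = 4, sigma_Z^2 = 1

b = rng.normal(scale=sigma_b, size=(n, 1))     # cluster effects b_i
Z = rng.normal(scale=sigma_z, size=(n, n_i))   # within-cluster noise Z_ij
Y = b + Z                                      # Y_ij = b_i + Z_ij

var_hat = Y.var(ddof=1)                        # should be near sigma_b^2 + sigma_Z^2 = 5
cov_hat = np.cov(Y[:, 0], Y[:, 1])[0, 1]       # should be near sigma_b^2 = 4
icc_hat = cov_hat / var_hat                    # intraclass correlation, near 4/5

print(var_hat, cov_hat, icc_hat)
```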
Denoting the bootstrap distribution by \(F^*\), and denoting the randomly chosen cluster index by \(I^*\), under either (S1) or (S2):
\[{\rm E}_{F^*}[Y_{I^*j} \mid I^* = i] = n_i^{-1} \sum_l Y_{il} = \bar{Y}_i\] \[{\rm E}_{F^*}[Y_{I^*j}^2 \mid I^* = i] = n_i^{-1} \sum_l Y_{il}^2\]
and consequently:
\[{\rm E}_{F^*}[Y^*_{\cdot j}] = n^{-1} n_i^{-1} \sum_{ij} Y_{ij} = \bar{Y}\] \[{\rm Var}_{F^*}[Y^*_{\cdot j}] = n^{-1} \sum_{i} (\bar{Y}_i - \bar{Y})^2 + n^{-1} \sum_i n_i^{-1} \sum_j (Y_{ij} - \bar{Y}_i)^2\]
The expectation of the resampled outcomes is unbiased for \({\rm E}_{F}[Y_{\cdot j}]\). Slightly more work gives: \[{\rm E}\left[{\rm Var}_{F^*}[Y^*_{\cdot j}]\right] = \frac{n-1}{n}\sigma^2_b + \frac{nn_i -1}{nn_i}\sigma^2_Z\]
For the cross-terms with \(j\neq k\) we get: \[{\rm E}_{F^*}[Y_{I^*j}Y_{I^*k} \mid I^* = i] = \frac{1}{n_i(n_i-1)}\sum_{l \neq m} Y_{il}Y_{im}\] \[{\rm E}_{F^*}[Y_{I^*j}Y_{I^*k} \mid I^* = i] = \frac{1}{n_i^2}\sum_{l \neq m} Y_{il}Y_{im}\] for models (S1) and (S2), respectively. Given a particular data set, the bootstrap(s) give \[{\rm Cov}_{F^*}[Y_{ij},Y_{ik}] = n^{-1} \sum_i(\bar{Y}_i - \bar{Y})^2 - \frac{1}{nn_i(n_i -1)}\sum_{ij} (Y_{ij} - \bar{Y}_i)^2\] \[{\rm Cov}_{F^*}[Y_{ij},Y_{ik}]=\frac{1}{n} \sum_i (\bar{Y}_i - \bar{Y})^2\]
for models (S1) and (S2), respectively, and this leads to: \[ {\rm E}\left[{\rm Cov}_{F^*}[Y_{ij},Y_{ik}]\right] = \frac{n-1}{n}\sigma^2_b - \frac{1}{nn_i} \sigma^2_Z\] \[ {\rm E}\left[{\rm Cov}_{F^*}[Y_{ij},Y_{ik}]\right] = \frac{n-1}{n}\sigma^2_b - \frac{n-1}{nn_i} \sigma^2_Z\] again for (S1) and (S2), respectively.
Recall that this covariance is just \(\sigma^2_b\) in the true F; which formula above is the closest to that value?