Bootstrap/Bootstrapping

is used to estimate a parameter η (e.g. the standard error) of the sampling distribution of a statistic 𝜃ˆ via Monte Carlo simulation, when deriving it analytically is too difficult
is a statistical procedure that resamples a single dataset to create many simulated samples. These simulated samples are then used to compute standard errors, build confidence intervals, and run hypothesis tests

Why Use Bootstrap

Example - Computing Standard Error of Statistic

For some statistics, such as the sample mean and the sample proportion, there are analytical formulas for the standard error:

𝑆𝐸(𝑋̅) = 𝑠𝑞𝑟𝑡(𝜎²/𝑛) # standard error
𝑆𝐸ˆ(𝑋̅) = 𝑠𝑞𝑟𝑡(𝑠²/𝑛) # estimated standard error
𝑆𝐸(𝑝̅) = 𝑠𝑞𝑟𝑡[𝑝(1-𝑝)/𝑛] # standard error
𝑆𝐸ˆ(𝑝̅) = 𝑠𝑞𝑟𝑡[𝑝̅(1-𝑝̅)/𝑛] # estimated standard error
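
The two estimated standard errors above can be computed directly from a sample. The following is a minimal sketch using only the Python standard library; the function names and the data are made up for illustration.

```python
import math
import statistics

def se_mean(xs):
    """Estimated standard error of the sample mean: sqrt(s^2 / n)."""
    n = len(xs)
    s2 = statistics.variance(xs)  # sample variance s^2 (divides by n - 1)
    return math.sqrt(s2 / n)

def se_proportion(successes, n):
    """Estimated standard error of the sample proportion: sqrt(p(1 - p) / n)."""
    p_bar = successes / n  # observed sample proportion
    return math.sqrt(p_bar * (1 - p_bar) / n)

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements
print(se_mean(values))
print(se_proportion(30, 100))  # hypothetical: 30 successes in 100 trials
```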

However, for many other statistics (e.g. sample median, sample variance, sample interquartile range, etc.) a closed-form standard error is hard to derive or not available at all.

For example, to compute the variance of a sample median 𝑉𝑎𝑟(𝑀̅) directly from its definition, we would have to:

take all possible samples of size 𝑛 from the population
for each sample compute the sample median 𝑀̅
then calculate their variance: 𝑉𝑎𝑟(𝑀̅) = 𝐄[(𝑀̅ - 𝐄[𝑀̅])²]
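
If the population were known, these steps could be approximated by simulation: draw many independent samples instead of all possible ones. The sketch below assumes, purely for illustration, a standard normal population; `variance_of_median` and `draw_normal` are hypothetical helper names.

```python
import random
import statistics

def variance_of_median(draw_sample, n, num_samples=5000, seed=0):
    """Approximate Var(median) by drawing many fresh samples of size n."""
    rng = random.Random(seed)
    medians = [statistics.median(draw_sample(rng, n)) for _ in range(num_samples)]
    # Variance of the simulated medians approximates E[(M - E[M])^2]
    return statistics.pvariance(medians)

def draw_normal(rng, n):
    """Assumed population for this sketch: standard normal N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

print(variance_of_median(draw_normal, n=25))
```

This only works because the sketch can draw from the population as often as it likes, which is not possible with real data.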

Typically we cannot observe all possible samples, so how can we estimate 𝑉𝑎𝑟(𝑀̅) based on just one sample? We use bootstrap!

Bootstrap Terminology

bootstrap sample - a random sample of the same size as the observed sample 𝑆, drawn with replacement from 𝑆
bootstrap distribution - the distribution of a statistic across a set of bootstrap samples
bootstrap estimator - an estimator that is computed on the basis of bootstrap samples

Bootstrap Algorithm

to estimate parameter η of the distribution of 𝜃ˆ:

first obtain a set of 𝑏 bootstrap samples {𝐵₁, ..., 𝐵_𝑏}. There are 2 ways, either:
- consider all possible bootstrap samples drawn with replacement from the given sample 𝑆 (a sample of size 𝑛 has 𝑛ⁿ ordered resamples, so this is usually intractable)
- generate a large number 𝑏 of random bootstrap samples drawn with replacement from the given sample 𝑆
for each bootstrap sample 𝐵_𝑖 compute statistic 𝜃ˆ*_𝑖 the same way 𝜃ˆ was computed from the original sample 𝑆
estimate the parameter η from this bootstrap distribution {𝜃ˆ*_1, ..., 𝜃ˆ*_𝑏}
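
The steps above can be sketched as follows, using the second way (random bootstrap samples) and taking η to be the standard error of the statistic. This is a minimal illustration using only the Python standard library; `bootstrap_se`, its default `b=2000`, and the data are assumptions, not part of the original notes.

```python
import random
import statistics

def bootstrap_se(sample, statistic, b=2000, seed=0):
    """Bootstrap estimate of the standard error of `statistic`."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(b):
        # Step 1: bootstrap sample B_i, drawn with replacement, same size as S
        resample = rng.choices(sample, k=n)
        # Step 2: compute the statistic on B_i exactly as on the original sample
        replicates.append(statistic(resample))
    # Step 3: eta = standard deviation of the bootstrap distribution
    return statistics.stdev(replicates)

sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.9, 9.3, 10.1, 12.6]  # made-up data
print(bootstrap_se(sample, statistics.median))
```

Passing a different function as `statistic` (for example `statistics.variance`) reuses the same procedure unchanged, which is the main appeal of the method.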

Parametric Bootstrap vs Nonparametric Bootstrap


the population can be approximated in 2 ways:

nonparametric bootstrap - the approximated population 𝑃ˆ is the one observed sample of size 𝑛 taken from the original population 𝑃. This works because 𝑃ˆ is close to the true 𝑃 when the sample size 𝑛 is large
parametric bootstrap - we know the population distribution type (e.g. normal, Poisson, etc.), but the distribution parameters 𝜃 are unknown. The approximated population is the same distribution family with 𝜃 replaced by the point estimates 𝜃ˆ

given a sample of size 𝑛 {𝑋₁, ..., 𝑋_𝑛}, where the 𝑋_𝑖 are drawn i.i.d. from a population with 𝑐𝑑𝑓 𝐹:

Parametric Bootstrap

use parametric bootstrap when the distribution type of the population 𝑐𝑑𝑓 𝐹 (e.g. normal, Poisson, etc.) is KNOWN, but the distribution parameters 𝜃 are unknown

estimated distribution 𝐹ˆ is the same as 𝐹 but with 𝜃 replaced by point estimates 𝜃ˆ

e.g. if 𝐹=𝑁(μ,𝜎²) then 𝐹ˆ=𝑁(𝑋̅,𝑠²)
- 𝑋̅ is computed from sample {𝑋₁, ..., 𝑋_𝑛}
- 𝑠² is computed from sample {𝑋₁, ..., 𝑋_𝑛}

simulate i.i.d. draws {𝑋₁*, ..., 𝑋_𝑛*} from 𝐹ˆ to form each bootstrap sample
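
A minimal sketch of the parametric bootstrap for the normal example above, assuming the population really is normal. The function name `parametric_bootstrap_se` and the data are hypothetical; only the Python standard library is used.

```python
import random
import statistics

def parametric_bootstrap_se(sample, statistic, b=2000, seed=0):
    """Parametric bootstrap standard error, assuming a normal population."""
    rng = random.Random(seed)
    n = len(sample)
    # Fit F-hat = N(x_bar, s^2) using point estimates from the observed sample
    mu_hat = statistics.mean(sample)
    sigma_hat = statistics.stdev(sample)
    replicates = []
    for _ in range(b):
        # Simulate n i.i.d. draws from the fitted distribution F-hat
        simulated = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
        replicates.append(statistic(simulated))
    return statistics.stdev(replicates)

sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.9, 9.3, 10.1, 12.6]  # made-up data
print(parametric_bootstrap_se(sample, statistics.median))
```

The only difference from the nonparametric version is where the bootstrap samples come from: here they are simulated from the fitted model instead of being resampled from the data.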

Nonparametric Bootstrap

use nonparametric bootstrap when the distribution type of the population 𝑐𝑑𝑓 𝐹 (e.g. normal, Poisson, etc.) is UNKNOWN

estimated distribution 𝐹ˆ is the empirical distribution function (empirical cdf):

𝐹ˆ(𝑥) = (1/𝑛) ∑_{1≤𝑖≤𝑛}𝐼(𝑋_𝑖≤𝑥)
where:
- 𝐼(𝑋_𝑖≤𝑥) - is the indicator function, that equals:
  - 1 when sample 𝑋_𝑖 is less than or equal to 𝑥
  - 0 otherwise
- 𝑋_𝑖's are from the given sample {𝑋₁, ..., 𝑋_𝑛}
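
The empirical cdf can be written down directly from its definition. The sketch below uses a hypothetical helper `ecdf` and made-up data. Drawing i.i.d. values from 𝐹ˆ is the same as drawing with replacement from the observed sample, which is why the nonparametric bootstrap resamples the data.

```python
def ecdf(sample):
    """Return F-hat, the empirical cdf of the given sample."""
    n = len(sample)

    def f_hat(x):
        # (1/n) * sum of the indicators I(X_i <= x)
        return sum(1 for x_i in sample if x_i <= x) / n

    return f_hat

f_hat = ecdf([2.0, 3.0, 3.0, 5.0])  # made-up sample
print(f_hat(1.0))  # 0.0  (no observation is <= 1)
print(f_hat(3.0))  # 0.75 (three of four observations are <= 3)
print(f_hat(5.0))  # 1.0  (all observations are <= 5)
```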