Bootstrap/Bootstrapping (Statistics)

Bootstrap/Bootstrapping

  • is used to estimate a parameter η of the distribution of a sample statistic 𝜃ˆ via Monte Carlo simulation, when deriving it analytically is too difficult
  • is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows us to calculate standard errors, construct confidence intervals, and perform hypothesis tests

Why Use Bootstrap

 Example - Computing Standard Error of Statistic

For example, we have analytical formulas for the standard errors of two statistics, the sample mean 𝑋̅ and the sample proportion 𝑝̅:

  • 𝑆𝐸(𝑋̅) = 𝑠𝑞𝑟𝑡(𝜎²/𝑛) # standard error
  • 𝑆𝐸ˆ(𝑋̅) = 𝑠𝑞𝑟𝑡(𝑠²/𝑛) # estimated standard error
  • 𝑆𝐸(𝑝̅) = 𝑠𝑞𝑟𝑡[𝑝(1-𝑝)/𝑛] # standard error
  • 𝑆𝐸ˆ(𝑝̅) = 𝑠𝑞𝑟𝑡[𝑝̅(1-𝑝̅)/𝑛] # estimated standard error
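
As a minimal sketch of the two estimated standard errors above, the following uses only the Python standard library. The function names `se_mean` and `se_proportion` and the data values are made up for illustration; they are not part of the original notes.

```python
import math
import statistics

def se_mean(sample):
    """Estimated standard error of the sample mean: sqrt(s^2 / n)."""
    n = len(sample)
    s2 = statistics.variance(sample)  # sample variance (n - 1 denominator)
    return math.sqrt(s2 / n)

def se_proportion(successes, n):
    """Estimated standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    p_bar = successes / n
    return math.sqrt(p_bar * (1 - p_bar) / n)

# Illustrative, made-up data
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se_x = se_mean(values)           # estimated SE of the mean of `values`
se_p = se_proportion(30, 100)    # 30 successes out of 100 trials
```

Both functions plug sample estimates (𝑠² and 𝑝̅) into the analytical formulas, which is why they give estimated rather than true standard errors.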

However, for many other statistics (e.g. sample median, sample variance, sample interquartile range, etc.) a simple analytical formula for the standard error is often unavailable, or it requires strong assumptions about the population distribution.

For example, to compute the variance of a sample median 𝑉𝑎𝑟(𝑀̅):

  • take all possible samples of size 𝑛 from the population
  • for each sample compute the sample median 𝑀̅
  • then calculate their variance: 𝑉𝑎𝑟(𝑀̅) = 𝐄[(𝑀̅ - 𝐄[𝑀̅])²]

Typically we cannot observe all possible samples, so how can we estimate 𝑉𝑎𝑟(𝑀̅) based on just one sample? We use bootstrap!

Bootstrap Terminology

  • bootstrap sample - is a random sample, of the same size as the observed sample 𝑆, drawn with replacement from 𝑆
  • bootstrap distribution - is the distribution of a statistic across a set of bootstrap samples
  • bootstrap estimator - is an estimator that is computed on the basis of bootstrap samples

Bootstrap Algorithm

To estimate a parameter η of the distribution of 𝜃ˆ:

  • first obtain a set of 𝑏 bootstrap samples {𝐵1, ..., 𝐵𝑏}. There are 2 ways, either:
    • consider all possible bootstrap samples drawn with replacement from the given sample 𝑆 (often intractable)
    • generate a large number 𝑏 of random bootstrap samples drawn with replacement from the given sample 𝑆
  • for each bootstrap sample 𝐵𝑖 compute statistic 𝜃ˆ*𝑖 the same way 𝜃ˆ was computed from the original sample 𝑆
  • estimate the parameter η from this bootstrap distribution {𝜃ˆ*1, ..., 𝜃ˆ*𝑏}
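
The steps above can be sketched as follows, using the second option (a large number of random bootstrap samples). This is a hedged illustration, not a reference implementation: the helper name `bootstrap_estimates`, the choice of 𝑏 = 2000, the fixed seed, and the data are all assumptions. Here the statistic 𝜃ˆ is the sample median and the parameter η is its standard error.

```python
import random
import statistics

def bootstrap_estimates(sample, statistic, b=2000, seed=0):
    """Return the statistic computed on b random bootstrap samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(b):
        # Bootstrap sample: n values drawn with replacement from the sample
        resample = rng.choices(sample, k=n)
        # Compute the statistic the same way as on the original sample
        estimates.append(statistic(resample))
    return estimates

# Illustrative, made-up sample S
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.9, 9.3, 10.1, 12.6]

medians = bootstrap_estimates(sample, statistics.median)
# Estimate eta (here: the standard error of the median) from the
# bootstrap distribution of the statistic
se_median = statistics.stdev(medians)
```

Any other statistic (for example `statistics.variance`) can be passed in place of `statistics.median` without changing the resampling loop.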

Parametric Bootstrap vs Nonparametric Bootstrap

The population can be approximated in 2 ways:

  • nonparametric bootstrap - the approximated population 𝑃ˆ is the one observed sample of size 𝑛 taken from the original population 𝑃. This works because 𝑃ˆ is close to the true 𝑃 when the sample size 𝑛 is large
  • parametric bootstrap - we know the population distribution type (e.g. normal, Poisson, etc.), but the distribution parameters 𝜃 are unknown. The approximated population has the same distributional form, with 𝜃 replaced by the point estimates 𝜃ˆ

Given a sample of size 𝑛 {𝑋1, ..., 𝑋𝑛}, where the 𝑋𝑖 are drawn i.i.d. from a population with 𝑐𝑑𝑓 𝐹:

Parametric Bootstrap

use parametric bootstrap when the distribution type of the population 𝑐𝑑𝑓 𝐹 (e.g. normal, Poisson, etc.) is KNOWN, but the distribution parameters 𝜃 are unknown

estimated distribution 𝐹ˆ is the same as 𝐹 but with 𝜃 replaced by point estimates 𝜃ˆ

  • e.g. if 𝐹=𝑁(μ,𝜎²) then 𝐹ˆ=𝑁(𝑋̅,𝑠²)
    • 𝑋̅ is the sample mean computed from sample {𝑋1, ..., 𝑋𝑛}
    • 𝑠² is the sample variance computed from sample {𝑋1, ..., 𝑋𝑛}

simulate i.i.d. draws {𝑋1*, ..., 𝑋𝑛*} from 𝐹ˆ
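
A minimal sketch of the parametric bootstrap for the normal example above, assuming the population really is normal. The function name `parametric_bootstrap_normal`, the number of replications, the seed, and the data are illustrative assumptions. Each replication draws 𝑛 values from 𝐹ˆ = 𝑁(𝑋̅, 𝑠²) instead of resampling the observed values.

```python
import random
import statistics

def parametric_bootstrap_normal(sample, statistic, b=2000, seed=0):
    """Parametric bootstrap under an assumed normal population."""
    rng = random.Random(seed)
    n = len(sample)
    x_bar = statistics.mean(sample)  # point estimate of mu
    s = statistics.stdev(sample)     # square root of the sample variance s^2
    estimates = []
    for _ in range(b):
        # Simulate n i.i.d. draws from the fitted distribution N(x_bar, s^2)
        draws = [rng.gauss(x_bar, s) for _ in range(n)]
        estimates.append(statistic(draws))
    return estimates

# Illustrative, made-up sample
sample = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.9, 9.3, 10.1, 12.6]

param_medians = parametric_bootstrap_normal(sample, statistics.median)
se_median_param = statistics.stdev(param_medians)
```

Unlike resampling with replacement, the simulated draws can take values that never appear in the observed sample, so the result depends on the normality assumption being reasonable.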

Nonparametric Bootstrap

use nonparametric bootstrap when the distribution type of the population 𝑐𝑑𝑓 𝐹 (e.g. normal, Poisson, etc.) is UNKNOWN

estimated distribution 𝐹ˆ is empirical distribution (cdf), where:

  • 𝐹ˆ(𝑥) = (1/𝑛) ∑_{1≤𝑖≤𝑛} 𝐼(𝑋𝑖 ≤ 𝑥)
  • where:
    • 𝐼(𝑋𝑖≤𝑥) - is the indicator function, that equals:
      • 1 when sample 𝑋𝑖 is less than or equal to 𝑥
      • 0 otherwise
    • 𝑋𝑖's are from the given sample {𝑋1, ..., 𝑋𝑛}

simulate i.i.d. draws {𝑋1*, ..., 𝑋𝑛*} from 𝐹ˆ
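
As a small illustration, the empirical cdf and a draw from it might look like this. The helper name `ecdf` and the data are assumptions for the sketch. Because 𝐹ˆ places probability 1/𝑛 on each observed value, simulating i.i.d. draws from 𝐹ˆ amounts to sampling with replacement from the observed sample.

```python
import random

def ecdf(sample):
    """Return the empirical cdf F_hat of the given sample."""
    n = len(sample)
    def f_hat(x):
        # (1/n) * sum of indicators I(X_i <= x)
        return sum(1 for xi in sample if xi <= x) / n
    return f_hat

# Illustrative, made-up sample
sample = [2.0, 3.5, 3.5, 5.0, 8.0]
f_hat = ecdf(sample)

below_all = f_hat(1.0)   # no observation is <= 1.0
at_tie = f_hat(3.5)      # three of five observations are <= 3.5
at_max = f_hat(8.0)      # every observation is <= 8.0

# i.i.d. draws from F_hat: sample n values with replacement
rng = random.Random(0)
resample = rng.choices(sample, k=len(sample))
```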
