    1. Introduction

    Recently, there has been considerable promotion of so-called ‘bootstrap’ methods for estimating intervals. School students in the US are now routinely taught these methods.

    Hesterberg et al. (2010: 16-4) summarise the basic idea thus:

    ‘THE BOOTSTRAP IDEA 
    The original sample is representative of the population from which it was drawn. Thus, resamples from this original sample represent what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on the resamples, represents the sampling distribution of the statistic.’

    Note the use of the term ‘representative’ here. This is a problematic term, but in essence the claim is that the sample is a good approximation of the population.

    These methods make three assumptions:

    • Assumption 1. The sample is a good approximation of the population, and therefore a process of resampling from the sample with replacement (the bootstrap method) obtains a good enough estimate of the distribution of a mean (or other property) of the samples we would expect to draw from the population.
    • Assumption 2. A confidence interval for a sampled property can be modelled by a population interval predicting sampled properties.
    • Assumption 3. The resulting interval will be symmetric and follow the Normal distribution, hence we can use a Gaussian standard error from the bootstrapped samples.
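The procedure that rests on these three assumptions can be sketched in a few lines of Python. This is a minimal illustration, not Hesterberg et al.'s own code: the function name `bootstrap_normal_interval`, the number of resamples, the fixed seed and the example data are all assumptions made for the sketch.

```python
import random
import statistics


def bootstrap_normal_interval(sample, n_resamples=2000, z=1.959964, seed=0):
    """Gaussian standard-error bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    n = len(sample)

    # Assumption 1: treat the sample as if it were the population,
    # and resample from it with replacement.
    resample_means = [
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    ]

    # Assumptions 2 and 3: centre a symmetric Normal interval on the
    # observed mean, using the spread of the resampled means as the
    # standard error.
    centre = statistics.fmean(sample)
    se = statistics.stdev(resample_means)
    return centre - z * se, centre + z * se


# Hypothetical data: 6 'successes' out of 20 observations.
sample = [1] * 6 + [0] * 14
lo, hi = bootstrap_normal_interval(sample)
```

Whatever the data, the interval returned is symmetric about the observed mean by construction, which is exactly what Assumption 3 asserts.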

    These assumptions are invalid.

    Assumption 1. The sample is a good approximation of the population

    Assumption 1 is circular. The entire point of inferential statistics is to draw an inference from an observed sample to an unobserved and unknowable population. We do not know whether the sample is a good approximation of the population for the property in question. We only know it is a sample, and that other samples are likely to differ from it.

    If a sample is unrepresentative of the population for the desired parameter, then resampling from it will mislead us. 
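An extreme case makes the point concrete. The sketch below assumes a hypothetical population proportion of 0.05 and a sample of 20 observations that happens to contain no positive cases at all; these values are illustrative assumptions, not figures from the source.

```python
import random
import statistics

P = 0.05          # hypothetical true population proportion (assumption)
n = 20            # hypothetical sample size (assumption)
sample = [0] * n  # a sample that happens to contain no positive cases

# Chance of drawing such an all-zero sample from this population.
prob_all_zero = (1 - P) ** n

# Resampling with replacement from an all-zero sample can only ever
# return zeros, so every resampled mean is 0.
rng = random.Random(1)
resample_means = [
    statistics.fmean(rng.choices(sample, k=n)) for _ in range(1000)
]
bootstrap_se = statistics.pstdev(resample_means)
```

Under these assumed values an all-zero sample turns up in roughly a third of draws, yet the bootstrap distribution it produces has no spread at all. The resulting 'interval' collapses to the single point 0 and excludes the true value.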

    Assumption 2. A confidence interval can be modelled by a population interval

    Standard bootstrap methods make the ‘Wald’ error of confusing two distinct things:

    1. the distribution of sample means given a population mean, and
    2. the distribution of population means given a sample mean.

    Wilson (1927) was one of the first statisticians to point out that Gaussian (‘Normal’) methods for Binomial data (the ‘Wald’ interval) confused these two distributions. The Wilson score interval for p is the inverse of the Gaussian population interval for a population mean P. It is not the same thing.
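The difference between the two intervals, and the sense in which one is the inverse of the other, can be checked numerically. The sketch below uses the standard textbook formulae for the Wald and Wilson score intervals without continuity correction. The function names and the example values (p = 0.1, n = 20) are assumptions chosen for illustration.

```python
from math import sqrt

Z = 1.959964  # two-tailed critical value of the Normal distribution, 95% level


def gaussian_population_interval(P, n, z=Z):
    """Range of sample proportions expected for a given population value P."""
    se = sqrt(P * (1 - P) / n)
    return P - z * se, P + z * se


def wald_interval(p, n, z=Z):
    """The 'Wald' interval: the population formula applied to the sample p."""
    return gaussian_population_interval(p, n, z)


def wilson_interval(p, n, z=Z):
    """Wilson score interval for an observed proportion p."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half


p, n = 0.1, 20  # hypothetical observation: 2 cases out of 20
wald_lo, wald_hi = wald_interval(p, n)
w_lo, w_hi = wilson_interval(p, n)

# Inverse relationship: a population at the Wilson lower bound has p as
# the upper bound of its Gaussian interval, and a population at the
# Wilson upper bound has p as the lower bound of its Gaussian interval.
_, upper_at_w_lo = gaussian_population_interval(w_lo, n)
lower_at_w_hi, _ = gaussian_population_interval(w_hi, n)
```

With these values the Wald lower bound falls below zero, which is impossible for a proportion. The Wilson bounds stay inside the range 0 to 1 and are asymmetric about p.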
