But Does It Work?
The bootstrap’s growing popularity derives partly from hope; its actual performance sometimes disappoints. Monte Carlo
simulation provides one way to evaluate bootstrapping objectively. The simulation generates samples according to a known
(user-designed) model; we then apply bootstrapping to discover (for example) how often bootstrap-based confidence intervals
actually contain the model parameters. example4.ado does this, embedding data resampling within a Monte Carlo simulation.
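As a minimal sketch of the mechanics (a hypothetical, well-behaved model and an arbitrary seed, checking only the conventional t-based interval; example4.ado below wraps the same kind of loop around a deliberately misspecified model and adds the bootstrap intervals):

* Hypothetical illustration: 100 samples of n=80 from a well-behaved model,
* counting how often a nominal 90% t-based interval for the slope covers
* the true value of 2.
set seed 12345
local hits = 0
forvalues i = 1/100 {
    quietly drop _all
    quietly set obs 80
    quietly generate x = invnorm(uniform())
    quietly generate y = 1 + 2*x + invnorm(uniform())
    quietly regress y x
    local lo = _b[x] - invttail(e(df_r), .05)*_se[x]
    local hi = _b[x] + invttail(e(df_r), .05)*_se[x]
    if `lo' <= 2 & 2 <= `hi' {
        local hits = `hits' + 1
    }
}
display "nominal 90% intervals covered the true slope in " `hits' " of 100 samples"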
At the heart of example4.ado is a misspecified regression model. The usual standard errors and tests assume:
Y = β₀ + β₁X + e    [6]
with X fixed in repeated samples, and errors (e) normally, independently, and identically distributed (normal i.i.d.). But this
Monte Carlo simulation generates data according to the model:
Y = 0 + 3X + Xe    [7]
with e distributed as χ²(1) − 1. (Note that this has a mean of 0 and a variance of 2.) X values, drawn from a χ²(1) distribution,
vary randomly. Because Var(e) = 2 and the error enters as Xe, Var(Y|X) = 2X²: the error variance grows with the square of X. In Figure 3, 5,000 data points illustrate the problematic nature of model [7]: it challenges analysis with leverage,
outliers, skewed errors, and heteroscedasticity. A Monte Carlo experiment drawing 10,000 random n = 80 samples according to [7],
and analyzing them by ordinary least squares (OLS) reveals a nasty-looking sampling distribution (Figure 4). As expected, OLS
estimates are unbiased: the mean slope over 10,000 random samples (b = 2.99988) is indistinguishable from β = 3. Otherwise,
model [7] demolishes the usual OLS assumptions, and also those of residual resampling. Can data resampling still produce valid
inferences?
example4.ado explores this question. As listed here, it calls for 100 Monte Carlo samples of n = 80, with B = 2,000 bootstrap
iterations per sample. (Results reported later represent 400 Monte Carlo samples, however.) For each Monte Carlo sample,
it obtains “90% confidence” intervals based on standard t-table procedures and three bootstrap methods: the percentile method, using
the 5th and 95th percentiles of the bootstrap slopes; Hall’s “hybrid” percentile-reversal method (equation [3]); and the studentized or percentile-t method (equation [5]).
Finally, it calculates the width of each interval and checks whether the interval actually contains the parameter β = 3.
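With b the original-sample slope, SE its estimated standard error, b*(.05) and b*(.95) the 5th and 95th percentiles of the B = 2,000 bootstrap slopes, and t*(.05) and t*(.95) the corresponding percentiles of the studentized bootstrap statistics t* = (b* − b)/SE* (SE* being the standard error estimated within each bootstrap sample), the three bootstrap intervals take the usual forms (a compact restatement; equations [3] and [5] give the exact definitions used):

percentile:      [ b*(.05),  b*(.95) ]
hybrid:          [ 2b − b*(.95),  2b − b*(.05) ]
percentile-t:    [ b − t*(.95)·SE,  b − t*(.05)·SE ]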
program define example4
* Monte Carlo simulation of bootstrap confidence intervals for a misspecified
* (heteroscedastic, nonnormal errors) regression model. Generates 100 Monte
* Carlo samples, and resamples each of them 2,000 times.
drop _all
set more 1
set maxobs 2100
set seed 33333
capture erase example4.log
macro define _mcit=1
while %_mcit<101 {
quietly drop _all
quietly set obs 80
quietly generate X=(invnorm(uniform()))^2
quietly generate Y=3*X+X*((invnorm(uniform()))^2-1)
* Previous two lines define the true model.
quietly regress Y X
macro define _orb=_b[X]
macro define _orSE=%_orb/sqrt(_result(6))
* Perform the original-sample regression, storing slope
* as _orb and standard error _orSE.
quietly generate XX=.
quietly generate YY=.
quietly generate randnum=.
macro define _bsample=1
capture erase bstemp.log
log using bstemp.log
log off
while %_bsample<2001 {
* Begin bootstrap iterations, indexed by _bsample.
quietly replace randnum=int(_N*uniform())+1
quietly replace XX=X[randnum]
quietly replace YY=Y[randnum]
quietly regress YY XX
* Data resampling, not assuming i.i.d. errors.
log on
display %_orb
display %_orSE
display _b[XX]
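* Record the original-sample slope and standard error, and this
* bootstrap sample's slope, in bstemp.log.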