Stata Technical Bulletin
The bootstrapping method of example1.ado, data resampling, generalizes to resampling entire cases. In two-variable
regression, this means we resample (X,Y) pairs as in example2.ado.2
program define example2
* data-resampling regression bootstrap
* assumes variables "Y" and "X" in "source.dta"
*
set more 1
drop _all
set maxobs 2000
* If source.dta contains > 2,000 cases, set maxobs higher.
quietly use source.dta
quietly drop if Y==. | X==.
save, replace
quietly regress Y X
macro define _coefX=_b[X]
* _coefX equals the original-sample regression coefficient on X
capture erase bootdat2.log
log using bootdat2.log
log off
set seed 1111
macro define _bsample 1
while %_bsample<1001 {
* For confidence intervals or tests, we need 2000 or more
* bootstrap samples.
quietly use source.dta, clear
generate randnum=int(_N*uniform())+1
quietly generate YY=Y[randnum]
quietly generate XX=X[randnum]
quietly regress YY XX
* The last three commands randomly resample (X,Y) pairs
* from the data.
macro define _bSE=_b[XX]/sqrt(_result(6))
log on
display %_bsample
display _b[_cons]
display _b[XX]
display %_bSE
display (_b[XX]-%_coefX)/%_bSE
* Calculated either way, this command obtains a
* studentized coefficient:
* (bootstrap coef. - original coef.)/SE of bootstrap coef.
display
log off
macro define _bsample=%_bsample+1
}
log close
drop _all
infile bsample bcons bcoefX bSE StucoefX using bootdat2.log
label variable bsample "bootstrap sample number"
label variable bcons "sample Y-intercept, b0"
label variable bcoefX "sample coefficient on X, b1"
label variable bSE "sample standard error of b1"
label variable StucoefX "studentized coefficient on X"
label data "regression boot/data resampling"
save boot2.dta, replace
end
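Once the program has run, boot2.dta holds one row per bootstrap sample. The lines below are a minimal sketch of how one might run it and take a first look at the results. They assume that example2.ado is installed where Stata can find it, that source.dta is in the working directory, and that the Stata release in use accepts the syntax of the listing above. Only summarize and its detail option are used.

```stata
* Run the bootstrap; this writes bootdat2.log and boot2.dta.
example2
use boot2.dta, clear
* Mean and standard deviation of the 1,000 bootstrap slopes.
* The standard deviation serves as a bootstrap estimate of
* the standard error of the coefficient on X.
summarize bcoefX
* Percentiles (including the 5th and 95th) of the bootstrap
* slopes and of the studentized slopes.
summarize bcoefX StucoefX, detail
```

The percentiles reported by the detail option give only a rough picture of the tails. As the comment inside the program notes, confidence intervals or tests call for 2000 or more bootstrap samples.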
Figure 2 shows two distributions obtained by bootstrapping the regression of New York air pollution on population
density. Data resampling (at top in Figure 2) does not make the usual regression assumptions of fixed X and independent,
identically distributed (i.i.d.) errors. Consequently it often yields larger standard error estimates and skewed, multimodal sampling
distributions. If the usual assumptions are false, we are right to abandon them, and bootstrapping may provide better guidance.
If the assumptions are true, on the other hand, data resampling is too pessimistic.
Since it scrambles the case sequence, data resampling is also inappropriate with time or spatial series. We could get bootstrap
time series in which 1969 appears three times, and 1976 not at all, for instance.
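A short worked calculation shows how often such gaps arise. It assumes N cases drawn N times with replacement and with equal probability, as in example2.ado:

```latex
P(\text{case } i \text{ is never drawn})
  = \left(1 - \frac{1}{N}\right)^{N}
  \approx e^{-1} \approx 0.368
  \quad \text{for large } N
```

So about 37 percent of the original cases (years, in a time series) are missing from a typical bootstrap sample, while others appear two or more times.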
Residual resampling, an alternative regression bootstrap approach, retains the fixed-X and i.i.d.-errors assumptions. Residuals
from the original-sample regression, divided by $\sqrt{1 - K/N}$, are resampled and added to original-sample predicted Y values to generate
bootstrap Y* values, which then are regressed on original-sample X values. example3.ado illustrates, using the same two-variable
model as example2.ado. Results appear at bottom in Figure 2. Comments explain features new since example2.ado.
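In symbols, the residual-resampling procedure can be sketched as follows. The notation is an assumption made for illustration: the scaling factor is read as the square root of 1 - K/N, with K the number of estimated parameters (2 in the two-variable model) and N the number of cases; Ŷ denotes the fitted values from the original-sample regression, to which the resampled residuals are added; and j(i) is a hypothetical label for the index drawn for case i.

```latex
\begin{aligned}
e_i &= Y_i - \hat{Y}_i
  && \text{original-sample residuals} \\
\tilde{e}_i &= \frac{e_i}{\sqrt{1 - K/N}}
  && \text{rescaled residuals} \\
Y_i^{*} &= \hat{Y}_i + \tilde{e}_{j(i)},
  \quad j(i) \text{ drawn with replacement from } \{1, \dots, N\}
  && \text{bootstrap } Y^{*} \text{ values}
\end{aligned}
```

Each bootstrap sample then regresses Y* on the original X values, so X stays fixed across samples and only the errors are resampled.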