22
Stata Technical Bulletin
STB-4
program define example3
* residual resampling regression bootstrap
* assumes variables "Y" and "X" in "source.dta"
*
set more 1
drop _all
set maxobs 2000
* If source.dta contains > 2,000 cases, set maxobs higher.
quietly use source.dta
quietly drop if Y==. | X==.
quietly regress Y X
capture predict Yhat
capture predict e, resid
quietly replace e=e/sqrt(1-((_result(3)+1)/_result(1)))
* Previous two commands obtain full-sample regression
* residuals, and "fatten" them, dividing by:
* sqrt(1 - K/_N)
* where K is # of model parameters and _N is sample size.
macro define _coefX=_b[X]
quietly save, replace
capture erase bootdat3.log
log using bootdat3.log
log off
set seed 1111
macro define _bsample 1
while %_bsample<1001 {
quietly use source.dta, clear
quietly generate ee=e[int(_N*uniform())+1]
quietly generate YY=Yhat+ee
quietly regress YY X
* We resample residuals only, then generate bootstrap
* Y values (called YY) by adding bootstrap residuals (ee)
* to predicted values from the original-sample
* regression (Yhat). Finally, regress these bootstrap
* YY values on original-sample X.
macro define _bSE=_b[X]/sqrt(_result(6))
log on
display %_bsample
display _b[_cons]
display _b[X]
display %_bSE
display (_b[X]-%_coefX)/%_bSE
display
log off
macro define _bsample=%_bsample+1
}
log close
drop _all
infile bsample bcons bcoefX bSE StucoefX using bootdat3.log
label variable bsample "bootstrap sample number"
label variable bcons "sample Y-intercept, b0"
label variable bcoefX "sample coefficient on X, b1"
label variable bSE "sample standard error of b1"
label variable StucoefX "studentized coefficient on X"
label data "regression boot/residual resampling"
save boot3.dta, replace
end
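For readers who want to trace the logic of example3 outside Stata, the following is a minimal Python sketch of the same residual-resampling steps: fit the full-sample regression, fatten the residuals by sqrt(1 - K/n), resample them with replacement, and refit on the fixed X values. It is an illustration under stated assumptions, not a translation of the program. The data are synthetic, the seed is arbitrary, and the names `ols` and `residual_bootstrap` are hypothetical.

```python
import math
import random


def ols(x, y):
    """Simple regression of y on x: returns (intercept, slope, slope SE)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, se


def residual_bootstrap(x, y, reps=1000, seed=12345):
    """Residual-resampling bootstrap; seed is an arbitrary assumption."""
    rng = random.Random(seed)
    n, k = len(x), 2  # k = number of model parameters (intercept, slope)
    b0, b1, _ = ols(x, y)
    yhat = [b0 + b1 * xi for xi in x]
    # "Fatten" the residuals by dividing by sqrt(1 - K/n).
    scale = math.sqrt(1 - k / n)
    e = [(yi - fi) / scale for yi, fi in zip(y, yhat)]
    rows = []
    for _ in range(reps):
        # Draw residuals with replacement; X and Yhat stay fixed.
        yy = [fi + e[rng.randrange(n)] for fi in yhat]
        c0, c1, se = ols(x, yy)
        # Store intercept, slope, SE, and the studentized slope.
        rows.append((c0, c1, se, (c1 - b1) / se))
    return b1, rows


# Synthetic data (hypothetical, for illustration only): Y = 2 + 0.5 X + noise.
data_rng = random.Random(0)
x = [data_rng.uniform(0, 10) for _ in range(100)]
y = [2 + 0.5 * xi + data_rng.gauss(0, 1) for xi in x]

b1, rows = residual_bootstrap(x, y)
slopes = [r[1] for r in rows]
mean_slope = sum(slopes) / len(slopes)
boot_se = math.sqrt(
    sum((s - mean_slope) ** 2 for s in slopes) / (len(slopes) - 1)
)
print(f"original slope {b1:.4f}")
print(f"bootstrap mean slope {mean_slope:.4f}, bootstrap SE {boot_se:.4f}")
```

With i.i.d. errors, as in this synthetic example, the bootstrap standard error should land close to the conventional regression standard error.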
To summarize our results in the regression of New York air pollution (Y) on population density (X):
                                   slope           standard error
original sample                    5.67 x 10^-6    7.13 x 10^-7
bootstrap (data resampling)        6.24 x 10^-6    21.0 x 10^-7
bootstrap (residual resampling)    5.66 x 10^-6    7.89 x 10^-7
Since they both assume fixed X and i.i.d. errors, results from residual resampling resemble results from the original-sample
regression (but with about 10% higher standard error). In contrast, data resampling obtains a standard error almost three times
the original-sample estimate, and a radically nonnormal distribution (skewness=3.6, kurtosis=18.3) centered right of the original-
sample regression slope. The differences in sampling distributions seen in Figure 2 dramatize how crucial the fixed-X and i.i.d.
errors assumptions are.
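The contrast between the two resampling schemes can be sketched on synthetic data. The Python example below is only an illustration under assumed conditions and does not use the New York data: the errors are deliberately made non-i.i.d. (their spread grows with distance from the center of X), and the sample size, seed, and helper names `slope` and `sd` are hypothetical choices. Under these assumptions, data resampling should report a larger standard error than residual resampling, because only the former lets the error spread travel with X.

```python
import math
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx


def sd(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


rng = random.Random(42)  # arbitrary seed
n, reps = 500, 400

# Synthetic data (hypothetical): error spread grows with distance from the
# center of X, violating the i.i.d.-errors assumption.
x = [rng.uniform(0, 10) for _ in range(n)]
y = [1 + 0.5 * xi + rng.gauss(0, 0.2 * (xi - 5) ** 2) for xi in x]

b1 = slope(x, y)
b0 = sum(y) / n - b1 * sum(x) / n
yhat = [b0 + b1 * xi for xi in x]
# Fattened residuals, dividing by sqrt(1 - K/n) with K = 2.
e = [(yi - fi) / math.sqrt(1 - 2 / n) for yi, fi in zip(y, yhat)]

resid_slopes, case_slopes = [], []
for _ in range(reps):
    # Residual resampling: X fixed, residuals drawn with replacement.
    yy = [fi + e[rng.randrange(n)] for fi in yhat]
    resid_slopes.append(slope(x, yy))
    # Data resampling: whole (X, Y) cases drawn with replacement.
    idx = [rng.randrange(n) for _ in range(n)]
    case_slopes.append(slope([x[i] for i in idx], [y[i] for i in idx]))

resid_se, case_se = sd(resid_slopes), sd(case_slopes)
print(f"residual-resampling SE: {resid_se:.4f}")
print(f"data-resampling SE:     {case_se:.4f}")
```

Residual resampling detaches each residual from its X value, so it reproduces the answer that holds when the fixed-X and i.i.d. assumptions are true, whether or not they are.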