Stata Technical Bulletin
The Sampling Statistician runs a very simple regression, earnings on gender, and includes sampling weights to account for
the oversampling of blacks in the data. He reports that the ratio of female to male earnings is 70%.
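The Sampling Statistician's calculation can be sketched as follows. This is only an illustration in Python with NumPy, not the code used for the analysis described here: the six earnings figures and the weights are invented, and the helper name weighted_ols is hypothetical. It shows that a weighted regression of earnings on a female indicator reproduces the weighted group means, so the reported ratio is simply a ratio of weighted means.

```python
import numpy as np

def weighted_ols(X, y, w):
    """Solve the weighted normal equations (X'WX) b = X'Wy."""
    XtW = X.T * w                      # scale each observation by its weight
    return np.linalg.solve(XtW @ X, XtW @ y)

# Invented data: the first two observations in each gender come from an
# oversampled group (weight 1); the third stands for three people (weight 3).
earnings = np.array([70.0, 70.0, 120.0, 55.0, 55.0, 80.0])
female = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
weight = np.array([1.0, 1.0, 3.0, 1.0, 1.0, 3.0])

X = np.column_stack([np.ones_like(female), female])   # constant, female dummy

b_weighted = weighted_ols(X, earnings, weight)
b_unweighted = weighted_ols(X, earnings, np.ones_like(weight))

# Intercept = mean male earnings; intercept + slope = mean female earnings.
ratio_weighted = (b_weighted[0] + b_weighted[1]) / b_weighted[0]
ratio_unweighted = (b_unweighted[0] + b_unweighted[1]) / b_unweighted[0]

print("weighted ratio of female to male earnings:  ", round(ratio_weighted, 3))
print("unweighted ratio of female to male earnings:", round(ratio_unweighted, 3))
```

With these made-up numbers the two ratios differ, because the oversampled group has lower earnings and a different gender gap from the rest of the sample.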
The following are worth noting: (1) Using unweighted regression, the Econometrician produced an incorrect answer when
sloppy—that is, when the model was wrong; (2) the Sampling Statistician’s problem was easier than that of the Econometrician
and he had no chance of producing the wrong answer; (3) the careful Econometrician, on the other hand, not only produced the
right answer, but produced an answer that contained more information than that produced by the Sampling Statistician.
Let us now compare the approaches of the Econometrician and the Sampling Statistician on the issue of weights. The
Econometrician can be proved wrong by the data; given a set of sampling probabilities, the Econometrician may find that they
are related to the residual and may also discover that there is no set of independent variables in his model to free the residual
of this “unexplainable” correlation. On the other hand, the Sampling Statistician may be confronted by a sensitivity analysis
showing that the weights for which he has so carefully accounted do not matter, but in that case he will merely argue that the
inference can be made only under the assumption that they do matter and add that we merely happened to be lucky this time.
The Sampling Statistician will argue that if the Econometrician wants to estimate behavioral models, that’s fine, but that is still
no reason for ignoring the weights. If the Econometrician wants to perform a sensitivity analysis ex post and finds that the
weights do not matter, that’s fine too. But if the Econometrician simply ignores the weights, that is not fine.
So far, we have not really distinguished between sampling weights and clustering. Mathematically, these are two separate
issues. Sampling weights address the fact that observations need not have the same probability of appearing in the
data. Clustering addresses the fact that two observations may be related in a way not otherwise described by
the variables in the data. To adjust standard errors for both, the estimator needs to know the sampling weights and the cluster
to which each observation belongs.
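The two pieces of information the estimator needs can be made concrete with a small sketch in Python with NumPy. It is an illustration under stated assumptions, not the Stata implementation: the function name weighted_cluster_fit, the sampling probabilities, and the data are all invented, and no finite-sample correction is applied. The weights enter as inverse sampling probabilities; the cluster identifiers determine which weighted residual terms are summed together before they are squared.

```python
import numpy as np

def weighted_cluster_fit(X, y, w, cluster):
    """Weighted least squares with a cluster-robust (sandwich) covariance.

    Assumption: no finite-sample multiplier is applied to the covariance.
    """
    XtW = X.T * w                              # k x n, columns scaled by weights
    bread = np.linalg.inv(XtW @ X)             # (X'WX)^-1
    beta = bread @ (XtW @ y)                   # weighted point estimates
    resid = y - X @ beta
    scores = X * (w * resid)[:, None]          # row i is w_i * e_i * x_i
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(cluster):
        u = scores[cluster == g].sum(axis=0)   # sum the scores within cluster g
        meat += np.outer(u, u)
    return beta, bread @ meat @ bread

# Invented data: sampling probabilities, so weights are their inverses.
prob = np.array([0.3, 0.3, 0.1, 0.3, 0.3, 0.1])
w = 1.0 / prob
y = np.array([70.0, 70.0, 120.0, 55.0, 55.0, 80.0])
female = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
cluster = np.array([0, 1, 2, 0, 1, 2])         # hypothetical household ids
X = np.column_stack([np.ones_like(female), female])

beta, cov = weighted_cluster_fit(X, y, w, cluster)
se_cluster = np.sqrt(np.diag(cov))

# Treating every observation as its own cluster gives the weighted,
# heteroskedasticity-robust version of the same calculation.
_, cov_indep = weighted_cluster_fit(X, y, w, np.arange(len(y)))
se_indep = np.sqrt(np.diag(cov_indep))

print("coefficients:          ", beta)
print("clustered std. errors: ", se_cluster)
print("unclustered std. errors:", se_indep)
```

Rescaling all the weights by a constant changes neither the coefficients nor this covariance, which is why only the relative sampling probabilities matter here.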
However, the Econometrician and the Sampling Statistician again take characteristically different approaches. The
Econometrician treats the clustering as if it were another element that needs to be modeled, and then proceeds as if the revised model
is correct. So he may introduce heterogeneity parameters and try to estimate them. “Variance components” models are one way
this is done. The Sampling Statistician wants his regression coefficients to reflect means or differences in means. He is more
interested in correcting the standard errors of the analysis he is already doing.
In attempting to estimate efficiently, the sloppy Econometrician may unwittingly downweight large clusters, since they
have less information per observation. From the Sampling Statistician’s point of view, this potentially introduces a bias. However,
the careful Econometrician gains additional information on relationships among clustered observations that may be useful in
understanding the phenomenon under study.
How do you know where a given analysis fits, philosophically speaking? Econometricians sometimes use (or are forced by
data availability to use) reduced form models, in which case they should behave as if they were Sampling Statisticians. Sampling
Statisticians may sometimes use maximum-likelihood methods, but that does not make them Econometricians (logit analysis, for
example, can be a fancy way to compare proportions with adjustment). In short, if the analysis is anything less than an all-out
attempt at behavioral modeling, or if weighted analysis changes the results substantially, it needs to be considered from the
viewpoint of the Sampling Statistician.
This is where Huber’s method (implemented in Stata; see [5s] huber) is helpful. The commands that implement it take the philosophical
approach of the Sampling Statistician. With the Huber method, weighted or clustered problems can be estimated using regression,
logit, or probit estimation techniques. The results differ from the aweighted (analytically weighted) answers only in their standard errors.
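For the linear regression case, one common way to write the calculation is given below. It is offered as a sketch rather than as the exact formula used by the Stata commands: the notation is an assumption introduced here, and any finite-sample multiplier is omitted. Let W be the diagonal matrix of weights w_i, x_i the column vector of regressors for observation i, e_i its residual, and G the set of clusters.

```latex
\hat\beta = (X'WX)^{-1} X'Wy,
\qquad
\hat V_{\mathrm{Huber}} = (X'WX)^{-1}
  \Bigl( \sum_{g \in G} u_g u_g' \Bigr) (X'WX)^{-1},
\qquad
u_g = \sum_{i \in g} w_i \, e_i \, x_i .
```

The expression for the coefficients is the ordinary weighted one, which is why the point estimates agree with the aweighted answers; only the variance estimate in the middle expression changes.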
If you have sampling weights or clusters, even if you think you have the “right” model, Huber’s method is one way you can
check your assumption. If the answers are substantially different from your weighted analysis, you know you have a problem. If
your goal was to estimate a reduced form in any case, your problem is also solved. If your goal was to estimate a full behavioral
model, you now know it’s time to reconsider the functional form, the hypothesized variables, or selection effects.