Stata Technical Bulletin
STB-8
might choose geographical areas at random and then choose persons at random within the geographical area. If you do, the
sample is said to be geographically clustered.
The problem in clustered samples is that they may be overly homogeneous. Let’s assume that, in our example above, we
choose neighborhoods at random and then choose, at random, persons within each neighborhood. Assume also that we want to estimate
the mean of earnings for all persons living in the overall sample area. The average earnings that we observe in our sample,
calculated in the standard unweighted way (summing earnings and dividing by sample size), is a fine (unbiased) estimate of the
underlying mean of earnings. The estimated standard error of the mean, calculated in the unweighted way, however, is probably
not a good estimate of the accuracy of our measurement of the mean. Persons who live in the same neighborhood are similar to
each other, especially in earnings. Therefore, the variance of earnings within neighborhoods is too small, and we will underestimate
the standard error of mean earnings.
Another way of thinking about this is that adding one more person from a neighborhood already existing in our sample
does not add as much information as adding another person in a new neighborhood. Say our data contains N persons from
K neighborhoods. Using standard statistical formulas, it is the N that we use in calculating the standard error of the mean. If
neighborhoods were perfectly homogeneous (all residents within a neighborhood had identical earnings), it is K that
should enter our statistical formulas. We would have K true observations since, once we obtain one person from a neighborhood, we
know everything the neighborhood has to tell us. In reality, neighborhoods are not perfectly homogeneous, and the effective
number of observations is somewhere between K and N.
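The "somewhere between K and N" idea can be made concrete with a small sketch. It assumes equal-sized neighborhoods and uses the Kish design-effect approximation, neither of which appears in the text above; the intraclass correlation `icc` (0 for unrelated residents, 1 for perfectly homogeneous neighborhoods) and the function name are illustrative choices, not anything Stata provides.

```python
def effective_sample_size(n_persons, n_clusters, icc):
    """Approximate effective number of observations in a clustered sample.

    Assumes every cluster has the same size m = n_persons / n_clusters and
    applies the Kish design effect, deff = 1 + (m - 1) * icc.
    """
    m = n_persons / n_clusters              # persons per neighborhood
    design_effect = 1.0 + (m - 1.0) * icc   # variance inflation from clustering
    return n_persons / design_effect


N, K = 1000, 50  # hypothetical: 1,000 persons drawn from 50 neighborhoods

for icc in (0.0, 0.05, 0.5, 1.0):
    n_eff = effective_sample_size(N, K, icc)
    print(f"icc={icc:4.2f}  effective observations = {n_eff:7.1f}")
```

Under these assumptions, `icc = 0` returns N and `icc = 1` returns K, with every intermediate value falling in between. The unweighted standard error of the mean would then be too small by roughly the square root of the design effect.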
Dealing with sampling issues
Stata’s pweight modifier and “Huber” commands (see [5s] huber) deal with probability-weighted and clustered samples.
Without getting into the mathematics, it is important to understand that the formulas for analytically weighted data do not
handle the problem even though many researchers act as if they do. Most statistical packages allow only two kinds of weights—
frequency and analytical—and the non-software-developer researcher is often forced to treat probability-weighted samples as
if they were analytically weighted. The justification for this, other than convenience, is that the formulas are mathematically
related in that the adjustments made to the mean (or the estimated coefficients in a regression setting) are the same in either
case. However, it is not the adjustment to the mean that is important—that is solely an efficiency issue. It is the adjustment to
standard errors that is vitally important and, on these grounds, the adjustments are quite different.
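A minimal sketch of that distinction for the simplest case, a weighted mean, follows. The two standard-error formulas are common textbook forms (a rescaled-weight variance for analytic weights and a linearization-style variance for probability weights) and are assumptions for illustration; they are not claimed to reproduce Stata's output exactly, and the data are made up.

```python
import math


def weighted_mean(y, w):
    """Weighted mean; identical under both interpretations of the weights."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def se_analytic(y, w):
    """SE treating w as analytic weights: rescale weights to sum to n,
    form a weighted variance, and divide by n as in the unweighted formula."""
    n = len(y)
    m = weighted_mean(y, w)
    scale = n / sum(w)
    s2 = sum(scale * wi * (yi - m) ** 2 for wi, yi in zip(w, y)) / (n - 1)
    return math.sqrt(s2 / n)


def se_probability(y, w):
    """SE treating w as sampling weights: linearization-style estimator
    built from the squared weighted residuals (no clusters or strata)."""
    n = len(y)
    m = weighted_mean(y, w)
    total_w = sum(w)
    ssq = sum((wi * (yi - m) / total_w) ** 2 for wi, yi in zip(w, y))
    return math.sqrt(n / (n - 1) * ssq)


# Hypothetical data: three lightly weighted and two heavily weighted persons.
y = [10, 12, 11, 30, 50]
w = [1, 1, 1, 5, 10]

print("mean          :", weighted_mean(y, w))
print("SE (analytic) :", se_analytic(y, w))
print("SE (sampling) :", se_probability(y, w))
```

With equal weights the two functions return the same value, the ordinary standard error of the mean. They diverge only when the weights vary, which is the situation the paragraph above describes.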
How one deals with probability-weighted samples has generated much philosophical debate and is polarized along disciplinary
lines. Survey statisticians use sampling weights, whereas econometricians and sociologists, for instance, rarely use them (or even
know about them).
Econometricians begin with the following meta-theorem: you must weight when calculating means, but it is not necessary
to weight when estimating regression coefficients. The argument is as follows. You begin with the behavioral model
$$y_j = x_j\beta + \epsilon_j$$
and assume that the $\epsilon_j$ are independent and identically distributed for all observations in the sample, which is tantamount to assuming that
the model is correct. Under that assumption, no weighting is necessary; one simply estimates the coefficients and the standard
errors with ordinary (unweighted) regression.
The Survey Statistician, on the other hand, typically has little interest in fitting a complete behavioral model—he tends to
be more interested in describing the state of the world as it is today. For instance, the Survey Statistician may be interested in
means and their comparison (as in public opinion polls). Although the goals and statistics may be “simple,” (e.g. comparing
the earnings of men and women), such comparisons are sometimes cast in a regression framework to allow for adjustment
of other effects. Whether the Survey Statistician simply compares means or estimates regressions, his philosophy is to use a
sampling weight which is inversely proportional to the probability of being sampled and to calculate a standard error that takes
the weighting explicitly into account.
Let us consider a real problem: we want to estimate the difference in earnings between men and women, and we have data
that contains 50% white respondents and 50% black respondents. The Econometrician writes down an earnings model, including
all the things that might affect earnings. He estimates the model without weights. Let us first assume our Econometrician is
sloppy. His earnings equation includes race, sex, education, age, and so on. He finds that women earn 83% of the amount earned
by what appears to be an equivalent male. Now let’s assume our Econometrician is more careful. He examines the data carefully
and discovers that the earnings difference of men and women is different for blacks and whites: black women earn roughly the
same amount as black males, but white women earn 66% of the amount earned by white males. He therefore reports that the
earnings ratios are between 66 and 100%. Given that blacks comprise roughly 12% of the population, he reports an average
earnings ratio of 70%.
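The two headline figures can be checked with a few lines of arithmetic. The sketch below assumes the overall ratio is a share-weighted average of the two group ratios, which is how the 83% and 70% figures appear to be derived; the constant and function names are illustrative only.

```python
# Female-to-male earnings ratios reported in the example above.
RATIO_BLACK = 1.00   # black women earn roughly the same as black men
RATIO_WHITE = 0.66   # white women earn 66% of what white men earn


def average_ratio(share_black):
    """Average earnings ratio when a fraction share_black of persons is black."""
    return share_black * RATIO_BLACK + (1.0 - share_black) * RATIO_WHITE


sample_avg = average_ratio(0.50)       # sample is 50% black, 50% white
population_avg = average_ratio(0.12)   # population is roughly 12% black

print(f"sample composition    : {sample_avg:.2f}")
print(f"population composition: {population_avg:.2f}")
```

The unweighted regression implicitly averages over the sample's 50/50 racial composition. Reweighting to the population's composition moves the average from about 0.83 to about 0.70.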