Stata Technical Bulletin
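where $\bar{x}$ and $s$ are the mean and s.d. of the $x$, and
$$Z(Y; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(Y-\mu)^2/2\sigma^2}$$
is the density function of a normally distributed variable $Y$ with mean $\mu$ and s.d. $\sigma$. The confidence interval for $C_q$ is
$$\left(c_q - z_{100(1-\alpha)}\, s_q,\; c_q + z_{100(1-\alpha)}\, s_q\right).$$
meansd case. The value of $c_q$ is $\bar{x} + z_q \times s$. Its s.e. is given by the formula
$$s_q^* = s\sqrt{1/n + z_q^2/(2n-2)}.$$
The confidence interval for $C_q$ is $\left(c_q - z_{100(1-\alpha)} \times s_q^*,\; c_q + z_{100(1-\alpha)} \times s_q^*\right)$.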
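As a numerical illustration of the meansd formulas above, the following Python sketch computes the centile estimate, its s.e., and the interval. It is not Stata code. The function name meansd_centile and the argument ci_centile (the centile of the standard normal used as the interval multiplier) are assumptions made for this example only.

```python
import math
from statistics import NormalDist, mean, stdev

def meansd_centile(x, q, ci_centile=97.5):
    """Estimate the q-th centile of x assuming normality (meansd case).

    ci_centile is a hypothetical argument: the centile of the standard
    normal whose value multiplies the s.e. to form the interval.
    """
    n = len(x)
    xbar = mean(x)
    s = stdev(x)                          # sample s.d. (n - 1 divisor)
    z_q = NormalDist().inv_cdf(q / 100)   # q-th centile of N(0, 1)
    c_q = xbar + z_q * s                  # centile estimate
    se = s * math.sqrt(1 / n + z_q ** 2 / (2 * n - 2))
    z_ci = NormalDist().inv_cdf(ci_centile / 100)
    return c_q, se, (c_q - z_ci * se, c_q + z_ci * se)

c_q, se, (lower, upper) = meansd_centile([1, 2, 3, 4, 5], q=50)
```

For q = 50 the multiplier z_q is zero, so the estimate reduces to the sample mean and the s.e. to s divided by the square root of n.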
References
Conover, W. J. 1980. Practical Nonparametric Statistics. 2d ed. New York: John Wiley & Sons.
Kendall, M. G. and A. Stuart. 1969. The Advanced Theory of Statistics, Vol. I. 3d ed. London: Griffin.
Mood, A. M. and F. A. Graybill. 1963. Introduction to the Theory of Statistics. 2d ed. New York: McGraw-Hill.
sg8 Probability weighting
William Rogers, CRC, FAX 310-393-7551
Stata 3.0 introduced what is, to many, a new kind of weight, the pweight or sampling weight, alongside
the better-known fweights (frequency weights) and aweights (analytic weights).
fweights are conceptually easy—you have data where each observation reflects one or more real observations. fweights
are most easily thought of as a data-compression scheme. An observation might record income, age, etc., and a weight, say
5, meaning that this observation really reflects 5 people with exactly the same income, age, etc. The results of estimating a
frequency-weighted regression are exactly the same as duplicating each observation so that it appears in the data as many times
as it should and then estimating the unweighted regression. There are really no statistics here, just data management.
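The equivalence between frequency weighting and duplicating observations can be checked numerically. The sketch below is in Python rather than Stata; the helper ols and the small data set are invented for illustration, and only the coefficients (not the standard errors) are compared.

```python
def ols(x, y, w=None):
    """Return (intercept, slope) of a weighted least-squares line."""
    if w is None:
        w = [1] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Compressed data: each row carries a frequency weight (hypothetical values).
age = [25, 40, 55]
income = [30.0, 52.0, 61.0]
freq = [5, 2, 3]

# Expanded data: each row repeated as many times as its weight says.
age_long = [a for a, f in zip(age, freq) for _ in range(f)]
income_long = [v for v, f in zip(income, freq) for _ in range(f)]

weighted = ols(age, income, freq)      # frequency-weighted fit on 3 rows
expanded = ols(age_long, income_long)  # unweighted fit on 10 rows
```

The two fits agree up to floating-point rounding, which is the sense in which fweights are a data-compression scheme.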
aweights do solve a real statistical problem. The data you are analyzing reflect averages. You do not know each individual's
income, age, etc.; you know only the average income in data grouped on age, etc. Weighting is important when analyzing such
data because the accuracy of the averages increases as the sample size over which the average was calculated increases. An
observation based on averages of 1,000 people is relatively more important than an observation in the same data based on
an average of 5 people. In a regression context, for instance, mispredicting the 1,000-person average is far more serious than
mispredicting, by the same amount, the 5-person average.
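A small Python sketch may make the role of group size concrete. If individuals are independent with a common variance, the variance of an n-person average is that variance divided by n, which is why each average is weighted in proportion to n. The age bands and incomes below are invented for illustration.

```python
from statistics import mean

# Hypothetical individual incomes, grouped by age band.
groups = {
    "20-29": [21.0, 25.0, 23.0, 27.0, 24.0, 30.0],
    "30-39": [35.0, 41.0],
    "40-49": [48.0, 52.0, 50.0, 46.0],
}

# What the analyst actually has: one average per group and the group size.
averages = [mean(v) for v in groups.values()]
sizes = [len(v) for v in groups.values()]

# Analytic-weighted mean: each average counts in proportion to its size.
aweighted = sum(n * a for n, a in zip(sizes, averages)) / sum(sizes)

# Unweighted mean treats the 2-person average like the 6-person one.
unweighted = mean(averages)

# Mean over all individuals, normally unobserved in grouped data.
pooled = mean([x for v in groups.values() for x in v])
```

Here the weighted mean of the averages reproduces the mean over all individuals, while the unweighted mean of the averages does not.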
pweights solve another statistical problem. You have data in which each observation is an individual, not a group
average, but some individuals were more likely to appear in your data than others. An observation with a small
probability of appearing, and therefore a large pweight (the pweight is the inverse of the sampling probability), is not in any
sense a more accurate measurement of, say, earnings than the earnings recorded in an observation more likely to appear in the
data. The adjustment made to standard errors is therefore in no way related to the adjustment made to standard errors in
the aweight case. What is related is the adjustment made to the parameter estimates themselves: aweights and pweights adjust
means and regression coefficients in the same way. An observation with a high weight contributes more information on the mean
because, in the case of aweights, it is a more precise estimate and, in the case of pweights, it was less likely to be
sampled and is therefore reflective of a larger underlying population.
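The effect of pweights on a mean can be sketched with a deliberately artificial example in Python. The two strata, their sizes, and the earnings values are all hypothetical; everyone within a stratum has the same earnings so that the arithmetic is easy to follow. Only the point estimate is shown, not the standard-error adjustment.

```python
# Hypothetical population: 900 people in stratum A, 100 in stratum B.
# Sample: 9 people from A (sampling probability 9/900 = 0.01) and
#         10 people from B (sampling probability 10/100 = 0.10).
sample = [(10.0, 0.01)] * 9 + [(50.0, 0.10)] * 10  # (earnings, probability)

values = [v for v, _ in sample]
pweights = [1 / p for _, p in sample]  # inverse of the sampling probability

# Ignoring the design overstates the oversampled stratum B.
unweighted_mean = sum(values) / len(values)

# Weighted mean: same formula as for aweights, with pweights as the weights.
pweighted_mean = sum(w * v for w, v in zip(pweights, values)) / sum(pweights)

# True population mean, known here only because the population is invented.
population_mean = (900 * 10.0 + 100 * 50.0) / 1000
```

The pweights sum to the population size of 1,000, and the weighted mean matches the population mean of 14, whereas the unweighted sample mean is pulled toward the oversampled stratum.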
pweighted data can arise both intentionally and unintentionally. One might intentionally oversample blacks relative to
whites (as is common in many social-science surveys) or the sick relative to the well (as is common in many epidemiological
studies). Alternatively, imagine a survey that is administered by mail and also imagine, as is typical, that certain types of
respondents are found, ex post, to have been more likely to respond than others. The group less likely to respond thus reflects
a larger underlying population, but the measurements on the individuals we do have are no more (or less) accurate than any of
our other measurements.
When one begins to consider how a sample is obtained, another issue arises, that of clustered sampling, an issue related to,
but conceptually different from, pweights. Let me first describe how a sample might come to be clustered and then consider
the statistical issues of such clustering.
Assume you are going to survey a population and that you will do this by sending interviewers into the field. It will be
more convenient (i.e., cheaper) if each interviewer can interview persons who are geographically close to each other, so you