PROVIDE Project Technical Paper 2005:1
2.2.6. Survey estimation in Stata
February 2005
Complex surveys typically have three characteristics: (1) the survey weights are inverse
probability weights; (2) the sample is drawn from clusters rather than from the entire
population; and (3) the data are stratified. Sampling weights, whether added to the data
ex post or designed beforehand, have to be used to adjust for differing selection
probabilities between observations. Failure to use weights will generally result in biased
estimates of population quantities. When the sample is drawn from clusters, observations
within the same cluster are not independent. Many statistical estimators assume
independence, and using these estimators without the correct adjustments will typically
result in standard errors that are too small. Finally, since stratification can reduce
estimates of standard errors, it is also necessary to adjust for it.
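The effect of inverse probability weights on a mean can be illustrated with a small numerical sketch. It is written in Python rather than Stata purely to show the arithmetic, and the incomes and selection probabilities are hypothetical values chosen for illustration; they are not taken from any survey data.

```python
# Hypothetical toy sample: low-income households are sampled with a much
# smaller probability than high-income households (assumed values).
incomes = [10000.0, 12000.0, 90000.0, 110000.0]
selection_prob = [0.001, 0.001, 0.01, 0.01]

# Inverse probability weights: each sampled household represents
# 1 / (selection probability) households in the population.
weights = [1.0 / p for p in selection_prob]

# The unweighted sample mean ignores the unequal selection probabilities.
unweighted_mean = sum(incomes) / len(incomes)

# The weighted mean divides the weighted total by the sum of the weights.
weighted_mean = sum(w * y for w, y in zip(weights, incomes)) / sum(weights)

print(f"unweighted mean: {unweighted_mean:.2f}")
print(f"weighted mean:   {weighted_mean:.2f}")
```

In this toy sample the over-represented high-income households pull the unweighted mean well above the weighted mean, which is the kind of bias the weights are meant to correct.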
Consider the following example.10 Suppose we wish to estimate the average total income
(variable totinc) of South African households. We can use the confidence interval command
(ci) to show the mean, standard error and 95% confidence interval. In the Stata output
below, unweighted data are used. This effectively means that the reported mean is the
sample mean, which is at best a crude estimate of the population mean.
. ci totinc

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      totinc |      26177    39186.44    638.5181        37934.91    40437.97
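As a check on how the columns of this output relate to each other, the interval limits can be approximately recovered from the reported mean and standard error. The sketch below is in Python rather than Stata and assumes a normal critical value in place of the t critical value; with 26177 observations the two are nearly identical, so the result should agree with the reported limits to within about 0.1.

```python
from statistics import NormalDist

# Values copied from the unweighted ci output above.
mean = 39186.44
std_err = 638.5181

# Assumption: use the normal 97.5th percentile instead of the t critical
# value, which is a close approximation at this sample size.
z = NormalDist().inv_cdf(0.975)

# A 95% confidence interval is the mean plus or minus z standard errors.
lower = mean - z * std_err
upper = mean + z * std_err

print(f"approximate 95% interval: {lower:.2f} to {upper:.2f}")
```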
If we use weights, Stata will compute a more accurate estimate of the population mean.
Since pweight does not work with the ci command, we allow Stata to choose the type of
weight.11 However, if we do wish to use the pweight option, we have to make use of the
svymean command. Initially only the svyset pweight wgtselect option is set, i.e. clustering
and stratification are ignored. The output of these two examples is listed below:
. ci totinc [weight = wgtselect]
(analytic weights assumed)

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      totinc |      26177    42793.12    653.4643        41512.29    44073.95
. svymean totinc

Survey mean estimation

pweight:  wgtselect                          Number of obs    =     26177
Strata:   <one>                              Number of strata =         1
10 The ies2000h.dta database is used for the example (see section 3). The weight variable wgtselect is used. (The
current version of the ies2000h.dta has changed slightly since these examples were run - KP
15/02/2005).
11 Alternatively, we can specify frequency weights (fweight), but then the truncated version of the weight,
fwgtselect, has to be used since fweight only allows integer weights. This will give similar means and
standard deviations (see section 2.2.5).
© PROVIDE Project