PROVIDE Project Technical Paper 2005:1
February 2005
When the cluster option is activated, the standard error increases and the confidence
interval widens compared to the previous example, in which clustering was ignored. Also, deff
increases substantially because clustering reduces precision, i.e. the variance
increases.¹³ When stratification is also taken into account, the standard error declines and
the confidence interval becomes narrower, in line with expectations (see section 2.2.3).
However, deff is still substantially higher than one.
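As a reminder of what deff measures, the design effect can be written as the ratio of the variance under the actual survey design to the variance that a simple random sample of the same size would give. The notation below is an assumption introduced for illustration and does not appear in the Stata output; the second expression is Kish's standard approximation for clusters of equal size, included only to indicate why clustering tends to push deff above one.

```latex
\[
\mathit{deff} \;=\; \frac{\widehat{V}_{\text{design}}(\hat{\theta})}{\widehat{V}_{\text{SRS}}(\hat{\theta})},
\qquad
\mathit{deff}_{\text{cluster}} \;\approx\; 1 + (m - 1)\,\rho
\]
```

Here m denotes the (assumed common) number of sampled households per cluster and ρ the intra-cluster correlation of the variable of interest. With ρ > 0, households in the same cluster resemble each other and deff exceeds one.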
In conclusion, the svy-commands are useful, and indeed important, whenever the
distribution of a variable is of concern. Income distribution data, for example, will
only be reliable when weights, clustering and stratification are specified. Test statistics will
also be more accurate. However, if the only concern is finding mean or total income or
expenditure (mean multiplied by the number of observations), normal analytic or frequency
weights will suffice, since these point estimates depend only on the weights.
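The point about means and totals can be illustrated with a minimal sketch in Python rather than Stata. The income values and weights are hypothetical, and the variable names totinc and weight are used only for illustration. The sketch assumes that, with frequency-type weights, the "number of observations" is the sum of the weights.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of value*weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_total(values, weights):
    """Weighted total: sum of value*weight over all sampled households."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical household incomes and sampling weights (illustrative only).
totinc = [12000.0, 30000.0, 55000.0, 150000.0]
weight = [900.0, 1100.0, 1000.0, 1000.0]

mean_income = weighted_mean(totinc, weight)
total_income = weighted_total(totinc, weight)

# Neither function takes cluster or stratum identifiers as input:
# the point estimates are determined by the weights alone.
print(mean_income, total_income)
```

Clustering and stratification would enter only when a standard error or confidence interval is computed for these estimates.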
2.3. Merging the IES 2000 and the LFS 2000:2
2.3.1. Overview
The IES 2000, unlike its predecessor, the IES 1995, contains enough information on
employment activities of household members to determine their occupation codes, industry
codes and wages or salaries. Employment data also appears in the LFS 2000:2 in somewhat
more detail. Therefore, depending on the information requirements, it may be unnecessary to
merge the two files. However, education data, which is available only in the LFS
2000:2, was recently required for the formation of new household groups for the PROVIDE SAM. As
a consequence it was necessary to merge these files, and hence the LFS 2000:2 employment
data became available within the IES 2000 in any event. Furthermore, since the LFS 2000:2 is
designed specifically to gather information on employment and related activities of the
population, the quality of the data is arguably better. For example, the IES 2000 asks only a
single question to determine a person’s occupation or industry code. In contrast, occupation
and industry codes in the LFS 2000:2 are based on a series of questions. Consequently there
are fewer ‘unspecified’ factors and industries in the LFS 2000:2 (see section 2.3.2).
Various researchers have encountered difficulties when merging the IES 2000 and LFS
2000:2 data files. Van der Berg et al. (2003a) find that when merging these datasets there are
a substantial number of observations for which age, gender and race variables do not match.
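The kind of consistency check that exposes such mismatches can be sketched as follows. This is a minimal illustration in Python, not the procedure used by the authors cited above. The key fields hhid and personno, the variable names and all values are hypothetical, since the identifiers actually used to link the two files are not given here.

```python
def merge_and_flag(ies_rows, lfs_rows,
                   key=("hhid", "personno"),
                   check=("age", "gender", "race")):
    """Match IES records to LFS records on `key` and compare the `check` fields.

    Returns three lists: merged records that agree on every check field,
    (key, differing_fields) pairs for records that disagree, and the keys
    of IES records with no LFS counterpart.
    """
    lfs_by_key = {tuple(r[k] for k in key): r for r in lfs_rows}
    matched, mismatched, unmatched = [], [], []
    for row in ies_rows:
        k = tuple(row[f] for f in key)
        other = lfs_by_key.get(k)
        if other is None:
            unmatched.append(k)
            continue
        diffs = [f for f in check if row[f] != other[f]]
        if diffs:
            mismatched.append((k, diffs))
        else:
            # Combine both records; IES values win where field names overlap.
            matched.append({**other, **row})
    return matched, mismatched, unmatched


# Hypothetical records (illustrative only; codes are arbitrary integers).
ies = [
    {"hhid": 1, "personno": 1, "age": 34, "gender": 2, "race": 1, "totinc": 42000},
    {"hhid": 1, "personno": 2, "age": 36, "gender": 1, "race": 1, "totinc": 0},
    {"hhid": 2, "personno": 1, "age": 51, "gender": 1, "race": 2, "totinc": 90000},
]
lfs = [
    {"hhid": 1, "personno": 1, "age": 34, "gender": 2, "race": 1, "educ": 12},
    {"hhid": 1, "personno": 2, "age": 63, "gender": 1, "race": 1, "educ": 10},
]

matched, mismatched, unmatched = merge_and_flag(ies, lfs)
print(len(matched), mismatched, unmatched)
```

Keeping mismatched and unmatched records in separate lists, rather than dropping them silently, makes it possible to count how many observations fail on each of age, gender and race before deciding how to treat them.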
13 Incidentally, deff will equal one if none of pweight, psu and strata is specified, since the variance is then
simply equal to the sample variance as computed before with ci totinc. When only pweight is specified, deff
increases to 1.23, which indicates that weighting (in this instance) increases the variability. A tabulation
of average weights by income decile will reveal that the weights attached to high-income households are
higher than those attached to low-income households. Thus, when weights are specified, the inequality in the
distribution of income will increase, since more weight is now attached to high-income households in the
sample.
© PROVIDE Project