PROVIDE Project Technical Paper 2005:1
February 2005
One hypothesis is that this is due to a mismatch of individuals within households rather than a
mismatch of households. This can occur when individuals in the LFS 2000:2 do not have the
same unique identification numbers (person numbers) as in the IES 2000.14 In order to avoid
these types of problems, the LFS 2000:2 is used throughout as the main source of
demographic data, while the IES 2000 is only used for household income and expenditure
data (excluding wage or salary income data). More problematic are the “irreconcilable
differences” between the LFS 2000:2 and the IES 2000 weights (Van der Berg et al., 2003a).
As a rule of thumb, the LFS 2000:2 person weights were used when working with
person-level data, while the IES 2000 household weights were used when working with
household-level income and expenditure data. Below we compare the demographic
and labour income data of the IES 2000 and LFS 2000:2.
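The rule of thumb for weights can be illustrated with a minimal sketch in Python (pandas). The column names (age, lfs_pwgt, hh_expenditure, ies_hhwgt) and the toy values are hypothetical and are not taken from the survey files; the sketch only shows that person-level statistics are weighted by the LFS person weight and household-level statistics by the IES household weight.

```python
import pandas as pd


def weighted_mean(df, value_col, weight_col):
    """Weighted mean of value_col using the survey weights in weight_col.

    Rows with a missing value or a missing weight are dropped first.
    """
    valid = df[[value_col, weight_col]].dropna()
    return (valid[value_col] * valid[weight_col]).sum() / valid[weight_col].sum()


# Hypothetical person-level records: weighted by the LFS person weight.
persons = pd.DataFrame({
    "age": [30, 50, 20],
    "lfs_pwgt": [100.0, 300.0, 100.0],
})

# Hypothetical household-level records: weighted by the IES household weight.
households = pd.DataFrame({
    "hh_expenditure": [1000.0, 3000.0],
    "ies_hhwgt": [200.0, 600.0],
})

mean_age = weighted_mean(persons, "age", "lfs_pwgt")
mean_expenditure = weighted_mean(households, "hh_expenditure", "ies_hhwgt")
```

Keeping the weight column as an explicit argument makes it harder to apply a person weight to household-level data by accident.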
2.3.2. Comparing IES 2000 and LFS 2000:2 data
When the IES 2000 is merged with the LFS 2000:2, 416 observations appear only in the
IES and 1 626 observations appear only in the LFS. Just over 98%
of the observations appear in both. As mentioned previously, we use the LFS 2000:2 as the
main source of demographic information on persons. However, for the 416 observations that
are unique to the IES 2000, demographic data is of course not available in the LFS 2000:2. For
these observations the IES data is used. This prevents the loss of a substantial number of
records. As explained in detail in section 4.2.1, variables are ‘created’ for factors (mergefact),
labour income (mergeinclabp), activities (mergeact), gender (mergegender), age (mergeage),
province (mergeprov), location (mergeloc), race (mergerace) and person weights
(mergepwgt) by using the LFS 2000:2 data as the basis and replacing missing data points
with IES 2000 data points. These merge-prefixed variables are therefore in some sense ‘combined’ IES
2000 and LFS 2000:2 variables.
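The merge and substitution described above can be sketched as follows in Python (pandas). This is an illustration under stated assumptions, not the procedure actually used for the paper: the identifier columns (hhid, personno) and the toy records are hypothetical, and only the names mergeage and mergegender are taken from the text.

```python
import pandas as pd

# Hypothetical LFS and IES person records keyed by household and person number.
lfs = pd.DataFrame({
    "hhid": [1, 1, 2],
    "personno": [1, 2, 1],
    "age": [34, 8, 51],
    "gender": ["F", "M", "M"],
})
ies = pd.DataFrame({
    "hhid": [1, 2, 3],
    "personno": [1, 1, 1],
    "age": [35, 51, 27],
    "gender": ["F", "M", "F"],
})

# The outer merge keeps records found in either survey; the indicator column
# records whether a row came from the LFS only, the IES only, or both.
merged = lfs.merge(
    ies,
    on=["hhid", "personno"],
    how="outer",
    suffixes=("_lfs", "_ies"),
    indicator=True,
)
match_counts = merged["_merge"].astype(str).value_counts().to_dict()

# Use the LFS value as the basis and fall back on the IES value only where
# the LFS value is missing.
merged["mergeage"] = merged["age_lfs"].combine_first(merged["age_ies"])
merged["mergegender"] = merged["gender_lfs"].combine_first(merged["gender_ies"])
```

In this sketch the LFS value is retained wherever it exists, even if the IES reports a different value for the same person; the IES value is used only for records that are absent from the LFS.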
The choice between the LFS and IES factor income data is, however, not a straightforward
one. Initial explorations revealed large numbers of outliers in the LFS data. Comparison of
the two sources revealed a fair degree of correlation for the majority of the observations, but
in many cases the difference was substantial. This required some further explorations and
eventually a new ‘combined’ factor income variable was created that contained more of the
IES data points than mergefact. In Figure 1 the average wage and number of observations
falling within each factor group are compared. Four income variables are compared, namely
the original LFS labour income (inclabp_lfsorig), the original IES labour income
(inclabp_iesorig), and two versions of the new ‘combined’ income variable, inclabp_old and
inclabp_new. This section discusses the comparison of the original labour income
variables. From the figure it is clear that the average wages reported in the LFS are generally
14 This hypothesis is later shown to be wrong.
© PROVIDE Project