PROVIDE Project Technical Paper 2005:1, February 2005
totals and recalculates them where necessary. Before any of the actual ‘cleaning up’ can start
the problem of missing values has to be investigated.
Usually missing values are coded in Stata as a dot (full stop). A large number of the
variables in IES 2000, fortunately only on the expenditure side, contain very large numbers of
missing values. Missing values in a Stata dataset create various problems. Any arithmetic
operation on a missing value yields a missing value, which becomes problematic if, for
example, total expenditure is to be calculated. Closer inspection revealed that large numbers
of missing values only occurred in those variables that relate to optional questions. This
created the suspicion that these are not true missing values, but rather a result of incorrect
coding by Statistics South Africa. To clarify matters, observations that are coded with a
full stop in the IES 2000 are classified into one of the following
three categories:
• Uncoded - Some questions in the IES 2000 questionnaire were optional. Optional sections
are preceded by a question that asks the respondent whether the expenses relating to that
section are relevant to the household. If they answer no they may skip the section. In
many instances Statistics South Africa coded expenses in these optional sections with
missing values when the section was skipped. These are defined as uncoded observations
and can be changed to zeroes.
• Miscoded - In some instances the question preceding an optional section was
answered in the negative, but positive expenses were nevertheless reported in the
section that followed. In these instances it is assumed that the preceding question
was miscoded and should have been coded as ‘yes’. Consequently the information
in the section is left as is.
• (True) missing values - The remaining missing values relate to respondents who should
have answered a section given their response to the preceding question, but failed to do
so. These are therefore true missing values. It can be argued that some of these missing
values are a result of miscoding, i.e. that the preceding question should have been coded
as ‘no’. However, there is no basis on which such an assumption can be made, and
consequently these values have to be treated as missing.
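The three categories above amount to a simple decision rule on two pieces of information: the answer to the preceding question and the reported expense. The following is a minimal sketch of that rule in Python, not the Stata code used for the actual cleaning. The function names (classify, clean, total) and the use of None to stand in for a full stop are assumptions made purely for illustration, as is the extra label ‘valid’ for observations that are not coded with a full stop and are not miscoded.

```python
from typing import Iterable, Optional


def classify(filter_yes: bool, amount: Optional[float]) -> str:
    """Classify one expense entry from an optional section.

    filter_yes: answer to the preceding question (True means 'yes').
    amount: reported expense, or None where the data hold a full stop.
    """
    if amount is None:
        # 'No' on the preceding question: the section was skipped (uncoded).
        # 'Yes' on the preceding question: the answer is genuinely absent.
        return "true missing" if filter_yes else "uncoded"
    if not filter_yes and amount > 0:
        # 'No' on the preceding question but a positive expense was reported.
        return "miscoded"
    return "valid"


def clean(filter_yes: bool, amount: Optional[float]) -> Optional[float]:
    """Return the value to carry forward after cleaning."""
    if classify(filter_yes, amount) == "uncoded":
        return 0.0  # uncoded observations are changed to zeroes
    return amount  # miscoded amounts are left as is; true missing stays None


def total(values: Iterable[Optional[float]]) -> Optional[float]:
    """Sum that mimics missing-value propagation: any None gives None."""
    values = list(values)
    if any(v is None for v in values):
        return None
    return sum(values)


# Hypothetical records: (answer to preceding question, reported expense).
records = [(False, None), (False, 120.0), (True, None), (True, 50.0)]
for filter_yes, amount in records:
    print(filter_yes, amount, classify(filter_yes, amount))
```

Recoding uncoded observations to zero is what allows a household total to be calculated at all: in this sketch, total([10.0, None]) is missing, whereas total([clean(False, None), 10.0]) is a number.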
All variables coded with a full stop were systematically analysed to determine in which
category they fall. Table 8 shows all the missing values (uncoded and true missing values) in
the IES 2000 database, as well as those that were miscoded. The number of missing values
reported in the original database is shown in column C. Only expense categories that