Stata Technical Bulletin
The effect of being in age group 1 is now β11; 2, β12; and 3, β13; and these results are independent of our (arbitrary) coding. The only difficulty at this point is that the model is unidentified in the sense that there are an infinite number of (β0, β11, β12, β13) that fit the data equally well.
To see this, pretend (β0, β11, β12, β13) = (1, 1, 3, 4). Then the predicted values of y for the various age groups are

        ⎧ 1 + 1 + Xβ2 = 2 + Xβ2   (age group 1)
    y = ⎨ 1 + 3 + Xβ2 = 4 + Xβ2   (age group 2)
        ⎩ 1 + 4 + Xβ2 = 5 + Xβ2   (age group 3)
Now pretend (β0, β11, β12, β13) = (2, 0, 2, 3). Then the predicted values of y are

        ⎧ 2 + 0 + Xβ2 = 2 + Xβ2   (age group 1)
    y = ⎨ 2 + 2 + Xβ2 = 4 + Xβ2   (age group 2)
        ⎩ 2 + 3 + Xβ2 = 5 + Xβ2   (age group 3)
These two sets of predictions are indistinguishable: for age group 1, y = 2 + Xβ2 regardless of which coefficient vector is used,
and similarly for age groups 2 and 3. This arises because we have 3 equations and 4 unknowns. Any solution is as good as any
other and, for our purposes, we merely need to choose one of them. The popular selection method is to set the coefficient on
the first indicator variable to 0 (as we have done in our second coefficient vector). This is equivalent to estimating the model:
    y = β0 + β12 C12 + β13 C13 + Xβ2
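This observational equivalence is easy to verify numerically. A minimal Python sketch, using the two coefficient vectors from the artificial example above (the function and variable names are mine, not part of the original exposition):

```python
# Two candidate coefficient vectors (b0, b11, b12, b13) from the text's
# artificial example: the unconstrained one and the one with b11 set to 0.
unconstrained = (1, 1, 3, 4)
first_group_zero = (2, 0, 2, 3)

def predict(coefs, group, xb2=0):
    """Linear predictor for an observation in the given age group.

    xb2 stands in for X*b2, the contribution of the other covariates.
    """
    b0, b11, b12, b13 = coefs
    group_effect = {1: b11, 2: b12, 3: b13}[group]
    return b0 + group_effect + xb2

# The two vectors produce identical predictions for every age group,
# so no amount of data can distinguish them.
for g in (1, 2, 3):
    assert predict(unconstrained, g) == predict(first_group_zero, g)
```

The assertion passes for all three groups, which is exactly what "an infinite number of coefficient vectors fit the data equally well" means in practice.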
How one selects a particular coefficient vector (identifies the model) does not matter. It does, however, affect the interpretation
of the coefficients.
For instance, we could just as well choose to omit the second group. In our artificial example, this would yield (β0, β11, β12, β13) = (4, -2, 0, 1) instead of (2, 0, 2, 3). These coefficient vectors are the same in the sense that
        ⎧ 2 + 0 + Xβ2 = 2 + Xβ2 = 4 - 2 + Xβ2   (age group 1)
    y = ⎨ 2 + 2 + Xβ2 = 4 + Xβ2 = 4 + 0 + Xβ2   (age group 2)
        ⎩ 2 + 3 + Xβ2 = 5 + Xβ2 = 4 + 1 + Xβ2   (age group 3)
but what does it mean that β13 can just as well be 3 or 1? We obtain β13 = 3 when we set β11 = 0, and so β13 = β13 - β11 and β13 measures the difference between age groups 3 and 1.
In the second case, we obtain β13 = 1 when we set β12 = 0, so β13 - β12 = 1 and β13 measures the difference between age groups 3 and 2. There is no inconsistency. According to our β12 = 0 model, the difference between age groups 3 and 1 is β13 - β11 = 1 - (-2) = 3, exactly the same result we got in the β11 = 0 model.
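The point that differences between group effects are identified, even though the individual coefficients are not, can be checked directly. A small Python sketch using the two parameterizations from the example (names are mine):

```python
# (b0, b11, b12, b13) under the two identification rules discussed above
omit_first = (2, 0, 2, 3)    # b11 constrained to 0
omit_second = (4, -2, 0, 1)  # b12 constrained to 0

def group_differences(coefs):
    """Pairwise differences between age-group effects: 3 vs 1, 3 vs 2, 2 vs 1.

    Unlike the coefficients themselves, these differences are identified.
    """
    _, b11, b12, b13 = coefs
    return (b13 - b11, b13 - b12, b12 - b11)

# Both parameterizations imply the same group-to-group differences.
assert group_differences(omit_first) == group_differences(omit_second) == (3, 1, 2)
```

Only the baseline against which the coefficients are reported changes; every comparison between groups is preserved.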
The issue of interpretation, however, is important because it can affect the way one discusses results. Imagine you are
studying recovery after a coronary bypass operation. Assume the age groups are (1) children under 13 (you have 2 of them),
(2) young adults under 25 (you have a handful of them), (3) adults under 46 (of which you have more yet), (4) mature adults
under 56, (5) older adults under 65, and (6) elder adults. You follow the prescription of omitting the first group, so all of your
results are reported relative to children under 13. While there is nothing statistically wrong with this, readers will be suspicious
when you make statements like, “compared to young children, older and elder adults ...”. Moreover, it is likely that you will
have to end each statement with “although results are not statistically significant” because you have only 2 children in your
comparison group. Of course, even with results reported in this way, you can do reasonable comparisons (say to mature adults),
but you will have to do extra work to perform the appropriate linear hypothesis test using Stata’s test command.
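The extra work that a linear hypothesis test performs on your behalf amounts to a standard Wald calculation on a difference of coefficients. A minimal sketch of the arithmetic, not of Stata's test command itself (all numbers below are hypothetical):

```python
import math

def wald_z(b_j, b_k, var_j, var_k, cov_jk):
    """z statistic for H0: beta_j - beta_k = 0.

    Requires the estimated variances of both coefficients and their
    covariance, all taken from the fitted model's covariance matrix.
    """
    se = math.sqrt(var_j + var_k - 2 * cov_jk)
    return (b_j - b_k) / se

# Hypothetical comparison of elder adults against mature adults:
z = wald_z(b_j=1.9, b_k=0.7, var_j=0.25, var_k=0.16, cov_jk=0.05)
```

The covariance term is why the comparison cannot be read off the standard errors of the two coefficients alone; that is the "extra work" the estimation software must do.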
In this case, it would be better if you forced the omitted group to be more reasonable, such as mature adults. There
is, however, a generic rule for automatic comparison group selection that, while less popular, tends to work better than the
omit-the-first-group rule. That rule is to omit the most prevalent group. The most prevalent is usually a reasonable baseline.
In any case, the prescription for categorical variables is
1. Convert each k-valued categorical variable to k indicator variables.
2. Drop one of the k indicator variables; any one will do, but dropping the first is popular, dropping the most prevalent is
probably better in terms of having the computer guess at a reasonable interpretation, and dropping a specified one often
eases interpretation the most.
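In Stata itself, the generate() option of tabulate will create the indicator variables. As a language-neutral sketch of the two-step prescription, here is a small Python version (the function and the made-up data are mine):

```python
from collections import Counter

def to_indicators(values, drop="most_prevalent"):
    """Convert a k-valued categorical into indicator columns, dropping one.

    drop="first" omits the lowest-coded category; drop="most_prevalent"
    omits the modal category, which usually makes a reasonable baseline.
    """
    levels = sorted(set(values))
    if drop == "first":
        omitted = levels[0]
    else:
        omitted = Counter(values).most_common(1)[0][0]
    columns = {lev: [1 if v == lev else 0 for v in values]
               for lev in levels if lev != omitted}
    return omitted, columns

agegrp = [3, 3, 2, 3, 1, 3, 2]         # made-up data; group 3 dominates
omitted, cols = to_indicators(agegrp)  # group 3 becomes the comparison group
```

With drop="most_prevalent", the comparison group is the one you are most likely to want to compare everything else against, which is the rationale behind the omit-the-most-prevalent rule.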