


Stata Technical Bulletin



The effect of being in age group 1 is now $\beta_{11}$; 2, $\beta_{12}$; and 3, $\beta_{13}$; and these results are independent of our (arbitrary) coding. The
only difficulty at this point is that the model is unidentified in the sense that there are an infinite number of $(\beta_0, \beta_{11}, \beta_{12}, \beta_{13})$
that fit the data equally well.

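To see this, pretend $(\beta_0, \beta_{11}, \beta_{12}, \beta_{13}) = (1, 1, 3, 4)$. Then the predicted values of $y$ for the various age groups are

$$
y = \begin{cases}
1 + 1 + X\beta_2 = 2 + X\beta_2 & \text{(age group 1)} \\
1 + 3 + X\beta_2 = 4 + X\beta_2 & \text{(age group 2)} \\
1 + 4 + X\beta_2 = 5 + X\beta_2 & \text{(age group 3)}
\end{cases}
$$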

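Now pretend $(\beta_0, \beta_{11}, \beta_{12}, \beta_{13}) = (2, 0, 2, 3)$. Then the predicted values of $y$ are

$$
y = \begin{cases}
2 + 0 + X\beta_2 = 2 + X\beta_2 & \text{(age group 1)} \\
2 + 2 + X\beta_2 = 4 + X\beta_2 & \text{(age group 2)} \\
2 + 3 + X\beta_2 = 5 + X\beta_2 & \text{(age group 3)}
\end{cases}
$$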

These two sets of predictions are indistinguishable: for age group 1, $y = 2 + X\beta_2$ regardless of which coefficient vector is used,
and similarly for age groups 2 and 3. This arises because we have 3 equations and 4 unknowns. Any solution is as good as any
other and, for our purposes, we merely need to choose one of them. The popular selection method is to set the coefficient on
the first indicator variable to 0 (as we have done in our second coefficient vector). This is equivalent to estimating the model:

$$
y = \beta_0 + \beta_{12}\delta_2 + \beta_{13}\delta_3 + X\beta_2
$$

How one selects a particular coefficient vector (identifies the model) does not matter. It does, however, affect the interpretation
of the coefficients.
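
The non-identification can be checked numerically. The following is a minimal Python sketch, not part of the original article; the helper names `group_intercepts` and `shift` are hypothetical. It leaves out the common $X\beta_2$ term, since that term is identical across age groups, and shows that moving a constant from the group effects into $\beta_0$ leaves every prediction unchanged.

```python
def group_intercepts(coefs):
    """Constant part of the prediction for age groups 1, 2, 3.

    coefs is ordered (b0, b11, b12, b13). The X*beta_2 term is omitted
    because it is the same for every age group.
    """
    b0, *group_effects = coefs
    return [b0 + effect for effect in group_effects]


def shift(coefs, c):
    """Add c to b0 and subtract c from each group effect (hypothetical helper)."""
    b0, *group_effects = coefs
    return (b0 + c, *[effect - c for effect in group_effects])


first = (1, 1, 3, 4)   # first coefficient vector from the text
second = (2, 0, 2, 3)  # second vector: coefficient on group 1 set to 0

# Both vectors imply the same constants: 2, 4, and 5.
print(group_intercepts(first))
print(group_intercepts(second))

# Any shift c yields another vector with identical predictions;
# c = 1 turns the first vector into the second.
print(shift(first, 1) == second)
```

Because `shift` works for any value of `c`, there is one such coefficient vector for every real number, which is the sense in which the model has infinitely many equally good solutions.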

For instance, we could just as well choose to omit the second group. In our artificial example, this would yield
$(\beta_0, \beta_{11}, \beta_{12}, \beta_{13}) = (4, -2, 0, 1)$ instead of $(2, 0, 2, 3)$. These coefficient vectors are the same in the sense that

$$
y = \begin{cases}
2 + 0 + X\beta_2 = 2 + X\beta_2 = 4 - 2 + X\beta_2 & \text{(age group 1)} \\
2 + 2 + X\beta_2 = 4 + X\beta_2 = 4 + 0 + X\beta_2 & \text{(age group 2)} \\
2 + 3 + X\beta_2 = 5 + X\beta_2 = 4 + 1 + X\beta_2 & \text{(age group 3)}
\end{cases}
$$

but what does it mean that $\beta_{13}$ can just as well be 3 or 1? We obtain $\beta_{13} = 3$ when we set $\beta_{11} = 0$, and so $\beta_{13} = \beta_{13} - \beta_{11}$
and $\beta_{13}$ measures the difference between age groups 3 and 1.

In the second case, we obtain $\beta_{13} = 1$ when we set $\beta_{12} = 0$, so $\beta_{13} - \beta_{12} = 1$ and $\beta_{13}$ measures the difference between
age groups 3 and 2. There is no inconsistency. According to our $\beta_{12} = 0$ model, the difference between age groups 3 and 1 is
$\beta_{13} - \beta_{11} = 1 - (-2) = 3$, exactly the same result we got in the $\beta_{11} = 0$ model.
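
The same arithmetic can be sketched in Python. This is an illustration rather than anything from the original article, and the function name `contrast` is hypothetical. It computes the difference between two group coefficients under each identification choice and shows that the difference does not depend on which group was omitted.

```python
def contrast(coefs, a, b):
    """Difference in predicted y between age groups a and b (numbered from 1).

    coefs is ordered (b0, b11, b12, b13); b0 and X*beta_2 cancel in the
    difference, so only the group coefficients matter.
    """
    group_effects = coefs[1:]
    return group_effects[a - 1] - group_effects[b - 1]


omit_first = (2, 0, 2, 3)    # beta_11 = 0
omit_second = (4, -2, 0, 1)  # beta_12 = 0

# Group 3 versus group 1 under both identifications: 3 - 0 and 1 - (-2).
print(contrast(omit_first, 3, 1), contrast(omit_second, 3, 1))
# Group 3 versus group 2 under both identifications: 3 - 2 and 1 - 0.
print(contrast(omit_first, 3, 2), contrast(omit_second, 3, 2))
```

Individual coefficients change with the omitted group, but every between-group difference is the same under both choices.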

The issue of interpretation, however, is important because it can affect the way one discusses results. Imagine you are
studying recovery after a coronary bypass operation. Assume the age groups are (1) children under 13 (you have 2 of them),
(2) young adults under 25 (you have a handful of them), (3) adults under 46 (of which you have more yet), (4) mature adults
under 56, (5) older adults under 65, and (6) elder adults. You follow the prescription of omitting the first group, so all of your
results are reported relative to children under 13. While there is nothing statistically wrong with this, readers will be suspicious
when you make statements like, “compared to young children, older and elder adults ...”. Moreover, it is likely that you will
have to end each statement with “although results are not statistically significant” because you have only 2 children in your
comparison group. Of course, even with results reported in this way, you can do reasonable comparisons (say to mature adults),
but you will have to do extra work to perform the appropriate linear hypothesis test using Stata’s test command.

In this case, it would be better if you forced the omitted group to be more reasonable, such as mature adults. There
is, however, a generic rule for automatic comparison group selection that, while less popular, tends to work better than the
omit-the-first-group rule. That rule is to omit the most prevalent group. The most prevalent is usually a reasonable baseline.
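
As a rough illustration of the most-prevalent rule, the Python sketch below picks the baseline from observed group codes. The function name `most_prevalent_group` and the group counts are hypothetical; only the 2 children come from the example above, and the remaining counts are invented for the demonstration.

```python
from collections import Counter


def most_prevalent_group(groups):
    """Return the group code that occurs most often.

    Ties go to the code encountered first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(groups).most_common(1)[0][0]


# Hypothetical sample: codes 1 to 6 for the six age groups. Only the count
# of 2 for group 1 comes from the text; the other counts are made up.
hypothetical_counts = {1: 2, 2: 5, 3: 12, 4: 20, 5: 15, 6: 9}
agegrp = [code for code, n in hypothetical_counts.items() for _ in range(n)]

baseline = most_prevalent_group(agegrp)
print(baseline)  # group 4 (mature adults) in this made-up sample
```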

In any case, the prescription for categorical variables is

1. Convert each $k$-valued categorical variable to $k$ indicator variables.

2. Drop one of the $k$ indicator variables; any one will do but dropping the first is popular, dropping the most prevalent is
probably better in terms of having the computer guess at a reasonable interpretation, and dropping a specified one often
eases interpretation the most.
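
The two steps can be sketched in Python as follows. This is a minimal illustration under assumed names (`make_indicators`, `omit`), not the procedure Stata itself uses.

```python
def make_indicators(values, omit):
    """Expand a categorical variable into indicators, dropping one level.

    Step 1: build one 0/1 indicator per distinct level (k in total).
    Step 2: drop the indicator for the level given by omit, leaving k - 1.
    """
    levels = sorted(set(values))
    if omit not in levels:
        raise ValueError(f"omit={omit!r} is not an observed level")
    indicators = {
        level: [1 if v == level else 0 for v in values] for level in levels
    }
    del indicators[omit]
    return indicators


agegrp = [1, 2, 3, 3, 2, 1, 3]  # hypothetical data
for level, column in make_indicators(agegrp, omit=1).items():
    print(level, column)
```

Passing the first level as `omit` gives the popular rule; passing the most frequent level or a level chosen by the analyst gives the other two rules.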


