12
Stata Technical Bulletin
STB-20
Both the source (block.c) and a DOS executable (block.exe) are available on the STB-20 diskette. Unix users can modify
block.c and recompile it if they wish, although there are already several tools in Unix that provide the same service as block.
DOS users who have purchased DOS versions of Unix utilities may also have tools that replicate block. For me, block is a
simple, special-purpose tool. It does one job easily and well; it’s nice to have when you need it.
sg25 Interaction expansion
William Gould, Stata Corporation, FAX 409-696-4601
The syntax of xi is
xi term(s)
xi: any_stata_command varlist_with_terms ...
where a term is of the form:
i.varname                    or    I.varname
i.varname1*i.varname2              I.varname1*I.varname2
i.varname1*varname3                I.varname1*varname3
i.varname1|varname3                I.varname1|varname3
varname, varname1, and varname2 denote categorical variables and may be numeric or string. varname3 denotes a continuous,
numeric variable.
xi expands terms containing categorical variables into dummy variable sets by creating new variables and, in the second
syntax (xi: any_stata_command), executes the specified command with the expanded terms.
Background
The terms continuous, categorical, and indicator or dummy variables are used below. Continuous variables are variables
that measure something—such as height or weight—and at least conceptually can take on any real number over some range.
Categorical variables, on the other hand, take on a finite number of values each denoting membership in a subclass, for example
excellent, good, and poor—which might be coded 0, 1, 2 or 1, 2, 3 or even “Exc,” “Good,” and “Poor.” An indicator or dummy
variable—the terms are used interchangeably—is a special type of two-valued categorical variable that contains values 0, denoting
false, and 1, denoting true. The information contained in any k-valued categorical variable can be equally well represented by
k indicator variables. Instead of a single variable recording values representing excellent, good, and poor, one can have three
indicator variables, the first indicating the truth or falseness of “result is excellent,” the second “result is good,” and the third
“result is poor.”
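The equivalence between a k-valued categorical variable and k indicator variables can be sketched in a few lines. The sketch below is written in Python rather than Stata and is only an illustration of the idea, not of how xi is implemented; the function name expand_to_indicators and the sample data are hypothetical.

```python
def expand_to_indicators(values):
    """Return (levels, columns): one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    columns = {
        level: [1 if v == level else 0 for v in values]
        for level in levels
    }
    return levels, columns


# Hypothetical data: five observations of a three-valued string variable.
ratings = ["Exc", "Good", "Poor", "Good", "Exc"]
levels, columns = expand_to_indicators(ratings)

for level in levels:
    print(level, columns[level])
```

Each observation has exactly one indicator equal to 1, so the three columns together carry the same information as the original variable.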
xi provides a convenient way to convert categorical variables to dummy or indicator variables when estimating a model
(say with regress, logistic, etc.).
For instance, assume the categorical variable agegrp contains 1 for ages 20-24, 2 for ages 25-39, and 3 for ages 40-44.
(There is no one over 44 in our data.) As it stands, agegrp would be a poor candidate for inclusion in a model even if one
thought age affected the outcome. It would be poor because the coding would force the restriction that the effect of being in
the second age group must be twice the effect of being in the first and, similarly, the effect of being in the third must be three
times the first. That is, if one estimated the model,
$$ y = \beta_0 + \beta_1\,\mathrm{agegrp} + X\beta_2 $$
the effect of being in the first age group is $\beta_1$, the second $2\beta_1$, and the third $3\beta_1$. If the coding 1, 2, 3 is arbitrary, we could
just as well have coded the age groups 1, 4, and 9, and the effects would now be $\beta_1$, $4\beta_1$, and $9\beta_1$.
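The arithmetic behind this restriction is simple enough to check directly. The following Python sketch multiplies a single coefficient by each group's numeric code; the coefficient value 0.5 is a made-up number chosen only for illustration, and implied_effects is a hypothetical helper name.

```python
def implied_effects(beta1, codes):
    """Contribution of beta1 * agegrp for each group under a numeric coding."""
    return [beta1 * code for code in codes]


beta1 = 0.5  # hypothetical coefficient, for illustration only

effects_123 = implied_effects(beta1, [1, 2, 3])
effects_149 = implied_effects(beta1, [1, 4, 9])

print(effects_123)  # [0.5, 1.0, 1.5]
print(effects_149)  # [0.5, 2.0, 4.5]
```

The ratios between the group effects are fixed entirely by the codes, not by the data.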
The solution to this arbitrariness is to convert the categorical variable agegrp to a set of indicator variables $a_1$, $a_2$, and
$a_3$, where $a_i$ is 1 if the individual is a member of the $i$th age group and 0 otherwise. We can then estimate the model:
$$ y = \beta_0 + \beta_{11} a_1 + \beta_{12} a_2 + \beta_{13} a_3 + X\beta_2 $$
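One way to see why indicators remove the arbitrariness: when the only regressors are the group indicators (that is, assuming no other covariates X, a simplification made here for illustration), the least-squares fitted value for each observation is its group mean, whatever numeric labels the groups carry. The Python sketch below uses made-up data and a hypothetical helper group_means to show that relabeling the groups 1, 4, 9 leaves the fitted values unchanged.

```python
def group_means(codes, y):
    """Mean of y within each distinct value of codes."""
    sums, counts = {}, {}
    for code, value in zip(codes, y):
        sums[code] = sums.get(code, 0.0) + value
        counts[code] = counts.get(code, 0) + 1
    return {code: sums[code] / counts[code] for code in sums}


# Hypothetical outcome and age-group data, two observations per group.
y = [1.0, 3.0, 2.0, 4.0, 10.0, 12.0]
agegrp = [1, 1, 2, 2, 3, 3]
recoded = [code * code for code in agegrp]  # same groups labeled 1, 4, 9

means_a = group_means(agegrp, y)
means_b = group_means(recoded, y)

# Fitted value for each observation is the mean of its own group.
fitted_a = [means_a[code] for code in agegrp]
fitted_b = [means_b[code] for code in recoded]

print(fitted_a == fitted_b)  # True
```

Only group membership matters to the fit; the labels serve as names and nothing more.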