Stata Technical Bulletin
STB-20
ip6.1 Data and matrices
William Gould, Stata Corporation, FAX 409-696-4601
Heinecke’s mkmat program (ip6) provides a useful addition to Stata’s matrix commands. The addition is so useful, in
fact, that you may wonder how it was ever omitted from Stata. Indeed, we must admit that a popular question among Stata’s
matrix-programming language users is how to create data matrices.
Heinecke’s program provides a solution, but it is a solution that will work only with small data set sizes. Stata limits
matrices to being no more than matsize × matsize which, by default, means 40 × 40 and, even with Intercooled Stata, means no
more than 400 × 400. Such limits appear to contradict Stata’s claims of being able to process large data sets. By limiting Stata’s
matrix capabilities to matsize × matsize, has not Stata’s matrix language itself been limited to data sets no larger than matsize?
It would certainly appear so; in the simple matrix calculation for regression coefficients $(X'X)^{-1}X'y$, X is an n × k matrix
(n being the number of observations and k the number of variables) and, given the matsize constraint, n must certainly be less
than 400.
Our answer is as follows: Yes, X is limited in the way stated, but note that X'X is a mere k × k matrix and, similarly,
X'y only k × 1. Both these matrices are well within Stata’s matrix-handling capabilities and Stata’s matrix accum command
(see [6m] accum) can directly create both of them.
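
As an illustration of the point above, the following is a minimal sketch of forming regression coefficients from cross-product matrices without ever creating the n × k matrix X. It assumes a recent Stata release: the auto dataset loaded by sysuse and the function name invsym() are assumptions not found in the original text (older releases named the function syminv()). The variable choices are purely illustrative.

```stata
* Sketch only: OLS coefficients from X'X and y'X, with no data matrix X.
sysuse auto, clear

* X'X for X = (weight, mpg, constant); matrix accum appends the constant.
matrix accum XX = weight mpg

* y'X as a 1 x k row vector; the first variable listed plays the role of y.
matrix vecaccum yX = price weight mpg

* Because X'X is symmetric, y'X * inverse(X'X) is the transpose of
* inverse(X'X) * X'y, so b is a 1 x k row vector of coefficients.
matrix b = yX * invsym(XX)
matrix list b

* Cross-check against the built-in estimator.
regress price weight mpg
```

Both XX and yX have dimensions that depend only on the number of variables, so the same commands work regardless of the number of observations.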
Moreover, even if Stata could hold the n × k matrix X, it would still be more efficient to use matrix accum to form
X'X. X'X, interpreted literally, says to load a copy of the data, transpose it, load a second copy of the data, and then form
the matrix product. Thus, two copies of the data occupy memory in addition to the original copy Stata already had available
(and from which matrix accum could directly form the result with no additional memory use). For small n, the inefficiency
is not important but, for large n, the inefficiency can make the calculation infeasible. (For instance, with
n = 12,000 and k = 6, the additional memory use is 1,125K bytes.)
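
The memory figure above can be checked with a line of arithmetic. The sketch assumes each matrix element is stored as an 8-byte double; the text does not state the element size, but that assumption reproduces the 1,125K figure.

```stata
* Bytes in one n x k copy of the data: n * k * 8
display 12000 * 6 * 8

* Two additional copies (X and its transpose), expressed in K bytes
display 2 * 12000 * 6 * 8 / 1024
```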
More generally, matrices in statistical applications tend to have dimension k × k, n × k, and n × n, with k small and
n large. Terms dealing with the data are of the generic form $X'_{k_1 \times n} W_{n \times n} Z_{n \times k_2}$. (X'X fits the generic form with X = X,
W = I, and Z = X.) Matrix programming languages are not capable of dealing with the deceptively simple calculation X'WZ
because of the staggering size of the W matrix. For n = 12,000, storing W requires a little more than a gigabyte of memory.
In statistical formulas, however, W is given by formula and, in fact, never needs to be stored in its entirety. Exploitation of this
fact is all that is needed to resurrect the use of a matrix programming language in statistical applications. Matrix programming
languages may be inefficient because of copious memory use, but in statistical applications, the inefficiency is minor for matrices
of size k × k or smaller. Our design of the various matrix accum commands allows calculating terms of the form X'WZ and
this one feature, we have found, is all that is necessary to allow efficient and robust use of matrix languages.
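
A minimal sketch of the idea, for the special case where W is diagonal and therefore fully described by one weight per observation. The auto dataset, the weight variable w, and the choice of X and Z are all hypothetical; 8-byte elements are again assumed for the storage calculation. The weights enter through matrix accum's iweight option, and X'WZ is read off as a block of the accumulated cross-product matrix.

```stata
* Storage for a full n x n matrix W at n = 12,000, in gigabytes
display 12000^2 * 8 / 2^30

sysuse auto, clear

* W = diag(w): only the n diagonal entries are ever stored, as a variable.
generate double w = 1 / weight

* Accumulate (X,Z)'W(X,Z) for X = (mpg, length) and Z = (price).
matrix accum A = mpg length price [iweight = w], noconstant

* X'WZ is the upper-right 2 x 1 block of A.
matrix XWZ = A[1..2, 3..3]
matrix list XWZ
```

Non-diagonal forms of W are handled by other members of the matrix accum family; this sketch covers only the diagonal case.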
Programs for creating data matrices, such as the one offered by Heinecke, are useful for pedagogical purposes and, in addition,
I can imagine myself using one in some specific application where Stata’s matsize constraint is not binding; it seems so natural.
On the other hand, it is important that general tools not be implemented by forming data matrices because such tools will
be drastically limited in terms of the data set size. Coding the problem in terms of the various matrix accum commands is
admittedly more tedious but, by abolishing data matrices from your programs, you will produce tools suitable for use on large
data sets.
os14 A program to format raw data files
Phillip Swagel, Department of Economics, Northwestern University
Stata can easily read raw data from ASCII files as long as the data are stored rectangularly. For example, the file
11 12 13
21 22 23
31 32 33
can be read by typing infile x1 x2 x3 using filename. In fact, this infile command will work even if the data are stored
in the following arrangement:
11 12 13
21 22
23 31 32 33
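
The behavior described above can be tried directly. This sketch assumes a Stata release that provides the file command, and the filename example.raw is hypothetical. It writes the irregular arrangement to disk and reads it back with the same infile command.

```stata
* Sketch only: write the irregular layout, then read it in free format.
file open fh using example.raw, write text replace
file write fh "11 12 13" _n
file write fh "21 22" _n
file write fh "23 31 32 33" _n
file close fh

clear
infile x1 x2 x3 using example.raw
list
```

In free format, infile fills x1, x2, and x3 in turn from successive values and ignores line breaks, so the nine values should yield three observations.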