Stata Technical Bulletin


STB-20


ip6.1 Data and matrices

William Gould, Stata Corporation, FAX 409-696-4601

Heinecke’s mkmat program (ip6) provides a useful addition to Stata’s matrix commands. The addition is so useful, in
fact, that you may wonder how it was ever omitted from Stata. Indeed, we must admit that a popular question among Stata’s
matrix-programming language users is how to create data matrices.

Heinecke’s program provides a solution, but it is a solution that will work only with small data set sizes. Stata limits
matrices to being no more than matsize × matsize which, by default, means 40 × 40 and, even with Intercooled Stata, means no
more than 400 × 400. Such limits appear to contradict Stata’s claims of being able to process large data sets. By limiting Stata’s
matrix capabilities to matsize × matsize, has not Stata’s matrix language itself been limited to data sets no larger than matsize?
It would certainly appear so; in the simple matrix calculation for regression coefficients, $(X'X)^{-1}X'y$, X is an n × k matrix
(n being the number of observations and k the number of variables) and, given the matsize constraint, n can be no more
than 400.

Our answer is as follows: Yes, X is limited in the way stated but note that X'X is a mere k × k matrix and, similarly,
X'y only k × 1. Both these matrices are well within Stata’s matrix-handling capabilities and Stata’s matrix accum command
(see [6m] accum) can directly create both of them.
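
The point can be illustrated outside Stata. The following is a minimal sketch in plain Python, not Stata code and not a description of how matrix accum is implemented: it forms X'X and X'y one observation at a time, so that only k × k and k × 1 results are ever held, and then solves for the coefficients. The function names and the toy data are hypothetical.

```python
def accumulate_crossproducts(rows, y, k):
    """Form X'X (k x k) and X'y (length k) one observation at a time.

    Only the accumulated results are stored; no n x k matrix is built here.
    """
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, yi in zip(rows, y):
        for i in range(k):
            xty[i] += x[i] * yi
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (m[i][k] - s) / m[i][i]
    return beta


# Toy data (hypothetical): y = 1 + 2*x exactly; the first column is the constant.
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

xtx, xty = accumulate_crossproducts(rows, y, k=2)
beta = solve(xtx, xty)
print(beta)
```

In the sketch, rows could equally be an iterator reading observations from disk; nothing in the accumulation depends on n beyond the length of the loop.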

Moreover, even if Stata could hold the n × k matrix X, it would still be more efficient to use matrix accum to form
X'X. X'X, interpreted literally, says to load a copy of the data, transpose it, load a second copy of the data, and then form
the matrix product. Thus, two copies of the data occupy memory in addition to the original copy Stata already had available
(and from which matrix accum could directly form the result with no additional memory use). For small n, the inefficiency
is not important but, for large n, the inefficiency can be such as to actually make the calculation infeasible. (For instance, with
n = 12,000 and k = 6, the additional memory use is 1,125K bytes.)
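
The figure in parentheses can be reproduced with a little arithmetic. The sketch below assumes, as the text does not state explicitly, that each matrix element occupies 8 bytes and that 1K means 1,024 bytes; under those assumptions two extra copies of a 12,000 × 6 matrix come to exactly 1,125K. The function name is hypothetical.

```python
def extra_kbytes(n, k, copies=2, bytes_per_element=8):
    """Memory, in K (1,024 bytes), for extra copies of an n x k matrix.

    Assumes 8-byte elements unless told otherwise.
    """
    return copies * n * k * bytes_per_element / 1024


extra = extra_kbytes(12000, 6)
print(extra)
```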

More generally, matrices in statistical applications tend to have dimension k × k, n × k, and n × n, with k small and
n large. Terms dealing with the data are of the generic form $X'_{k_1 \times n} W_{n \times n} Z_{n \times k_2}$. (X'X fits the generic form with X = X,
W = I, and Z = X.) Matrix programming languages are not capable of dealing with the deceptively simple calculation X'WZ
because of the staggering size of the W matrix. For n = 12,000, storing W requires a little more than a gigabyte of memory.
In statistical formulas, however, W is given by formula and, in fact, never needs to be stored in its entirety. Exploitation of this
fact is all that is needed to resurrect the use of a matrix programming language in statistical applications. Matrix programming
languages may be inefficient because of copious memory use, but in statistical applications, the inefficiency is minor for matrices
of size k × k or smaller. Our design of the various matrix accum commands allows calculating terms of the form X'WZ and
this one feature, we have found, is all that is necessary to allow efficient and robust use of matrix languages.
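
As an illustration of why W never needs to be stored, consider the simplest case, which is an assumption of this sketch rather than something the text restricts itself to: W is diagonal, with the i-th diagonal element given by a formula. Then X'WZ is a sum over observations of weighted outer products, and only a k1 × k2 accumulator is needed. The code is plain Python, not Stata, and the names and toy data are hypothetical. It also reproduces the size of a full W for n = 12,000, again assuming 8-byte elements.

```python
def accumulate_xwz(x_rows, z_rows, weight, k1, k2):
    """Form X'WZ for diagonal W without storing W.

    weight(i) returns the i-th diagonal element of W (0-based index).
    """
    out = [[0.0] * k2 for _ in range(k1)]
    for i, (x, z) in enumerate(zip(x_rows, z_rows)):
        w = weight(i)
        for a in range(k1):
            for b in range(k2):
                out[a][b] += x[a] * w * z[b]
    return out


# Toy data (hypothetical): three observations, weights 1, 2, 3.
x_rows = [[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
z_rows = [[1.0], [2.0], [3.0]]
result = accumulate_xwz(x_rows, z_rows, lambda i: float(i + 1), k1=2, k2=1)
print(result)

# Size of a fully stored n x n matrix W, assuming 8-byte elements.
n = 12000
w_gib = n * n * 8 / 1024**3
print(round(w_gib, 2))
```

With weight returning 1.0 for every observation and z_rows equal to x_rows, the same loop yields X'X, which is the special case W = I noted above.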

Programs for creating data matrices such as that offered by Heinecke are useful for pedagogical purposes and, in addition,
I can imagine myself using it in some specific application where Stata’s matsize constraint is not binding; it seems so natural.
On the other hand, it is important that general tools not be implemented by forming data matrices because such tools will
be drastically limited in terms of the data set size. Coding the problem in terms of the various matrix accum commands is
admittedly more tedious but, by abolishing data matrices from your programs, you will produce tools suitable for use on large
data sets.

os14 A program to format raw data files

Phillip Swagel, Department of Economics, Northwestern University

Stata can easily read raw data from ASCII files as long as the data are stored rectangularly. For example, the file

11 12 13
21 22 23
31 32 33

can be read by typing infile x1 x2 x3 using filename. In fact, this infile command will work even if the data are stored
in the following arrangement:

11 12 13
21 22
23 31 32 33
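
The reason both arrangements read identically can be sketched as follows. This is a plain Python illustration of free-format reading as described above, not Stata's implementation: values are taken in order as whitespace-separated tokens, line breaks are ignored, and every three tokens fill one observation. The function name is hypothetical.

```python
def read_free_format(text, nvars):
    """Read whitespace-separated numbers, nvars per observation.

    Line breaks are treated like any other whitespace.
    """
    tokens = text.split()
    if len(tokens) % nvars != 0:
        raise ValueError("number of values is not a multiple of nvars")
    return [
        [float(t) for t in tokens[i:i + nvars]]
        for i in range(0, len(tokens), nvars)
    ]


rectangular = "11 12 13\n21 22 23\n31 32 33\n"
ragged = "11 12 13\n21 22\n23 31 32 33\n"

obs_rect = read_free_format(rectangular, 3)
obs_ragged = read_free_format(ragged, 3)
print(obs_rect == obs_ragged)
```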


