Stata Technical Bulletin
presents a tabulation. In part, codebook makes this determination by the number of unique values of the variable. If the number
is 9 or fewer, codebook reports a tabulation; otherwise, it reports summary statistics. Specifying tabulate(15) would change the rule to
produce tabulations whenever a variable takes on 15 or fewer unique values.
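
The unique-value rule can be sketched in Python as follows. This is only an illustration of the threshold described above, not codebook's implementation. The function name choose_report and the use of None for a missing value are assumptions of the sketch, and, as noted, the count of unique values is only part of how codebook makes its decision.

```python
def choose_report(values, tabulate=9):
    """Return "tabulation" or "summary" from the number of unique values.

    `values` is a list of numbers; None stands in for a missing value
    (an assumption of this sketch) and is not counted as a value.
    """
    unique = {v for v in values if v is not None}
    return "tabulation" if len(unique) <= tabulate else "summary"


region = [1, 2, 3, 4, 2, 1, None]            # 4 unique values
tempjan = [x / 10 for x in range(200, 212)]  # 12 unique values

print(choose_report(region))                 # tabulation
print(choose_report(tempjan))                # summary
print(choose_report(tempjan, tabulate=15))   # tabulation
```

With the default threshold of 9, the 12-valued variable is summarized; raising the threshold to 15 switches it to a tabulation.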
The mv option, which we specified above, asks codebook to search the data to determine the pattern of missing values.
This is a CPU-intensive task, which is the only reason that mv is an option. The result is useful. For instance, in the case of the
last variable tempjuly, codebook reported that every time tempjan is missing, tempjuly is missing and vice versa. Looking
back up the output to the cooldd variable, codebook also reports that the pattern of missing values is the same for cooldd
and heatdd. In both cases, the correspondence is indicated with “<->”.
For cooldd, codebook also states that “tempjan==. --> cooldd==.”. The one-way arrow means that a missing tempjan
value implies a missing cooldd value, but a missing cooldd value does not necessarily imply a missing tempjan value.
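
The two kinds of arrows amount to comparing, for each pair of variables, the sets of observations in which each is missing. The following Python sketch shows one way to do that. It is not codebook's algorithm: the function name missing_relations, the use of None for a missing value, and the choice to skip variables that are never missing are all assumptions, and the data are made up.

```python
def missing_relations(data):
    """Return (a, arrow, b) triples for variables with related missing patterns.

    `data` maps variable names to equal-length lists; None marks a missing
    value (an assumption of this sketch).
    "<->" means a is missing exactly when b is missing;
    "-->" means a missing implies b missing, but not the reverse.
    """
    miss = {name: {i for i, v in enumerate(col) if v is None}
            for name, col in data.items()}
    names = list(data)
    relations = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if not miss[a] or not miss[b]:
                continue  # skip variables that are never missing
            if miss[a] == miss[b]:
                relations.append((a, "<->", b))
            elif miss[a] < miss[b]:      # proper subset: a missing implies b missing
                relations.append((a, "-->", b))
            elif miss[b] < miss[a]:
                relations.append((b, "-->", a))
    return relations


# Made-up data: tempjan and tempjuly are missing together; cooldd is
# missing in those observations and in one more.
data = {
    "tempjan":  [2.2, None, 31.3, 47.8, 55.1],
    "tempjuly": [70.1, None, 75.0, 80.2, 82.4],
    "cooldd":   [None, None, 1200, 2500, 3100],
}

for a, arrow, b in missing_relations(data):
    print(f"{a}==. {arrow} {b}==.")
```

Every pair of variables is compared, which is consistent with the remark above that the search is CPU-intensive.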
codebook has some other features worth mentioning. When codebook determines that neither a tabulation nor summary
statistics are appropriate, for instance, in the case of a string variable or in the case of a numeric variable taking on many values
all of which are labeled, it reports a few examples instead. In the example above, codebook did that for the variable name.
codebook is also on the lookout for common errors you might make in dealing with the data. In the case of string variables,
this includes leading, embedded, and trailing blanks. codebook informed us that name includes embedded blanks. If name ever
had leading or trailing blanks, it would have mentioned that, too.
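
The three kinds of blanks can be distinguished with simple string tests. The Python sketch below illustrates the distinction only. The function name blank_report is hypothetical, and the exact rules codebook applies are not stated in the text.

```python
def blank_report(strings):
    """Return the kinds of blanks found anywhere in a list of strings."""
    found = set()
    for s in strings:
        if s.startswith(" "):
            found.add("leading")
        if s.endswith(" "):
            found.add("trailing")
        if " " in s.strip(" "):   # a blank that survives trimming is embedded
            found.add("embedded")
    return sorted(found)


print(blank_report(["New York", "Boston"]))        # ['embedded']
print(blank_report([" Boston", "Los Angeles "]))   # ['embedded', 'leading', 'trailing']
```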
Another feature of codebook—this one for numeric variables—is to determine the units of the variable. For instance,
tempjan and tempjuly both have units of .1, meaning that temperature is recorded to tenths. codebook handles precision
considerations (note that tempjan and tempjuly are floats) in making this determination. If we had a variable in our data
recorded in 100s (e.g., 21,500, 36,800, etc.), codebook would have reported the units as 100. If we had a variable that took on
only values divisible by 5 (5, 10, 15, etc.), codebook would have reported the units as 5.
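
One way to arrive at such units is to scale the values by a power of ten until they are all whole numbers, and then take their greatest common divisor. The Python sketch below does this with a small tolerance for floating-point storage error. It is an assumption about how units could be determined, not a description of codebook's actual method. The function name units, the tolerance, and the max_decimals limit are all hypothetical.

```python
from functools import reduce
from math import gcd


def units(values, max_decimals=6):
    """Return the largest step such that every value is a multiple of it.

    None marks a missing value (an assumption of this sketch). Values are
    treated as whole numbers if they are within a small relative tolerance
    of an integer, which absorbs float storage error.
    """
    vals = [v for v in values if v is not None]
    for d in range(max_decimals + 1):
        scale = 10 ** d
        scaled = [v * scale for v in vals]
        if all(abs(s - round(s)) < 1e-6 * max(1.0, abs(s)) for s in scaled):
            ints = [int(round(s)) for s in scaled]
            g = reduce(gcd, ints, 0)
            return g / scale if g else 0
    return None  # no common step found within max_decimals


print(units([2.2, 72.6, 35.7]))       # 0.1
print(units([21500, 36800, 48300]))   # 100.0
print(units([5, 10, 15]))             # 5.0
```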
codebook, without arguments, is most usefully combined with log to produce a printed listing for enclosure in a notebook
documenting the data. codebook is, however, also useful interactively, since you can specify one or a few variables:
. codebook tempjan, mv
tempjan ------------------------------------------- Average January temperature
                  type:  numeric (float)
                 range:  [2.2,72.6]                   units:  .1
         unique values:  310                  coded missing:  2 / 956
                  mean:  35.749
              std. dev:  14.1881
           percentiles:       10%       25%       50%       75%       90%
                             20.2      25.1      31.3      47.8      55.1
        missing values:  tempjuly==. <-> tempjan==.
crc14 Pairwise correlation coefficients
The already-existing correlate command calculates correlation coefficients using casewise deletion: when you request
correlations of variables x1, x2, ..., xk, any observation for which any of x1, x2, ..., xk is missing is not used. Thus, if x3 and
x4 have no missing values, but x2 is missing for half the data, the correlation between x3 and x4 is calculated using only the
half of the data for which x2 is not missing. Of course, you can obtain the correlation between x3 and x4 using all the data by
typing ‘correlate x3 x4’.
The new pwcorr command makes obtaining such pairwise correlation coefficients easier:
pwcorr [varlist] [weight] [if exp] [in range] [, obs sig print(#) star(#) bonferroni sidak ]
pwcorr calculates all the pairwise correlation coefficients between the variables in varlist or, if varlist is not specified, all the
variables in the data.
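
The difference between casewise and pairwise deletion can be seen in a small Python sketch. It illustrates the two deletion rules only and is not how correlate or pwcorr are implemented. The helper names pearson, casewise, and pairwise are hypothetical, None stands in for a missing value, and the data are made up so that x2 is missing for half the observations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def casewise(data, a, b):
    """Correlate a and b using only rows where every variable is present."""
    rows = [i for i in range(len(data[a]))
            if all(col[i] is not None for col in data.values())]
    r = pearson([data[a][i] for i in rows], [data[b][i] for i in rows])
    return r, len(rows)


def pairwise(data, a, b):
    """Correlate a and b using every row where both a and b are present."""
    rows = [i for i in range(len(data[a]))
            if data[a][i] is not None and data[b][i] is not None]
    r = pearson([data[a][i] for i in rows], [data[b][i] for i in rows])
    return r, len(rows)


# Made-up data: x3 and x4 are complete; x2 is missing for half the rows.
data = {
    "x2": [1.0, None, 2.0, None, 4.0, None, 3.0, None],
    "x3": [1, 2, 3, 4, 5, 6, 7, 8],
    "x4": [2, 1, 4, 3, 6, 5, 8, 7],
}

r_case, n_case = casewise(data, "x3", "x4")
r_pair, n_pair = pairwise(data, "x3", "x4")
print(n_case, round(r_case, 3))   # 4 observations used
print(n_pair, round(r_pair, 3))   # 8 observations used
```

In this made-up example, casewise deletion uses 4 observations and pairwise deletion uses all 8, and the two coefficients differ. The observation count returned alongside each coefficient is the quantity that the obs option, described below, reports.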
Options
obs adds a line to each row of the matrix reporting the number of observations used in calculating the correlation coefficient.
sig adds a line to each row of the matrix reporting the significance level of each correlation coefficient.