Stata Technical Bulletin
Consider “inverting” the model implied by the tests in the example above. In other words, instead of explaining the mean
fuel efficiency by noting whether a car is domestic or foreign, consider predicting whether a car is an import by measuring its
fuel efficiency. Since the dependent variable is qualitative (domestic/import), a logistic model is a natural framework for this
prediction exercise. It turns out that U/mn = 1 - ROC, where ROC is the area under the ROC curve for the logistic model. Thus:
. logistic foreign mpg
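Logit Estimates                                         Number of obs =      74
                                                        chi2(1)       =   11.49
                                                        Prob > chi2   =  0.0007
Log Likelihood = -39.28864                              Pseudo R2     =  0.1276

------------------------------------------------------------------------------
 foreign | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
     mpg |   1.173232   .0616972      3.038   0.002      1.058331    1.300608
------------------------------------------------------------------------------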
. lroc, nograph
Logistic estimates for foreign
Area under ROC curve = 0.7286
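The relation U/mn = 1 - ROC can be checked numerically. The following is a minimal sketch, not part of the original session: it assumes a current Stata release, in which sysuse loads the automobile data and ranksum and lroc leave the results r(sum_obs), r(N_1), r(N_2), and r(area) behind. It also assumes that U is the Mann-Whitney statistic for the first group (domestic cars), computed from that group's rank sum.

```stata
* Sketch only: assumes a current Stata release and its stored results
sysuse auto, clear

* Rank-sum test of mpg by origin; group 1 is domestic (foreign == 0)
ranksum mpg, by(foreign)

* Mann-Whitney U for group 1 from its rank sum, then U/mn
scalar U = r(sum_obs) - r(N_1)*(r(N_1) + 1)/2
scalar Umn = U/(r(N_1)*r(N_2))

* Area under the ROC curve from the logistic model
logistic foreign mpg
lroc, nograph

* The two displayed quantities should agree
display "U/mn = " Umn "    1 - ROC = " 1 - r(area)
```

With the area of 0.7286 reported above, 1 - ROC is 0.2714, so the scaled statistic U/mn should display as approximately that value.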
References
Bradley, E. L. 1985. Overlapping coefficient. In Encyclopedia of Statistical Sciences, ed. S. Kotz and N. L. Johnson, vol. 6, 546-547. New York:
Wiley.
Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions. 2d ed. New York: Wiley.
Gastwirth, J. L. 1975. Statistical measures of earnings differentials. The American Statistician 29: 32-35.
Inman, H. F. and E. L. Bradley, Jr. 1989. The overlapping coefficient as a measure of agreement between two probability distributions and point
estimation of the overlap of two normal densities. Communications in Statistics—Theory and Methodology 18: 3851-3874.
Moses, L. E., J. D. Emerson, and H. Hosseini. 1992. Analyzing data from ordered categories. In Medical Uses of Statistics, 2d ed., ed. J. C. Bailar III
and F. Mosteller, 259-279. Boston: NEJM Books.
sg28 Multiple comparisons of categories after regression-like methods
William H. Rogers, Stata Corporation, FAX 409-696-4601
In a typical experiment or survey setting, we compare the responses of two or more groups. If we estimate a parametric
model, the covariance matrix of the parameters supplies us with standard error estimates for any individual parameter or contrast.
The theory of hypothesis testing provides ways of using these estimated standard errors to calculate tests of hypotheses about the
responses of different groups. For example, we might test whether crop yield is affected by the application of various fertilizers.
These tests are well known and are provided by virtually every statistical package. Ambiguities arise, however, when we make
multiple comparisons; that is, when we test more than one hypothesis about a model.
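A rough calculation shows why multiple comparisons are troublesome. Assume, purely for illustration, that k hypotheses are each tested at level α = 0.05 and that the tests are independent. Pairwise comparisons drawn from a single model are generally not independent, so the figure below is only a guide. With five groups there are 10 pairwise comparisons, and

```latex
\[
k = \binom{5}{2} = 10, \qquad
\Pr(\text{at least one false rejection}) = 1 - (1 - \alpha)^{k}
  = 1 - 0.95^{10} \approx 0.40 .
\]
```

Without the independence assumption, the Bonferroni inequality still bounds this probability by kα = 0.50. Either way, the chance of at least one spurious "significant" difference is far above the nominal 5 percent.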
We can illustrate this problem using the automobile data set provided with Stata. This data set contains a variable, rep78,
that records the repair record in 1978 of each car. rep78 is coded as ‘1’ for cars with poor repair records, as ‘2’ for cars with
fair repair records, and so on up to ‘5’ for cars with excellent repair records. For the sake of the example, we treat rep78 as a
categorical variable rather than as an ordinal variable.
An interesting question is whether the price of a car depends on its repair record. One way to answer this question is to
estimate a regression for price where the repair record is an explanatory variable. Since the repair record is a categorical variable,
we cannot enter it directly as a regressor. Instead, we use the tabulate command to create indicator or dummy variables, one for
each level of rep78. All but one of these dummies are entered in the price regression. (The set of all five dummies is collinear
with the constant term in the regression; either the constant or one of the dummies must be dropped.) In this parameterization,
the coefficient on each dummy variable estimates the difference in the average price between the indicated level of rep78 and
the level corresponding to the omitted dummy variable.
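Written out, the parameterization looks as follows. This sketch assumes, for illustration only, that the dummy for level 1 is the one omitted; the text does not fix that choice. The symbol r_j stands for the indicator of level j of rep78.

```latex
\[
\text{price}_i = \beta_0 + \beta_2 r_{2i} + \beta_3 r_{3i}
               + \beta_4 r_{4i} + \beta_5 r_{5i} + \varepsilon_i ,
\]
\[
\beta_0 = E(\text{price} \mid \text{rep78} = 1), \qquad
\beta_j = E(\text{price} \mid \text{rep78} = j) - E(\text{price} \mid \text{rep78} = 1),
\quad j = 2, \ldots, 5 .
\]
```

Under this choice, a test of a single coefficient compares one level with level 1, while a comparison of two other levels, say 3 and 5, is a test of the contrast between their coefficients.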
. use \stata\auto
(1978 Automobile Data)
. tabulate rep78, generate(r)