Stata Technical Bulletin
We enter the vector of first-stage sample sizes as follows (note that the transpose operator ' is essential, since it turns the row of counts into a column vector).
. matrix fstsamp=(6666, 1228, 144, 58)'
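The same column vector can also be built without the transpose by joining rows with the backslash operator, which may make the intended shape easier to see. The following is a minimal sketch in do-file form (no dot prompt). It assumes only that optfixn expects the first-stage counts as a 4 x 1 matrix, as the note on the transpose operator implies, and it uses built-in matrix commands to confirm the dimensions.

```stata
* Build the first-stage sample sizes as a column vector (rows joined with \)
matrix fstsamp = (6666 \ 1228 \ 144 \ 58)

* Confirm the shape before passing it to optfixn: expect 4 rows and 1 column
matrix list fstsamp
display rowsof(fstsamp) " x " colsof(fstsamp)
```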
Assuming the objective is to fit the logistic regression model
$$ \mathrm{logit}_i = \beta_0 + \beta_1\,\mathrm{sex}_i + \beta_2\,\mathrm{age}_i $$
and to minimize the variance of the age coefficient, we can obtain the optimal second-stage sample sizes as follows.
. optfixn mort sex age, first(sex) n1(fstsamp) n2(1000) var(3)
the second stage sample sizes
  group(mor |      Freq.
------------+-----------
          1 |         25
          2 |         25
          3 |         25
          4 |         25
please check the sample sizes !
  grp_yz |  mort |  sex |  grp_z |    n1 |  n2_pilot
       1 |     0 |    0 |      1 |  6666 |        25
       2 |     0 |    1 |      2 |  1228 |        25
       3 |     1 |    0 |      1 |   144 |        25
       4 |     1 |    1 |      2 |    58 |        25
the optimal sampling fraction(sample size) for grp_yz 1 = .089 (596)
the optimal sampling fraction(sample size) for grp_yz 2 = .164 (202)
the optimal sampling fraction(sample size) for grp_yz 3 = 1 (144)
the optimal sampling fraction(sample size) for grp_yz 4 = 1 (58)
the minimum variance for age : .00008027
Total second stage sample size = 1000
Note that these results tell us that, to minimize the variance of the age coefficient, we need to sample all the available cases
(144 and 58 subjects in the two case groups) but only 8.9% of the controls in stratum 1 and 16.4% of the controls in stratum 2.
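The reported fractions can be checked by hand: each one is the optimal second-stage sample size divided by the corresponding first-stage group size, and the four second-stage sizes should add up to the total requested in n2(). A short sketch of that check, written in do-file form and using only display arithmetic on the numbers printed above:

```stata
* Sampling fraction = second-stage size / first-stage size, per grp_yz group
* grp_yz 1 (controls, stratum 1): about .089
display 596/6666
* grp_yz 2 (controls, stratum 2): about .164
display 202/1228
* grp_yz 3 and 4 (cases): every available subject is sampled
display 144/144
display 58/58

* The second-stage sizes should sum to the total requested in n2()
display 596 + 202 + 144 + 58
```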
Example 2
This second example also uses the CASS data. Let us suppose that we wish to set up a two-stage study where at the first stage
we will collect only the patient’s operative mortality, sex, and weight, while at the second stage we will collect the following
variables only for a subset of the study subjects.
1. Age of patients when they underwent bypass surgery.
2. The angina status of the patients when they underwent bypass surgery.
3. CHF score, that is, congestive heart failure score.
4. LVEDBP, that is, left ventricular end diastolic blood pressure.
5. Urgency of the surgery (1 for urgent, 0 for nonurgent).
Let us suppose that we have a budget of £10,000 available, that the cost of collecting data on one subject is £2 at the first
stage and £15 at the second stage, and that we would like to minimize the variance of the LVEDBP coefficient in the logistic regression model
$$ \mathrm{logit}_i = \beta_0 + \beta_1\,\mathrm{sex}_i + \beta_2\,\mathrm{weight}_i + \beta_3\,\mathrm{age}_i + \beta_4\,\mathrm{angina}_i + \beta_5\,\mathrm{chf}_i + \beta_6\,\mathrm{lvedbp}_i + \beta_7\,\mathrm{surgery}_i $$
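To make the budget constraint concrete, here is a sketch under the assumption that the costs are additive: every first-stage subject costs £2, and every subject who is also sampled at the second stage costs a further £15. The symbols $n_1$ (first-stage sample size) and $n_2$ (total second-stage sample size), and the value $n_2 = 400$, are illustrative only and are not taken from the example.

$$ 2\,n_1 + 15\,n_2 \le 10000 $$
$$ n_2 = 400 \;\Rightarrow\; 2\,n_1 \le 10000 - 15 \times 400 = 4000 \;\Rightarrow\; n_1 \le 2000 $$

Under this assumption one second-stage observation costs as much as 7.5 first-stage observations, so the design has to balance the size of the first-stage sample against the number of subjects followed up at the second stage.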
As before, we need to sample a few pilot second-stage observations from each stratum defined by the different levels of mortality
(Y) and the first-stage covariates (sex and weight). Since first-stage covariates must be categorical, we first created a three-category
weight variable wtcat as 1 for weight < 60, 2 for 60 ≤ weight < 70, and 3 for weight ≥ 70. The first-stage sample sizes