38
Stata Technical Bulletin
STB-58
Notice that each of the optfixn, optbud, and optprec commands has a coding option, which can be used if one is
sure of the order in which the vector of first-stage sample sizes or prevalences should be entered. With this option, coding
is called automatically from inside the optimal sampling command. Because this creates variables named
grp_yz and grp_z, an error message is generated if variables with these names already exist.
Options
first(varlist) specifies the first-stage covariates.
n1(vecname) specifies the vector of first-stage sample sizes for each stratum.
prev(vecname) specifies the vector of prevalences for each stratum.
n2(#) specifies the second-stage sample sizes (used only with optfixn).
b(#) specifies the available budget (used only with optbud).
c1(#) specifies the cost per observation at the first stage (used with optbud and optprec).
c2(#) specifies the cost per observation at the second stage (used with optbud and optprec).
var(#) specifies the position in the logistic regression model of the covariate whose variance is to be minimized (that is,
optimized). For example, in the simple model Y = b_0 + b_1 X_1 + b_2 X_2, if we want to minimize the variance of the
coefficient of X_1, then var = 2.
prec(#) specifies the desired precision, that is, the variance (used only with optprec).
coding(#) is a logical flag; the default of 0 (that is, false) means that prior to calling optfixn, optbud, or optprec one must
have run the coding command.
Example 1
The following example is from CASS (Coronary Artery Surgery Study) and appears in Reilly (1996). This study collected
data on the operative mortality and various risk factors for 8,096 subjects. Let us suppose that at the first stage we have only
mortality status Y and sex Z as specified in the table below, and that it has been agreed to record the age for a subsample of
1,000 subjects in order to estimate the sex-adjusted odds ratio for age. The example is fictitious, as we do have all the covariates
on all subjects, but for illustrative purposes we ignore this information (that is, set values to missing). To compute
optimal sample sizes, we require pilot data in all of the strata of the table, so we “sampled” 25 randomly selected
observations from each stratum (that is, reset their missing values to the actual age values). The resulting dataset of
100 observations is available as pilotcas, which accompanies this insert.
                     male     female
   Y                 Z = 0    Z = 1
   alive     Y = 0   6,666    1,228
   deceased  Y = 1     144       58
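The prev() option takes stratum prevalences rather than counts. As a rough illustration only, the Python sketch below converts the counts in the table into proportions. It assumes that a prevalence is simply each stratum's share of the 8,096 first-stage subjects, which is an interpretation and is not stated in the text. The dictionary layout and variable names are hypothetical and are not part of the Stata commands.

```python
# First-stage counts from the CASS table, keyed by (mort, sex).
counts = {
    (0, 0): 6666,  # alive, male
    (0, 1): 1228,  # alive, female
    (1, 0): 144,   # deceased, male
    (1, 1): 58,    # deceased, female
}

# Total number of first-stage subjects; this matches the 8,096 quoted in the text.
total = sum(counts.values())

# ASSUMPTION: prevalence = stratum count / total first-stage sample size.
prevalences = {cell: n / total for cell, n in counts.items()}

for (mort, sex), p in sorted(prevalences.items()):
    print(f"mort={mort} sex={sex} prevalence={p:.4f}")
```

The four proportions sum to one, which is a quick check that no stratum has been omitted.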
We start by computing the optimal allocation for a second-stage sample of 1,000.
. use pilotcas
. coding mort sex
   grp_yz |   mort |    sex |  grp_z |   nobs |
        1 |      0 |      0 |      1 |     25 |
        2 |      0 |      1 |      2 |     25 |
        3 |      1 |      0 |      1 |     25 |
        4 |      1 |      1 |      2 |     25 |
for functions requiring first stage sample sizes/prevalences
enter these in the order of grp_yz
The coding function tells us that we have to enter the vector of first-stage sample sizes in the order specified in the following
table.
   First element    grp_yz = 1   first-stage sample sizes for living (mort = 0) males (sex = 0)
   Second element   grp_yz = 2   first-stage sample sizes for living females
   Third element    grp_yz = 3   first-stage sample sizes for deceased (mort = 1) males
   Fourth element   grp_yz = 4   first-stage sample sizes for deceased females
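The ordering rule above can be sketched outside Stata. The Python fragment below is a minimal illustration, not the implementation of the coding command: it assumes that grp_yz numbers the (mort, sex) cells in ascending order and that grp_z numbers the levels of sex, which reproduces the output shown for this example. The function name stratum_codes and the variable names are hypothetical.

```python
def stratum_codes(cells):
    """Map each (mort, sex) cell to a (grp_yz, grp_z) pair.

    ASSUMPTION: cells are numbered in ascending (mort, sex) order,
    mimicking the coding output shown above.
    """
    ordered = sorted(set(cells))
    z_levels = sorted({z for _, z in ordered})
    return {
        cell: (i + 1, z_levels.index(cell[1]) + 1)
        for i, cell in enumerate(ordered)
    }

# First-stage counts from the CASS table, deliberately listed out of order.
first_stage = {(1, 1): 58, (0, 0): 6666, (1, 0): 144, (0, 1): 1228}

codes = stratum_codes(first_stage)

# Arrange the counts in the order of grp_yz, as the coding message requests.
n1 = [first_stage[cell] for cell in sorted(first_stage, key=lambda c: codes[c][0])]
print(n1)  # [6666, 1228, 144, 58]
```

Under these assumptions, the vector of first-stage sample sizes for this example is entered as 6666, 1228, 144, 58.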