Computing optimal sampling designs for two-stage studies

Stata Technical Bulletin

STB-58

Notice that each of the optfixn, optbud, and optprec commands has an option coding, which can be used if one is
sure of the order in which the vector of first-stage sample sizes or prevalences should be entered. With this option, coding
is called automatically from inside the optimal sampling command. Because this creates variables named
grp_yz and grp_z, an error message will be generated if variables with these names already exist.
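
The name clash described above can be avoided by checking for the two variables before running any of the commands. The following is a minimal sketch using only built-in Stata commands; the old_ prefix is a hypothetical choice made for this illustration and is not part of the commands described in this insert.

```stata
* Sketch: free up the names grp_yz and grp_z before the coding step creates them.
* The old_ prefix is an arbitrary, illustrative choice.
foreach v in grp_yz grp_z {
    * exact prevents a longer variable name from matching by abbreviation
    capture confirm variable `v', exact
    if _rc == 0 {
        display as text "`v' already exists; renaming it to old_`v'"
        rename `v' old_`v'
    }
}
```

If a variable with the old_ name is also present, rename will stop with an error, so a different prefix would be needed in that case.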

Options

first(varlist) specifies the first-stage covariates.

n1(vecname) specifies the vector of first-stage sample sizes for each stratum.

prev(vecname) specifies the vector of prevalences for each stratum.

n2(#) specifies the second-stage sample sizes (used only with optfixn).

b(#) specifies the available budget (used only with optbud).

c1(#) specifies the cost per observation at the first stage (used with optbud and optprec).

c2(#) specifies the cost per observation at the second stage (used with optbud and optprec).

var(#) specifies the position in the logistic regression model of the covariate whose coefficient variance is to be minimized
(that is, optimized). For example, in the simple model Y = b0 + b1 X1 + b2 X2, if we want to minimize the variance of the
coefficient of X1, then var = 2, since the constant term counts as the first position.

prec(#) specifies the desired precision, that is, the variance (used only with optprec).

coding(#) is a logical flag; the default of 0 (that is, false) means that prior to calling optfixn, optbud, or optprec one must
have run the coding command.

Example 1

The following example is from CASS (Coronary Artery Surgery Study) and appears in Reilly (1996). This study collected
data on the operative mortality and various risk factors for 8,096 subjects. Let us suppose that at the first stage we have only
mortality status Y and sex Z as specified in the table below, and that it has been agreed to record the age for a subsample of
1,000 subjects in order to estimate the sex-adjusted odds ratio for age. The example is fictitious as we do have all the covariates
on all subjects, but for illustrative purposes we ignore this information (that is, set values to missing). In order to compute
optimal sample sizes, we require pilot data in every stratum of the table, and so we “sampled” 25 randomly selected
observations from each stratum (that is, reset their missing values to the actual age values). The resulting dataset of
100 observations is available as pilotcas accompanying this insert.

                     male     female
            Y        Z = 0    Z = 1

alive       Y = 0    6,666    1,228
deceased    Y = 1      144       58

We start by computing the optimal allocation for a second-stage sample of 1,000.

. use pilotcas

. coding mort sex

grp_yz	mort	sex	grp_z	nobs
1	0	0	1	25
2	0	1	2	25
3	1	0	1	25
4	1	1	2	25

for functions requiring first stage sample sizes/prevalences
enter these in the order of grp_yz

The coding function tells us that we have to enter the vector of first-stage sample sizes in the order specified in the following
table.

First element     grp_yz = 1    first-stage sample sizes for living (mort = 0) males (sex = 0)
Second element    grp_yz = 2    first-stage sample sizes for living females
Third element     grp_yz = 3    first-stage sample sizes for deceased (mort = 1) males
Fourth element    grp_yz = 4    first-stage sample sizes for deceased females
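
As an illustration of this ordering, the first-stage counts from the CASS table can be stored in a Stata matrix in the order of grp_yz. This is a sketch only: the matrix names n1 and prev are chosen here for illustration, the use of a row vector is an assumption about what the commands accept, and the prevalences are assumed to be each stratum's share of the 8,096 first-stage subjects.

```stata
* First-stage counts from the CASS table, entered in the order of grp_yz:
* living males, living females, deceased males, deceased females.
* The matrix names n1 and prev are illustrative choices.
matrix n1 = (6666, 1228, 144, 58)

* Assumed definition of prevalence: stratum count divided by the
* total of 8,096 first-stage subjects.
matrix prev = n1 / 8096

* Display both vectors to confirm the ordering before passing them on.
matrix list n1
matrix list prev
```

Listing the vectors next to the output of coding makes it easy to confirm that each element lines up with the intended grp_yz value.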
