Stata Technical Bulletin
37
sxd2
Computing optimal sampling designs for two-stage studies
Marie Reilly, Epidemiology & Public Health, University College Cork, Ireland, [email protected]
Agus Salim, Department of Statistics, University College Cork, Ireland, [email protected]
Abstract: Commands are given for determining optimal sampling designs subject to fixed sample size, fixed budget, and fixed
precision, and each command is illustrated by an example.
Keywords: two-stage studies, mean score method.
Background
The commands supplied here apply to two-stage studies in which a dichotomous outcome variable y and some categorical
covariate(s) z are available for all study subjects at the first stage, while at the second stage a subset of the study subjects
have some additional covariate(s) x measured. The second-stage covariates may be continuous. The second-stage subjects can
be a stratified random sample, where the strata are defined by the levels of y and z. The mean score algorithm (Reilly and Pepe
1995) allows us to analyze the data from such a two-stage study, incorporating all first- and second-stage observations.
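The second-stage design described above can be illustrated with a short sketch. The Python code below is not part of the supplied Stata commands; it only shows, for invented data and invented sampling fractions, how a stratified random subsample might be drawn within the strata formed by y and z. The function name, the data layout, and the fractions are all hypothetical.

```python
import random
from collections import defaultdict


def draw_second_stage(first_stage, fractions, seed=0):
    """Draw a stratified random subsample of subject ids.

    first_stage: list of (subject_id, y, z) tuples (hypothetical layout).
    fractions:   dict mapping each (y, z) stratum to its sampling fraction.
    """
    rng = random.Random(seed)

    # Group the first-stage subjects into strata defined by (y, z).
    strata = defaultdict(list)
    for subject_id, y, z in first_stage:
        strata[(y, z)].append(subject_id)

    # Sample without replacement inside each stratum.
    sample = {}
    for key in sorted(strata):
        ids = strata[key]
        k = round(fractions[key] * len(ids))
        sample[key] = sorted(rng.sample(ids, k))
    return sample


# Illustrative first-stage data: 100 subjects in four (y, z) strata.
sizes = {(0, 0): 40, (0, 1): 30, (1, 0): 20, (1, 1): 10}
first_stage = []
for (y, z), n in sizes.items():
    for _ in range(n):
        first_stage.append((len(first_stage), y, z))

# Assumed fractions: the smallest stratum (1, 1) is sampled completely.
fractions = {(0, 0): 0.25, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 1.0}
second_stage = draw_second_stage(first_stage, fractions)
for key in sorted(second_stage):
    print(key, len(second_stage[key]))
```

Only the subjects selected here would have x measured; the remaining first-stage subjects contribute y and z alone.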
The variance expression of the mean score estimate given by Reilly and Pepe (1995) shows that the variance depends on 1)
the total number of observations and 2) the second-stage sampling fractions in each of the strata defined by the different levels
of response y and first-stage covariates z. Thus it is possible to minimize the variance of the estimate for a particular variable
by optimally choosing the number of observations and/or the second-stage sampling fractions.
Syntax
optfixn depvar [indepvars] [if exp] [in range] [, first(varlist) n1(vecname) n2(#) var(#) coding(#) ]
optbud depvar [indepvars] [if exp] [in range] [, first(varlist) prev(vecname) b(#) c1(#) c2(#)
var(#) coding(#) ]
optprec depvar [indepvars] [if exp] [in range] [, first(varlist) prev(vecname) prec(#) c1(#) c2(#)
var(#) coding(#) ]
coding depvar [first-stage-vars]
Description
We provide optimal sampling designs for three different scenarios. Each of these commands requires as input some pilot
data, with each stratum represented by more than two observations. Such a stratified random sample is correctly handled by the
mean score algorithm, called in the background, which uses the first-stage sample sizes or prevalences (also supplied by the
user) to correctly weight the analysis.
The optfixn command calculates the optimal sampling fractions at the second stage for the situation where first-stage
observations are already available and the total second-stage sample size has been decided. Such studies might arise in medical
research where a database of demographic particulars on study subjects is available and expensive data (such as laboratory or
radiology measurements) are to be collected for a subsample. Before running the optfixn command, we strongly advise running
the coding command to see the order in which the vector of first-stage sample sizes for the various strata must be supplied.
coding creates a variable called grp_yz that identifies the groups formed by the various levels of response variable y and
first-stage covariates z. In the call to optfixn, the first-stage sample sizes must be supplied in the same order as grp_yz, that
is, the first element of the vector is the first-stage sample size for grp_yz = 1, the second element is for grp_yz = 2, and so on.
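To make the ordering requirement concrete, the following Python sketch tabulates first-stage sample sizes in the order of a group index and converts per-stratum sampling fractions into second-stage counts. It is an illustration only and does not reproduce the coding or optfixn commands. The group index and the fractions are invented; in practice the index would be the grp_yz variable created by coding, and the fractions would come from optfixn.

```python
from collections import Counter


def sizes_in_group_order(group_index):
    """Return first-stage counts ordered by group index 1, 2, 3, ..."""
    counts = Counter(group_index)
    return [counts.get(g, 0) for g in range(1, max(counts) + 1)]


def second_stage_counts(n1, fractions):
    """Second-stage count per stratum: fraction times first-stage size."""
    return [round(f * n) for f, n in zip(fractions, n1)]


# Hypothetical group index for 100 first-stage subjects (stands in for grp_yz).
grp_yz = [1] * 40 + [2] * 30 + [3] * 20 + [4] * 10

n1 = sizes_in_group_order(grp_yz)      # element j-1 belongs to group j
fractions = [0.25, 0.5, 0.5, 1.0]      # assumed values, listed in the same order
n2_by_group = second_stage_counts(n1, fractions)

print(n1)
print(n2_by_group, sum(n2_by_group))
```

If the vector were supplied in a different order from the group index, each sampling fraction would be paired with the wrong stratum size, which is why checking the order first matters.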
The optbud command calculates the total number of study observations and the second-stage sampling fractions that will
maximize precision subject to an available budget. The user must also supply the unit cost of observations at the first and second
stage. This command is applicable to the situation where a study is being planned, but the total study size has not yet been
decided. Instead of first-stage sample sizes, this command expects a vector of prevalences (or estimated prevalences) for the
various strata. Again, we advise running coding first so that these prevalences are provided in the correct order.
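The budget constraint can be made concrete with a small calculation. The Python sketch below assumes a simple linear cost model, in which every first-stage observation costs c1 and every second-stage observation costs c2; this model, the function name, and all numbers are assumptions made for illustration. It computes the study size that exhausts a budget for one given set of sampling fractions. It does not search for the optimal fractions, which is what optbud does.

```python
def study_size_for_budget(budget, c1, c2, prevalences, fractions):
    """First-stage size n that exhausts the budget under a linear cost model.

    Assumed cost model: budget = n * c1 + n2 * c2, where the expected
    second-stage size is n2 = n * sum(prevalence_j * fraction_j).
    """
    sampled_share = sum(p * f for p, f in zip(prevalences, fractions))
    n = budget / (c1 + c2 * sampled_share)
    return n, n * sampled_share


# Invented stratum prevalences and sampling fractions, in the same order.
prevalences = [0.4, 0.3, 0.2, 0.1]
fractions = [0.25, 0.5, 0.5, 1.0]

n, n2 = study_size_for_budget(
    budget=11000, c1=1, c2=10, prevalences=prevalences, fractions=fractions
)
print(round(n), round(n2))
```

With these numbers, 45% of first-stage subjects are expected to enter the second stage, so each first-stage subject costs 1 + 10 x 0.45 = 5.5 units on average.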
The optprec command applies to the same scenario as optbud, where the total sample size is not yet decided. The
objective in this case is to calculate the total number of study observations and the second-stage sampling fractions that will
achieve a specified precision at minimum cost. As with the optbud command, optprec expects a vector of prevalences (or
estimated prevalences) for the various strata, and it is advisable to run the coding command first to see the order in which these
values should be supplied.
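The trade-off behind a fixed-precision design can be sketched in the same way. The Python code below assumes that, for fixed sampling fractions, the variance of the estimate is a design-specific factor divided by the total number of observations, that the precision target is expressed as a variance, and that costs are linear in the two sample sizes. These are simplifying assumptions, and the variance factors are invented; they are not output from optprec.

```python
def cost_for_precision(target_variance, variance_factor, c1, c2,
                       prevalences, fractions):
    """Study size and cost needed to reach a target variance.

    Assumptions: variance = variance_factor / n for fixed fractions, and
    cost = n * c1 + n2 * c2 with n2 = n * sum(prevalence_j * fraction_j).
    """
    n = variance_factor / target_variance
    sampled_share = sum(p * f for p, f in zip(prevalences, fractions))
    return n, n * (c1 + c2 * sampled_share)


prevalences = [0.4, 0.3, 0.2, 0.1]

# Two hypothetical designs: (sampling fractions, invented variance factor).
designs = {
    "partial second stage": ([0.25, 0.5, 0.5, 1.0], 40.0),
    "full second stage": ([1.0, 1.0, 1.0, 1.0], 25.0),
}

results = {}
for name, (fractions, factor) in designs.items():
    results[name] = cost_for_precision(0.02, factor, 1, 10, prevalences, fractions)
    n, cost = results[name]
    print(name, round(n), round(cost))
```

In this invented example the partial design needs more first-stage subjects but costs less overall, which is the kind of comparison a minimum-cost design has to resolve.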