Computing optimal sampling designs for two-stage studies



Stata Technical Bulletin


sxd2


Marie Reilly, Epidemiology & Public Health, University College Cork, Ireland, [email protected]
Agus Salim, Department of Statistics, University College Cork, Ireland, [email protected]

Abstract: Commands are given for determining optimal sampling designs subject to fixed sample size, fixed budget, or fixed
precision, and each command is illustrated by an example.

Keywords: two-stage studies, mean score method.

Background

The commands supplied here apply to two-stage studies in which a dichotomous outcome variable y and one or more categorical
covariates z are available for all study subjects at the first stage, while at the second stage a subset of the study subjects
has one or more additional covariates x measured. The second-stage covariates may be continuous. The second-stage subjects can
be a stratified random sample, where the strata are defined by the levels of y and z. The mean score algorithm (Reilly and Pepe
1995) allows us to analyze the data from such a two-stage study, incorporating all first- and second-stage observations.

The variance expression for the mean score estimate given by Reilly and Pepe (1995) shows that the variance depends on (1)
the total number of observations and (2) the second-stage sampling fractions in each of the strata defined by the levels
of the response y and the first-stage covariates z. Thus it is possible to minimize the variance of the estimated coefficient
of a particular variable by optimally choosing the number of observations and/or the second-stage sampling fractions.

Syntax

optfixn depvar [indepvars] [if exp] [in range] [, first(varlist) n1(vecname) n2(#) var(#) coding(#) ]

optbud depvar [indepvars] [if exp] [in range] [, first(varlist) prev(vecname) b(#) c1(#) c2(#)
var(#) coding(#) ]

optprec depvar [indepvars] [if exp] [in range] [, first(varlist) prev(vecname) prec(#) c1(#) c2(#)
var(#) coding(#) ]

coding depvar [first-stage-vars]

Description

We provide optimal sampling designs for three different scenarios. Each of these commands requires as input some pilot
data, with each stratum represented by more than two observations. The pilot data are treated as a stratified random sample
and are handled by the mean score algorithm, called in the background, which uses the first-stage sample sizes or prevalences
(also supplied by the user) to weight the analysis correctly.

The optfixn command calculates the optimal sampling fractions at the second stage for the situation where first-stage
observations are already available and the total second-stage sample size has been decided. Such studies might arise in medical
research where a database of demographic particulars on study subjects is available and expensive data (such as laboratory or
radiology measurements) are to be collected for a subsample. Before running the optfixn command, we strongly advise running
the coding command to see the order in which the vector of first-stage sample sizes for the various strata must be supplied.
coding creates a variable called grp_yz that identifies the groups formed by the various levels of the response variable y and
the first-stage covariates z. In the call to optfixn, the first-stage sample sizes must be supplied in the same order as grp_yz;
that is, the first element of the vector is the first-stage sample size for grp_yz = 1, the second element is for grp_yz = 2,
and so on.
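
The fixed-sample-size setting can be made concrete with a small sketch. It is written in Python purely for illustration and
is not part of optfixn: the function name, the stratum sizes, and the sampling fractions are all hypothetical, and the
fractions are arbitrary rather than optimal. It only shows how a vector of first-stage sizes, listed in grp_yz order,
combines with per-stratum sampling fractions to give the total second-stage sample size that optfixn holds fixed.

```python
def second_stage_sizes(n1, fractions):
    """Second-stage sample size per stratum, given first-stage sizes and sampling fractions.

    Both sequences must follow the same stratum order (grp_yz = 1, 2, ...).
    """
    if len(n1) != len(fractions):
        raise ValueError("need one first-stage size and one fraction per stratum")
    if any(not 0.0 <= f <= 1.0 for f in fractions):
        raise ValueError("sampling fractions must lie between 0 and 1")
    return [size * f for size, f in zip(n1, fractions)]


# Hypothetical setting: binary y and binary z give four strata.
n1 = [400, 100, 300, 200]             # first-stage sizes, in grp_yz order
fractions = [0.125, 0.5, 0.25, 0.25]  # arbitrary illustrative fractions, not optimal

per_stratum = second_stage_sizes(n1, fractions)
n2 = sum(per_stratum)                 # the total that the n2(#) option would fix
print(per_stratum, n2)
```

Any candidate design for this scenario must reproduce the chosen n2; the command's task is to pick, among such designs, the
fractions that minimize the variance.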

The optbud command calculates the total number of study observations and the second-stage sampling fractions that will
maximize precision subject to an available budget. The user must also supply the unit cost of observations at the first and second
stage. This command is applicable to the situation where a study is being planned, but the total study size has not yet been
decided. Instead of first-stage sample sizes, this command expects a vector of prevalences (or estimated prevalences) for the
various strata. Again, we advise running coding first so that these prevalences are provided in the correct order.
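
The budget constraint in this scenario can be sketched as simple arithmetic. The Python below is an illustration under
stated assumptions, not the optbud implementation: the function names and all numbers are hypothetical, and the expected
cost is taken to be the first-stage unit cost for every subject plus the second-stage unit cost for the expected share of
subjects sampled at the second stage. The sampling fractions are arbitrary here, whereas optbud chooses them to maximize
precision.

```python
def cost_per_subject(prevalences, fractions, c1, c2):
    """Expected cost per first-stage subject (assumed cost model, see lead-in).

    prevalences -- stratum prevalences in grp_yz order, summing to 1
    fractions   -- second-stage sampling fractions, same order
    c1, c2      -- unit costs of a first-stage and a second-stage observation
    """
    if len(prevalences) != len(fractions):
        raise ValueError("need one prevalence and one fraction per stratum")
    if abs(sum(prevalences) - 1.0) > 1e-9:
        raise ValueError("prevalences must sum to 1")
    # Expected proportion of all subjects who are sampled at the second stage.
    sampled_share = sum(p * f for p, f in zip(prevalences, fractions))
    return c1 + c2 * sampled_share


def affordable_n(budget, prevalences, fractions, c1, c2):
    """Total study size that exactly exhausts the budget for the given fractions."""
    return budget / cost_per_subject(prevalences, fractions, c1, c2)


# Hypothetical inputs: four strata, second-stage data eight times as costly.
prevalences = [0.5, 0.25, 0.125, 0.125]
fractions = [0.125, 0.25, 0.5, 1.0]
unit_cost = cost_per_subject(prevalences, fractions, c1=1.0, c2=8.0)
n = affordable_n(7000.0, prevalences, fractions, c1=1.0, c2=8.0)
print(unit_cost, n)
```

The sketch makes the trade-off visible: raising any sampling fraction increases the cost per subject and so reduces the
total number of subjects the budget can support.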

The optprec command applies to the same scenario as optbud, where the total sample size is not yet decided. The
objective in this case is to calculate the total number of study observations and the second-stage sampling fractions that will
achieve a specified precision at minimum cost. As with the optbud command, optprec expects a vector of prevalences (or
estimated prevalences) for the various strata, and it is advisable to run the coding command first to see the order in which these
values should be supplied.


