Stata Technical Bulletin

sxd2


Computing optimal sampling designs for two-stage studies

Marie Reilly, Epidemiology & Public Health, University College Cork, Ireland, [email protected]
Agus Salim, Department of Statistics, University College Cork, Ireland, [email protected]

Abstract: Commands are given for determining optimal sampling designs subject to fixed sample size, fixed budget, and fixed
precision, and each command is illustrated by an example.

Keywords: two-stage studies, mean score method.

Background

The commands supplied here apply to two-stage studies where a dichotomous outcome variable y and some categorical
covariate(s) z are available for all study subjects at the first stage, while at the second stage a subset of the study subjects
have some additional covariate(s) x measured. The second-stage covariates may be continuous. The second-stage subjects can
be a stratified random sample, where the strata are defined by the levels of y and z. The mean score algorithm (Reilly and Pepe
1995) allows us to analyze the data from such a two-stage study, incorporating all first- and second-stage observations.

The variance expression for the mean score estimate given by Reilly and Pepe (1995) shows that the variance depends on 1)
the total number of observations and 2) the second-stage sampling fractions in each of the strata defined by the different levels
of response y and first-stage covariates z. Thus it is possible to minimize the variance of the estimate for a particular variable
by optimally choosing the number of observations and/or the second-stage sampling fractions.
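The dependence on sample size and sampling fractions can be illustrated with a small Python sketch. It does not reproduce the Reilly and Pepe (1995) expression; it assumes a stylized variance of the form V = (a0 + sum_j pi_j * w_j / rho_j) / n, where pi_j is the prevalence of stratum j, rho_j its second-stage sampling fraction, and a0 and w_j are hypothetical positive constants that would in practice be estimated from pilot data. The function name and all numbers are made up for illustration.

```python
def stylized_variance(n, prevalences, fractions, weights, a0):
    """Variance under the ASSUMED form V = (a0 + sum_j pi_j * w_j / rho_j) / n."""
    if n <= 0 or any(not 0.0 < r <= 1.0 for r in fractions):
        raise ValueError("need n > 0 and sampling fractions in (0, 1]")
    penalty = sum(p * w / r for p, w, r in zip(prevalences, weights, fractions))
    return (a0 + penalty) / n


# Hypothetical inputs: two equally common strata with made-up weights.
prev = [0.5, 0.5]
w = [2.0, 2.0]

full = stylized_variance(1000, prev, [1.0, 1.0], w, a0=1.0)    # everyone sampled at stage two
half = stylized_variance(1000, prev, [0.5, 0.5], w, a0=1.0)    # half of each stratum sampled
bigger = stylized_variance(2000, prev, [1.0, 1.0], w, a0=1.0)  # twice as many subjects

print(full, half, bigger)
```

Under this assumed form, lowering the sampling fractions inflates the variance, and increasing the total number of observations shrinks it, which is the trade-off the commands exploit.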

Syntax

optfixn depvar [indepvars] [if exp] [in range] [, first(varlist) n1(vecname) n2(#) var(#) coding(#) ]

optbud depvar [indepvars] [if exp] [in range] [, first(varlist) prev(vecname) b(#) c1(#) c2(#)
var(#) coding(#) ]

optprec depvar [indepvars] [if exp] [in range] [, first(varlist) prev(vecname) prec(#) c1(#) c2(#)
var(#) coding(#) ]

coding depvar [first-stage-vars]

Description

We provide optimal sampling designs for three different scenarios. Each of these commands requires as input some pilot
data, with each stratum represented by more than two observations. Such a stratified random sample is correctly handled by the
mean score algorithm, called in the background, which uses the first-stage sample sizes or prevalences (also supplied by the
user) to correctly weight the analysis.

The optfixn command calculates the optimal sampling fractions at the second stage for the situation where first-stage
observations are already available and the total second-stage sample size has been decided. Such studies might arise in medical
research where a database of demographic particulars on study subjects is available and expensive data (such as laboratory or
radiology measurements) are to be collected for a subsample. Before running the optfixn command, we strongly advise running
the coding command to see the order in which the vector of first-stage sample sizes for the various strata must be supplied.
coding creates a variable called grp_yz that identifies the groups formed by the various levels of response variable y and
first-stage covariates z. In the call to optfixn, the first-stage sample sizes must be supplied in the same order as grp_yz, that
is, the first element of the vector is the first-stage sample size for grp_yz = 1, the second element is for grp_yz = 2, and so on.
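The fixed-sample-size problem can be sketched in Python under a simplifying assumption. This is not the computation performed by optfixn; it assumes the part of the variance that depends on the design is proportional to sum_j n1_j * w_j / rho_j, where n1_j is the first-stage size of stratum j (in grp_yz order), rho_j is its second-stage sampling fraction, and w_j is a hypothetical positive weight that would come from pilot data. Minimizing that sum subject to sum_j n1_j * rho_j = n2 gives fractions proportional to sqrt(w_j), with any fraction that would exceed 1 capped at 1 and the remainder reallocated.

```python
import math


def fractions_fixed_n2(n1, w, n2):
    """Sampling fractions minimizing sum_j n1[j]*w[j]/rho[j] (an ASSUMED
    variance form) subject to sum_j n1[j]*rho[j] == n2 and rho[j] <= 1."""
    if not 0 < n2 <= sum(n1):
        raise ValueError("n2 must be positive and at most the first-stage total")
    if any(x <= 0 for x in n1) or any(x <= 0 for x in w):
        raise ValueError("stratum sizes and weights must be positive")
    rho = [1.0] * len(n1)
    free = set(range(len(n1)))   # strata not yet capped at full sampling
    remaining = float(n2)        # second-stage subjects still to allocate
    while free:
        k = remaining / sum(n1[j] * math.sqrt(w[j]) for j in free)
        over = [j for j in free if k * math.sqrt(w[j]) > 1.0 + 1e-12]
        if not over:
            for j in free:
                rho[j] = min(1.0, k * math.sqrt(w[j]))
            break
        for j in over:           # sample these strata completely
            remaining -= n1[j]
            free.remove(j)
    return rho


# Hypothetical first-stage sizes and weights for four strata (grp_yz = 1..4).
n1 = [100, 200, 300, 400]
w = [4.0, 1.0, 1.0, 0.25]
rho_small = fractions_fixed_n2(n1, w, 200)   # no stratum needs capping
rho_large = fractions_fixed_n2(n1, w, 600)   # first stratum is sampled in full
print(rho_small, rho_large)
```

The order of n1 and w in the sketch plays the same role as the order of the vector passed to optfixn: element j must refer to the stratum with grp_yz = j.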

The optbud command calculates the total number of study observations and the second-stage sampling fractions that will
maximize precision subject to an available budget. The user must also supply the unit cost of observations at the first and second
stage. This command is applicable to the situation where a study is being planned, but the total study size has not yet been
decided. Instead of first-stage sample sizes, this command expects a vector of prevalences (or estimated prevalences) for the
various strata. Again, we advise running coding first so that these prevalences are provided in the correct order.
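A minimal Python sketch of the fixed-budget trade-off follows. It is not the algorithm used by optbud; it assumes a stylized variance V = (a0 + sum_j pi_j * w_j / rho_j) / n and a cost model in which each first-stage observation costs c1 and each second-stage observation costs c2, so that the budget is n * (c1 + c2 * sum_j pi_j * rho_j). The constants a0 and w_j are hypothetical and would be estimated from pilot data. Under these assumptions, and as long as no fraction exceeds 1, the variance is minimized at rho_j = sqrt(c1 * w_j / (c2 * a0)).

```python
import math


def design_fixed_budget(prev, w, a0, budget, c1, c2):
    """Return (n, fractions, variance) for the ASSUMED variance and cost model.
    Handles only the case where every optimal fraction is at most 1."""
    rho = [math.sqrt(c1 * wj / (c2 * a0)) for wj in w]
    if any(r > 1.0 for r in rho):
        raise ValueError("a fraction exceeds 1; the closed form does not apply")
    cost_per_subject = c1 + c2 * sum(p * r for p, r in zip(prev, rho))
    n = budget / cost_per_subject            # total first-stage sample size
    variance = (a0 + sum(p * wj / r for p, wj, r in zip(prev, w, rho))) / n
    return n, rho, variance


# Hypothetical prevalences (in grp_yz order), weights, and unit costs.
prev = [0.25, 0.25, 0.25, 0.25]
w = [0.04, 0.09, 0.16, 0.25]
n, rho, variance = design_fixed_budget(prev, w, a0=1.0, budget=1700.0, c1=1.0, c2=4.0)
print(round(n), [round(r, 3) for r in rho], variance)

# For comparison: the same budget spent with a flat 20% sampling fraction.
flat = [0.2] * 4
n_flat = 1700.0 / (1.0 + 4.0 * sum(p * r for p, r in zip(prev, flat)))
var_flat = (1.0 + sum(p * wj / r for p, wj, r in zip(prev, w, flat))) / n_flat
```

In this made-up example, the flat design spends the same budget but yields a larger variance, because it oversamples strata with small weights and undersamples those with large ones.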

The optprec command applies to the same scenario as optbud, where the total sample size is not yet decided. The
objective in this case is to calculate the total number of study observations and the second-stage sampling fractions that will
achieve a specified precision at minimum cost. As with the optbud command, optprec expects a vector of prevalences (or
estimated prevalences) for the various strata, and it is advisable to run the coding command first to see the order in which these
values should be supplied.
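The fixed-precision problem can be sketched in the same hedged way. The Python below is not the algorithm used by optprec; it assumes a stylized variance V = (a0 + sum_j pi_j * w_j / rho_j) / n and a total cost of n * (c1 + c2 * sum_j pi_j * rho_j), with hypothetical constants a0 and w_j. Under these assumptions the cost-minimizing fractions are rho_j = sqrt(c1 * w_j / (c2 * a0)) when none exceeds 1, and n is then the smallest value that reaches the target variance.

```python
import math


def design_fixed_precision(prev, w, a0, target_variance, c1, c2):
    """Return (n, fractions, cost) for the ASSUMED variance and cost model.
    Handles only the case where every optimal fraction is at most 1."""
    if target_variance <= 0:
        raise ValueError("target variance must be positive")
    rho = [math.sqrt(c1 * wj / (c2 * a0)) for wj in w]
    if any(r > 1.0 for r in rho):
        raise ValueError("a fraction exceeds 1; the closed form does not apply")
    # n solves (a0 + sum_j pi_j * w_j / rho_j) / n == target_variance
    n = (a0 + sum(p * wj / r for p, wj, r in zip(prev, w, rho))) / target_variance
    cost = n * (c1 + c2 * sum(p * r for p, r in zip(prev, rho)))
    return n, rho, cost


# Hypothetical prevalences (in grp_yz order), weights, and unit costs.
prev = [0.25, 0.25, 0.25, 0.25]
w = [0.04, 0.09, 0.16, 0.25]
n, rho, cost = design_fixed_precision(prev, w, a0=1.0, target_variance=0.0017,
                                      c1=1.0, c2=4.0)
print(round(n), [round(r, 3) for r in rho], round(cost, 2))
```

In practice n would be rounded up to a whole number of subjects, which slightly exceeds the target precision.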


