Stata Technical Bulletin
Options
by(groupvar) is not optional. It specifies the name of the grouping variable. This variable must have exactly two possible values.
The lower value indicates Group A, and the higher value indicates Group B.
centile(numlist) specifies a list of percentile differences to be reported and defaults to centile(50) (median only). Specifying
centile(25 50 75) will produce the 25th, 50th, and 75th percentile differences.
level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level.
eform specifies that exponentiated percentile differences are to be given. This option is used if depvar is the log of a positive-
valued variable. In this case, confidence intervals are calculated for percentile ratios between values of the original positive
variable, instead of for percentile differences.
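The reason exponentiation gives ratios is that a difference of logs is the log of a ratio, and exp() preserves order. A minimal Python sketch with made-up data follows; the values and variable names are illustrative assumptions, and the plain sample median of pairwise differences stands in for the estimate that cendif would report.

```python
import math
from statistics import median

# Hypothetical positive-valued outcomes for the two groups (illustrative only).
# Three values per group gives 9 pairs, so the median is a single observed pair.
group_a = [12.0, 15.0, 20.0]
group_b = [5.0, 8.0, 10.0]

# Pairwise differences of logs: Group A value minus Group B value.
log_diffs = [math.log(a) - math.log(b) for a in group_a for b in group_b]

# Pairwise ratios on the original scale.
ratios = [a / b for a in group_a for b in group_b]

# exp() is increasing, so it carries the median log difference to the median ratio.
median_ratio_from_logs = math.exp(median(log_diffs))
median_ratio_direct = median(ratios)

print(median_ratio_from_logs, median_ratio_direct)
```

With an even number of pairs the two quantities can differ slightly, because averaging the two middle log differences corresponds to a geometric rather than arithmetic mean of the two middle ratios.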
cluster(varname) specifies the variable which defines sampling clusters. If cluster is defined, then the percentiles are
calculated using the between-cluster Somers’ D, and the confidence intervals are calculated assuming that the data are a
sample of clusters from a population of clusters, rather than a sample of observations from a population of observations.
tdist specifies that the standardized Somers’ D estimates are assumed to be sampled from a t distribution with n - 1 degrees
of freedom, where n is the number of clusters or the number of observations if cluster is not specified.
transf(transformation_name) specifies that the Somers’ D estimates are to be transformed, defining a standard error for the
transformed population value from which the confidence limits for the percentile differences are calculated. z (the default)
specifies Fisher’s z (the hyperbolic arctangent), asin specifies Daniels’ arcsine, and iden specifies identity or untransformed.
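The effect of the three transformations can be illustrated with a short Python sketch. It assumes a delta-method standard error on the transformed scale and a Normal critical value; the function name somers_d_ci and the numbers are hypothetical. It shows only how each transformation reshapes an interval for Somers’ D itself, not how cendif converts such limits into limits for percentile differences.

```python
import math
from statistics import NormalDist

HALF_PI = math.pi / 2

# name -> (transform, derivative of the transform, inverse transform)
TRANSFORMS = {
    "z": (math.atanh, lambda d: 1.0 / (1.0 - d * d), math.tanh),
    "asin": (
        math.asin,
        lambda d: 1.0 / math.sqrt(1.0 - d * d),
        # Clamp before inverting so that wide intervals stay within [-1, 1].
        lambda t: math.sin(max(-HALF_PI, min(HALF_PI, t))),
    ),
    "iden": (lambda d: d, lambda d: 1.0, lambda t: t),
}


def somers_d_ci(d_hat, se, transf="z", level=95.0):
    """Confidence limits for Somers' D, symmetric on the transformed scale.

    Assumes -1 < d_hat < 1; se is the standard error of the untransformed estimate.
    """
    forward, derivative, inverse = TRANSFORMS[transf]
    crit = NormalDist().inv_cdf(0.5 + level / 200.0)
    centre = forward(d_hat)
    se_transformed = se * derivative(d_hat)  # delta method
    return (inverse(centre - crit * se_transformed),
            inverse(centre + crit * se_transformed))


for name in TRANSFORMS:
    print(name, somers_d_ci(0.6, 0.15, transf=name))
```

Under these assumptions, the z and asin limits always stay within the range of Somers’ D (from -1 to 1), whereas untransformed limits can fall outside it when the estimate is near the boundary.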
saving(filename[,replace]) specifies a dataset, to be created, whose observations correspond to the observed values of
differences between a value of depvar in Group A and a value of depvar in Group B. replace instructs Stata to replace
any existing dataset of the same name. The saved dataset can then be reused if cendif is called later, with using, to save
the large amounts of processing time used to calculate the set of observed differences. The saving option and the using
utility are provided mainly for programmers to use, at their own risk.
nohold indicates that any existing estimation results are to be overwritten with a new set of estimation results, for the use of
programmers. By default, any existing estimation results are restored after execution of cendif.
Remarks
cendif calls somersd (see Newson 2000), which has been updated to accept long variable lists. (It was previously
limited to eight variables.)
Methods and formulas
Suppose that a population contains two disjoint subpopulations A and B, and a random variable Y is defined for individuals
from both subpopulations. For 0 < q < 1, a 100qth percentile difference in Y between Populations A and B is defined as a
value θ satisfying
$$D[Y^*(\theta) \mid X] = 1 - 2q \qquad (1)$$
where X is a binary variable equal to 1 for Population A and 0 for Population B, $Y^*(\theta)$ is defined as Y if X = 1 and Y + θ
if X = 0, and D[∙ ∣ ∙] denotes Somers’ D (Somers 1962, Newson 2000). Somers’ D is defined as
$$D[V \mid W] = \frac{E[\operatorname{sign}(V_1 - V_2)\,\operatorname{sign}(W_1 - W_2)]}{E[\operatorname{sign}(W_1 - W_2)^2]} \qquad (2)$$
where $(W_1, V_1)$ and $(W_2, V_2)$ are bivariate data points sampled independently from the same population, and E[·] denotes
expectation. In the case of (1), where $W = X$ and $V = Y^*(\theta)$, Somers’ D is the difference between two conditional probabilities.
Given an individual sampled from Population A and an individual sampled from Population B, these are the probability that the
individual from Population A has the higher $Y^*$ value and the probability that the individual from Population B has the higher
$Y^*$ value. Somers’ D is therefore the parameter equal to zero under the null hypothesis tested by the “nonparametric” Wilcoxon
rank-sum test on $Y^*(\theta)$. In the case where q = 0.5 (and therefore 1 - 2q = 0), a 100qth percentile difference is known as a
median percentile difference and is zero under the null hypothesis tested by a Wilcoxon rank-sum test on Y.
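A sample analogue of this definition can be sketched in a few lines of Python. The sketch assumes unweighted, unclustered data; the function name somers_d_shifted and the data are hypothetical. It computes the proportion of A-B pairs in which the Group A value is higher, minus the proportion in which the shifted Group B value is higher.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def somers_d_shifted(y_a, y_b, theta):
    """Sample D[Y*(theta) | X]: add theta to every Group B value, then take
    P(A value higher) - P(B value higher) over all A-B pairs."""
    total = sum(sign(a - (b + theta)) for a in y_a for b in y_b)
    return total / (len(y_a) * len(y_b))


# Hypothetical data (illustrative only).
y_a = [3, 5, 8, 9]
y_b = [1, 2, 6]

d_unshifted = somers_d_shifted(y_a, y_b, 0)    # Somers' D of Y itself
d_at_median = somers_d_shifted(y_a, y_b, 3)    # 3 is the median pairwise difference

print(d_unshifted, d_at_median)
```

Here shifting Group B by the median of the twelve pairwise differences brings the sample Somers’ D to zero, which is the q = 0.5 case of (1).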
Note that a value of θ satisfying (1) is not always unique. If Y has a discrete distribution, then there may be no solution or
a wide interval of solutions. However, the method used here is intended to produce a confidence interval containing any given
θ satisfying (1), with a probability at least equal to the confidence level, if such a θ exists.
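The non-uniqueness is easy to see in a small discrete example. The following Python sketch uses a sample analogue of (1) with hypothetical data and helper names, and checks candidate values of θ on a coarse grid; it is meant only to illustrate the three possibilities (an interval of solutions, a single solution, no solution), not the search method used by cendif.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def somers_d_shifted(y_a, y_b, theta):
    """Sample D[Y*(theta) | X] over all A-B pairs, with theta added to Group B."""
    total = sum(sign(a - (b + theta)) for a in y_a for b in y_b)
    return total / (len(y_a) * len(y_b))


def solutions_on_grid(y_a, y_b, q, grid):
    """Grid values of theta at which the sample Somers' D equals 1 - 2q."""
    target = 1 - 2 * q
    return [t for t in grid
            if math.isclose(somers_d_shifted(y_a, y_b, t), target, abs_tol=1e-12)]


# Hypothetical discrete data: the pairwise differences are 1, 1, 2, 2.
y_a = [1, 2]
y_b = [0, 0]
grid = [k / 4 for k in range(13)]  # 0, 0.25, ..., 3

median_solutions = solutions_on_grid(y_a, y_b, 0.5, grid)   # every grid point in (1, 2)
q25_solutions = solutions_on_grid(y_a, y_b, 0.25, grid)     # a single point
q40_solutions = solutions_on_grid(y_a, y_b, 0.4, grid)      # no solution

print(median_solutions, q25_solutions, q40_solutions)
```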
We will assume that there are $N_1$ observations sampled from Population A and $N_2$ observations sampled from Population B,
giving a total of $N_1 + N_2 = N$ observations. These observations will be identified by double subscripts, so that $Y_{ij}$ is the Y
value for the jth observation sampled from the ith population (where i = 1 for Population A and i = 2 for Population B). The
corresponding X values (ones and zeros) will be denoted $X_{ij}$. The observations will be assumed to have importance weights