12
Stata Technical Bulletin
STB-8
“copied-back” the information for length 37 from length 38 whereas nlsm more agnostically filled in missing (applying two
even-span smoothers results in shifting the data one unit forward, so information for the first observation is lost).
I do not have an explanation for the remaining three differences except to assert that the results reported by nlsm are as
intended, which is not to say that they are necessarily more correct. There is obviously a difference in assumptions about how
the start-up tail is to be handled between the two routines although, interestingly, that difference is not reflected in how the
trailing tail is handled. (Not too much should be made of that, however. Define the function rev() as the function reversing a
sequence, e.g., rev(y_t) = y_{N-t+1}. Let S() be some smoother. One is tempted to think that S(y_t) = rev(S(rev(y_t))). That
is true for median smoothers of odd span, the Hanning smoother, and the end-point rule. It is not, however, true for median
smoothers of even span.)
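The reversal property can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the helper names rev() and running_median() are hypothetical, the window for an even span is aligned to start span/2 positions to the left of the current point (one possible convention, not necessarily the one used by nlsm), and positions whose window does not fit are left as None rather than filled by any end-point rule.

```python
from statistics import median

def rev(seq):
    """Return the sequence in reverse order, i.e. rev(y_t) = y_{N-t+1}."""
    return list(reversed(seq))

def running_median(y, span):
    """Running median of the given span.

    Assumed convention: the window for position t starts span // 2
    positions to the left of t. For odd spans this is a centered window;
    for even spans it is necessarily off-center. Positions whose window
    does not fit inside the data are left as None.
    """
    n = len(y)
    left = span // 2
    out = [None] * n
    for t in range(n):
        lo, hi = t - left, t - left + span
        if lo >= 0 and hi <= n:
            out[t] = median(y[lo:hi])
    return out

y = [4, 1, 7, 3, 9, 2, 8]
for span in (3, 2):
    direct = running_median(y, span)
    mirrored = rev(running_median(rev(y), span))
    print(span, direct == mirrored)
```

Under these assumptions the span-3 smoother gives the same result in both directions, while the span-2 smoother does not: the off-center window loses the first observation in one direction and the last in the other.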
In any case, the tails produced by any of these smoothers should not be taken too seriously—they are based on too little
data and too many approximations and fix-up rules. The purpose of the smoother is to reveal the pattern for the middle-portions
of the data.
References
Salgado-Ugarte, I. H. and J. C. Garcia. 1992. sed7: Resistant smoothing using Stata. Stata Technical Bulletin 7: 8-11.
Tukey, J. W. 1977. Exploratory Data Analysis, Ch. 7. Reading, MA: Addison-Wesley Publishing Company.
Velleman, P. F. 1977. Robust nonlinear data smoothers: Definitions and recommendations. Proc. Natl. Acad. Sci. USA 74(2): 434-436.
——. 1980. Definition and comparison of robust nonlinear data smoothing algorithms. Journal of the American Statistical Association 75(371): 609-615.
sg1.3 Nonlinear regression command, bug fix
Patrick Royston, Royal Postgraduate Medical School, London, FAX (011)-44-81-740 3119
nlpred incorrectly calculates the predictions and residuals when nl is used with the lnlsq (log least squares) option. The
bug is fixed when the update on the STB-8 diskette is installed. nlpred is used after nl to obtain predicted values and residuals
much as predict is used after regress or fit. The mistake affected only calculations made when the log least squares option
was specified during estimation.
sg7 Centile estimation command
Patrick Royston, Royal Postgraduate Medical School, London, FAX (011)-44-81-740 3119
Stata’s summarize, detail command supplies sample estimates of the 1, 5, 10, 25, 50, 75, 90, 95 and 99th (per)centiles.
To extend summarize, I provide an ado-file for Stata version 3.0 which estimates arbitrary centiles for one or more variables
and calculates confidence intervals, using a choice of methods.
The syntax of centile is
centile [varlist] [if exp] [in range] [, centile(# [# ...]) cci normal meansd level(#) ]
The qth centile of a continuous random variable X is defined as the value of C_q which fulfills the condition P(X < C_q) =
q/100. The value of q must be in the range 0 < q < 100, though q is not necessarily an integer. By default, centile estimates
C_q for the variables in varlist and for the value(s) of q given in centile(# ...). It makes no assumptions as to the distribution
of X and, if necessary, uses linear interpolation between neighboring sample values. Extreme centiles (for example, the 99th
centile in samples smaller than 100) are fixed at the minimum or maximum sample value. An ‘exact’ confidence interval for C_q
is also given, using the binomial-based method described below (see Formulæ). The detailed theory is given by Conover (1980,
111-116). Again, linear interpolation is employed to improve the accuracy of the estimated confidence limits, but extremes are
fixed at the minimum or maximum sample value.
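As an illustration of interpolating between neighboring sample values, here is a minimal Python sketch. The text above does not state the interpolation rule, so the position rule (n + 1)q/100 used here is an assumption, and the function name centile_estimate() is hypothetical. Positions falling outside the sample are fixed at the minimum or maximum, as described above.

```python
import math

def centile_estimate(values, q):
    """Estimate the qth centile by linear interpolation.

    Assumption: the centile sits at position (n + 1) * q / 100 among the
    ordered values (counting from 1). This rule is chosen for illustration
    only; it is not taken from the text.
    """
    if not 0 < q < 100:
        raise ValueError("q must satisfy 0 < q < 100")
    x = sorted(values)
    n = len(x)
    pos = (n + 1) * q / 100
    r = math.floor(pos)          # rank of the lower neighbor
    frac = pos - r               # fractional distance to the upper neighbor
    if r < 1:                    # below the smallest value: fix at the minimum
        return x[0]
    if r >= n:                   # at or beyond the largest value: fix at the maximum
        return x[-1]
    return x[r - 1] + frac * (x[r] - x[r - 1])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(centile_estimate(data, 25), centile_estimate(data, 50), centile_estimate(data, 99))
```

With nine observations the 99th centile cannot be interpolated under this rule, so the sketch returns the sample maximum.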
You can prevent centile from interpolating when calculating binomial-based confidence intervals by specifying the
conservative confidence interval option cci. The resulting intervals are in general wider than with the default, that is, the
coverage (confidence level) tends to be greater than the nominal value (given as usual by level(#), by default 95%).
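A confidence interval without interpolation can be sketched with the textbook order-statistic construction: choose ranks r and s so that the binomial probability of falling between them is at least the nominal level, then report the rth and sth ordered values. The Python code below is a sketch of that general idea only. Whether cci selects ranks in exactly this way is an assumption, and the names binom_cdf() and conservative_ci() are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(B <= k) for B ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def conservative_ci(values, q, level=95):
    """Order-statistic confidence interval for the qth centile.

    Uses whole ranks only (no interpolation), so the actual coverage is
    at least the nominal level. Ranks outside 1..n are fixed at the
    sample minimum or maximum.
    """
    x = sorted(values)
    n = len(x)
    p = q / 100
    alpha = 1 - level / 100

    # Largest rank r with P(B <= r - 1) <= alpha / 2 (r = 0 if none in 1..n).
    r = 0
    while r < n and binom_cdf(r, n, p) <= alpha / 2:
        r += 1

    # Smallest rank s with P(B <= s - 1) >= 1 - alpha / 2 (s = n + 1 if none in 1..n).
    s = 1
    while s <= n and binom_cdf(s - 1, n, p) < 1 - alpha / 2:
        s += 1

    lower = x[r - 1] if r >= 1 else x[0]
    upper = x[s - 1] if s <= n else x[-1]
    return lower, upper

sample = list(range(1, 21))
print(conservative_ci(sample, 50))
```

For the median of 20 observations this selects the 6th and 15th ordered values. The binomial probability of that rank interval is about 0.959, slightly above the nominal 0.95, which is the sense in which such intervals are conservative.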
If the data are believed to be normally distributed (a common case), two alternate methods for estimating centiles are offered.
If normal is specified, C_q is calculated as just described, but its confidence interval is based on a formula for the standard
error (s.e.) of a normal-distribution quantile given by Kendall and Stuart (1969, 237). If meansd is specified instead, C_q
is estimated as x̄ + z_q × s, where x̄ and s are the sample mean and standard deviation and z_q is the qth centile of the standard
normal distribution (e.g., z_95 = 1.645). The confidence interval is derived from the s.e. of the estimate of C_q.
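The mean-and-standard-deviation estimate is simple to reproduce. The Python sketch below computes only the point estimate, the sample mean plus the standard normal centile times the sample standard deviation. It does not attempt the standard error or the confidence interval, whose formulas are not given in the text above. The function name centile_meansd() is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def centile_meansd(values, q):
    """Estimate the qth centile as mean + z_q * sd, assuming normal data."""
    if not 0 < q < 100:
        raise ValueError("q must satisfy 0 < q < 100")
    z_q = NormalDist().inv_cdf(q / 100)   # qth centile of the standard normal
    return mean(values) + z_q * stdev(values)

heights = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up illustrative data
print(centile_meansd(heights, 50), centile_meansd(heights, 95))
```

For q = 50 the standard normal centile is zero, so the estimate reduces to the sample mean. For q = 95 the multiplier is approximately 1.645, as quoted above.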