Stata Technical Bulletin
sbe38 Haplotype frequency estimation using an EM algorithm and log-linear modeling
Adrian Mander, MRC Biostatistics Unit, Cambridge, UK, [email protected]
Abstract: This function estimates allele/haplotype frequencies under a log-linear model when phase is unknown. Different
log-linear models are compared using a likelihood-ratio test allowing tests for linkage disequilibrium and disease association.
These tests can be adjusted for possible confounders in a stratified analysis.
Keywords: Haplotypes, alleles, association studies, stratified analysis, phase unknown, log-linear modeling.
Syntax
hapipf varlist [using exp] [if exp] [, ldim(varlist) display ipf(str) start known
phase(varname) acc(#) ipfacc(#) nolog model(#) lrtest(#,#)
convars(str) confile(str) ]
Description
This function calculates allele/haplotype frequencies using log-linear modeling embedded within an EM algorithm. The
EM algorithm handles the phase uncertainty and the log-linear modeling allows testing for linkage disequilibrium and disease
association. These tests can be controlled for confounders using a stratified analysis specified by the log-linear model. The
log-linear model can also model the relationship between loci and hence can group similar haplotypes.
The log-linear model is fitted using iterative proportional fitting, which is implemented in the ipf command introduced in
Mander (2000). Note that before hapipf can execute, the ipf command must be installed. This algorithm can handle very large
contingency tables and converges to maximum likelihood estimates even when the likelihood is badly behaved.
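As an illustration of iterative proportional fitting, the following is a minimal Python sketch, not the ipf command itself. It assumes a three-way table of counts (one dimension per locus) and fits the model in which the third locus is independent of the first two, by repeatedly rescaling the fitted table to match the observed margins. The function name, the tolerance, and the example counts are hypothetical.

```python
def ipf_l1l2_plus_l3(table, tol=1e-10, max_iter=100):
    """Fit a 3-way table so that the (locus 1 x locus 2) margin and the
    locus 3 margin match the observed ones (illustrative sketch only)."""
    I, J, K = len(table), len(table[0]), len(table[0][0])
    # Observed margins that the fitted table must reproduce.
    m12 = [[sum(table[i][j]) for j in range(J)] for i in range(I)]
    m3 = [sum(table[i][j][k] for i in range(I) for j in range(J))
          for k in range(K)]
    # Start from a table of ones.
    fit = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(max_iter):
        old = [[cell[:] for cell in plane] for plane in fit]
        # Rescale to match the two-way margin of loci 1 and 2.
        for i in range(I):
            for j in range(J):
                s = sum(fit[i][j])
                if s > 0:
                    fit[i][j] = [v * m12[i][j] / s for v in fit[i][j]]
        # Rescale to match the one-way margin of locus 3.
        for k in range(K):
            s = sum(fit[i][j][k] for i in range(I) for j in range(J))
            if s > 0:
                for i in range(I):
                    for j in range(J):
                        fit[i][j][k] *= m3[k] / s
        # Stop when a full cycle no longer changes the fitted values.
        change = max(abs(fit[i][j][k] - old[i][j][k])
                     for i in range(I) for j in range(J) for k in range(K))
        if change < tol:
            break
    return fit


# Hypothetical 2 x 2 x 2 table of counts, used only to exercise the function.
example_table = [[[10, 5], [3, 2]], [[4, 6], [8, 2]]]
fitted = ipf_l1l2_plus_l3(example_table)
```

For this particular model the fitted values have a closed form (the product of the two margins divided by the total), so the loop settles after one cycle; models with overlapping margins generally need several cycles.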
The varlist consists of paired variables representing the alleles at each locus. If phase is known, then the pairs are the
genotypes. When phase is unknown the algorithm assumes Hardy-Weinberg Equilibrium, so models are based on chromosomal
data and not genotypic data.
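The way an EM algorithm resolves unknown phase can be sketched for the simplest case of two biallelic loci, where only double heterozygotes are ambiguous. The Python sketch below is not hapipf's implementation: it assumes a saturated haplotype model (no log-linear constraints), random pairing of haplotypes (Hardy-Weinberg equilibrium), and convergence judged on the change in frequencies rather than on the log likelihood. The function name and input layout are hypothetical.

```python
def em_two_locus(geno_counts, tol=1e-8, max_iter=1000):
    """Estimate haplotype frequencies for two biallelic loci by EM.

    geno_counts[g1][g2] is the number of individuals carrying g1 copies of
    allele 2 at locus 1 and g2 copies of allele 2 at locus 2 (g = 0, 1, 2).
    Haplotypes are keyed (a, b), with 0 for allele 1 and 1 for allele 2.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haps}  # uniform starting values (an assumption)
    n_people = sum(sum(row) for row in geno_counts)
    for _ in range(max_iter):
        counts = {h: 0.0 for h in haps}
        for g1 in range(3):
            for g2 in range(3):
                n = geno_counts[g1][g2]
                if n == 0:
                    continue
                if g1 == 1 and g2 == 1:
                    # E-step: split double heterozygotes between the two
                    # possible phases in proportion to their probabilities.
                    cis = freq[(0, 0)] * freq[(1, 1)]
                    trans = freq[(0, 1)] * freq[(1, 0)]
                    w = cis / (cis + trans) if cis + trans > 0 else 0.5
                    counts[(0, 0)] += n * w
                    counts[(1, 1)] += n * w
                    counts[(0, 1)] += n * (1 - w)
                    counts[(1, 0)] += n * (1 - w)
                else:
                    # Phase is unambiguous: at least one locus is homozygous.
                    a1, a2 = int(g1 == 2), int(g1 >= 1)
                    b1, b2 = int(g2 == 2), int(g2 >= 1)
                    counts[(a1, b1)] += n
                    counts[(a2, b2)] += n
        # M-step: each individual contributes two chromosomes.
        new_freq = {h: counts[h] / (2.0 * n_people) for h in haps}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq
```

Because the M-step divides by twice the number of individuals, the estimates are chromosome-level frequencies, which is the sense in which the models are based on chromosomal rather than genotypic data.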
Options
ldim(varlist) specifies the variables that determine the dimension of the contingency table. By default the variables contained
in the ipf option define the dimension.
display specifies whether the expected and imputed haplotype frequencies are shown on the screen.
ipf(str) specifies the log-linear model. It requires special syntax of the form l1*l2+l3. This model makes the third locus
independent of the first two and includes the interaction between the first and second locus.
start specifies that the starting posterior weights of the EM algorithm are chosen at random.
known specifies that phase is known.
phase(varname) specifies a variable that contains 1's where phase is known and 0's where phase is unknown.
acc(#) specifies the convergence criterion based on the log likelihood.
ipfacc(#) specifies the convergence criterion for the iterative proportional fitting algorithm.
nolog specifies whether the log likelihood is displayed at each iteration.
model(#) specifies a label for the log-linear model being fitted. This label is used in the lrtest option.
lrtest(#,#) performs a likelihood-ratio test using two models that have been labeled by the model option.
convars(str) specifies a list of variables in the constraints file.
confile(str) specifies the name of the constraints file.
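The arithmetic behind a likelihood-ratio test of two nested models, such as the one requested with lrtest, can be sketched as follows. This Python sketch is illustrative only and does not reproduce hapipf's output: the function names are hypothetical, and the degrees of freedom (the difference in the number of free parameters between the two models) are assumed to be supplied by the user.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer df,
    using the standard closed-form series for even and odd df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * math.exp(-half) / math.sqrt(2.0 * math.pi) * total


def lr_test(ll_null, ll_alt, df):
    """Return the likelihood-ratio statistic and its p-value for two
    nested models with maximized log likelihoods ll_null and ll_alt."""
    stat = 2.0 * (ll_alt - ll_null)
    return stat, chi2_sf(stat, df)
```

For a test of linkage disequilibrium between two loci, for instance, the null model would be the one with independent loci and the alternative the one that adds their interaction.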
Examples
The data, taken from Sham (1998), consist of two loci (a and b), case-control status (D), and one stratifying variable
(S). The first few lines of this dataset are shown below.