 |
 |
 |
 |
Allan Wilson Centre for Molecular Ecology and Evolution
|
Theoretical Background
Index:
Compatibility matrices
Lento-Plots
Hadamard Conjugation/Spectral Analysis
Median Networks
Reduced Median Networks
References
Back to top
Compatibility Matrices
In Jakobsen et.al. 1996 compatibility matrices
were introduced for the study of reticulate evolution. The underlying
idea is straighforward but nonetheless powerful. For every two parsimonious
characters in a pre-given alignment, record in a dot-plot whether they
are incompatible (i.e. conflicting) or compatible (not conflicting). Using
two distinct colors (in our case black and white) for the two situations
the recording is done in a symmetric matrix with the rows and columns
successively labeled by the parsimonious sites of the alignment.
The figure above shows a compatibility matrix for the plant data set that
was studied in Huber et.al. 2001(a). A black square
indicates that the characters labeling the row and the column that intersect
in this square are incompatible and a white one that they are compatible.
Resting the mouse cursor over a square highlights the site numbers of
the two splits for that square.
Lento-Plots
Each split has both a support value (e.g. the number of sites which
induce the split) and a conflict value (e.g. the sum of the support for
every split incompatible with the chosen split). Lento
et.al. 1995 introduced a useful graphical representation of these
values, where splits are identified on the X-axis, with a bar above the
axis representing the value of the support for the split, and the bar
below the axis representing the value of the conflict. The splits are
ordered by their support values, decreasing from the split with maximal
value. When the number of splits is large, a threshold is used so that
only those splits with support greater than the threshold are displayed.
The scales above and below the X-axis are weighted so that the areas above
and below are equal, but the user can adjust the scale by clicking and
dragging.
The figure above shows the Lento-Plot for the provided plant data set
from Huber et.al. 2001(a)
The figure above shows the Lento-Plot for the provided viral DNA data
set where the collection of splits was obtained via direct encoding.
The figure above shows the Lento-Plot for the provided viral DNA data
set where the collection of splits was obtained via Hadamard conjugation.
In the Lento-plots displayed above, each split is identified by dots corresponding
to the sequences of the smaller of the split subsets, as identified on
the right of the plot. (However note in these displays, not all sequences
are identified within the display window.) As the singleton splits cannot
have any conflict, they have no bar below the X-axis. If the user wishes
to infer an evolutionary tree, then a greedy approach is to select splits
from the Lento-plot starting from the left, and adding each successive
split which is compatible with splits already selected. (This process
will identify at most each of the n singletons and n-3 informative
sites, where n is the number of sequences.) The selection should
also be guided by the conflict values, so that in the plant data set,
the splits ranked 11, 12 and 15 may be rejected in favour of the adjacent
splits with smaller conflict values.
Hadamard Conjugation/Spectral Analysis
When sequences are derived from an evolutionary tree T, according
to some model of nucleotide point substitution, then there can be an expected
occurrence of splits in the observed sequence which are not splits of
T. Apart from sampling error, there is also a consistent pattern of splits
due to parallel and multiple substitutions at a given site. (These effects
give rise to the "inconsistency" problem with Maximum Parsimony (Felsenstein
1981) and some other tree building algorithms.) For the symmetric
2ST and 3ST substitution models of Kimura (Kimura 1980,
1981) and the simpler one parameter model of Jukes
and Cantor 1969 these effects can be corrected for using Hadamard
conjugation (Hendy and Penny, 1989, Hendy
et.al. 1994, Steel et.al. 1998). Hadamard conjugation
estimates the historical number of substitutions from the observed difference
between the sequences. It does not require knowledge of T (but this can
be inferred from the data), but it does require the full sequence data,
that is including the number of invariant and singleton sites. If the
invariant and singleton sites are not presented then an error message
(indicating negative entries for a log function call) will result.
The Hadamard conjugation is an exponential order algorithm (values must
be computed for each of the potential 2n-1 splits among
n sequences before inversion can occur) so it is practical only
for analyses of up to around n=25 sequences.
After Hadamard conjugation, the support values for each split are estimates
of expected numbers of substitutions, most splits (which probably didn't
occur) will be represented by values very close to 0. The values which
have support below a threshold (the default value corresponds to the expected
number of 0.1 substitutions across the full sequences) do not get reported.
The corrected values should in general be close to the observed values,
however there can be some significant changes in some of them, which can
affect the ranking of the splits in a Lento-plot. If the data has been
derived from an evolutionary tree then we generally find the corrected
data is more tree-like.
Median Networks
Although median networks have been known in the subjects of Discrete
Mathematics and algebra for many years, they were introduced as a tool
for phylogenetic analysis by H.-J. Bandelt only in
1994. Since then, they have been successfully used in the analysis of
human mitochondrial data (c.f. Bandelt et.al. 1995,
Saillard et.al. 2000, Watson
et.al. 1997) and chloroplast data (Huber et.al.
2001(a)).
The underlying idea of a median network is to generate for every three
taxa, represented as finite binary sequences, a hypothetical ancestral
sequence, called their median, by applying the majority rule to every
column of the three aligned sequences (once constant columns and multiple
copies of columns that give rise to the same split, i.e. bipartition,
of the taxa set have been removed). For example after aligning the the
three sequences 011, 101, and 110 as done below, their median can be quickly
determined as being the sequence 111.
011
101
110
median: 111
Clearly, the median is always a binary sequence and is always unique.
Moreover it might be that one of the taxa in the triplet is the median
of the triplet. For example, the median of the sequences 110, 111, and
011 is the sequence 111.
Now given a taxa set X. The median closure of X is obtained by successively
associating to every three sequences in X their median m and then enlarging
X by m. Repeated application of this process, i.e. taking "medians of
medians" and enlarging X by the obtained sequences, yields, after a finite
number of steps, the median closure of X. For example, the median closure
of the four sequences 000, 110, 101, and 110 is the set consisting of
all eight binary sequences of length 3.
The median network generated by a finite set X of taxa is a graph with
vertex set the median closure of X and every two vertices joined by an
edge precisely if they differ in just one position. (c.f. Huber
et.al. 2001(a) and Bandelt et.al. 1995). For
example, the median network generated by the four sequences 000, 110,
101, and 110 is the three dimensional cube.
For two extreme cases the structure of median networks is well understood.
Given an alignment as described above (i.e. an alignment of binary sequences
with no constant column and all multiple copies of columns which induce
the same split of X removed), the associated median network is a tree
if and only if only three out of the four possible combinations 11,10,01,00
occur in every pair of columns in the alignment, whereas it is a hypercube
of dimension the number of columns in the alignment if and only if all
four combinations occur in every pair of columns in the alignment.
Although the method described above is conceptually easy to understand,
the algorithmically preferable approach for constructing median networks
is via median expansion. (cf. Bandelt et.al. 1995).
This construction is implemented in the package.
Note that every split of a collection of splits uniquely corresponds to
a class of parallel edges in the median network associated to this collection.
See Huber et.al. 2001(a) for more details.
The figure shows the median network of the provided plant data set which
appeared in Huber et.al. 2001(a). All 18 splits,
which were generated form the alignment using the direct encoding option,
are represented by the network.
Reduced Median Networks
Although theoretically extremely attractive, median networks suffer
from the fact that when applied to data sets with many conflicting signals
the associated median network is a highly complex structure containing
high dimensional reticulations. To cope with this problem a number of
thinning techniques have been proposed by various authors (Bandelt
et.al. 1995, Huber et.al. 2001(a), Huber
et.al. 2001(b)).
In the first mentioned approach reticulations are resolved based on haplotype
frequency and character weight arguments and this technique has been implemented
in the package. In the future, the other equally attractive alternatives
will be implemented.
The figure shows a reduced median network for the plan data set that appeared
in Huber et.al. 2001(a). The reduction yielded 16
splits and all of them are represented by the network.
References
Bandelt, H.-J. (1994) Phylogenetic networks. Verhandl.
Naturwiss. Vereins Hamburg (NF) 34:51-71.
Bandelt, H.-J., Forster, P., Sykes, B.C., and Richards,
M.B. (1995) Mitochondrial portraits of human population using median networks.
Genetics, 141, 743-753.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences:
a maximum likelihood approach. J. Mol. Evol., 17: 368-376.
Hendy, M.D., and Penny, D. (1989) A framework for
the quantitative study of evolutionary trees. Syst.Zool., 38, 297-309.
Hendy, M.D., and Penny, D. (1993). Spectral analysis
of phylogenetic data, J. Classif., 10, 5-24.
Hendy, M.D., Penny, D., and Steel, M.A. (1994) Discrete
Fourier analysis for evolutionary trees, Proc. Nat. Acad. Sci. (USA),
91, 3339-3343.
Huber, K.T., Langton, M., Penny, D., Moulton, V.,
Hendy, M.. Spectronet: a package for computing spectra and median networks,
submitted.
Huber, K.T., Moulton, V., Lockhart, P. and Dress, A.
(2001)(a) Pruned median networks: a technique for reducing the complexity
of median networks. Molecular Phylogenetics and Evolution 19:302-310.
Huber, K.T., Watson E.E., Hendy, M. (2001)(b) An algorithm
for constructing local regions in a phylogentic network. Molecular Phylogenetics
and Evolution, 19:1-8.
Jakobsen I.B., and Easteal S. (1996) A program for calculating
and displaying compatibility matrices as an aid in determining reticulate
evolution in molecular sequences, CABIOS, 12, 291-295.
Jukes T., and Cantor, C. (1969) Evolution of protein
molecules. In Mammalian Protein Metabolism (Munro, H. ed) pp. 21-132,
New York: Academic Press
Kimura, M. (1980) A simple method for estimating evolutionary
rates of base substitutions through comparative studies of nucleotide
sequences. J. Mol. Evol., 17, 111-120.
Kimura, M. (1981) Estimation of evolutionary sequences
between homologous nucleotide sequences, Proc. Nat. Acad. Sci. (USA),
78, 454-458.
Lento, G.M., Hickson, R.E., Chambers G.K., and
Penny, D. (1995) Use of spectral analysis to test hypotheses on the origin
of pinninpeds. J. Mol. Biol. Evol., 12, 28-52.
Saillard J., Forster P., Lynnerup N., Bandelt H.-J.,
Nørby S. (2000) mtDNA variation among Greenland Eskimos: the edge of
the Beringian expansion. Am J Hum Genet 67:718-726.
Steel, M.A., Hendy, M.D., and Penny, D. (1998) Reconstructing
phylogenies from nucleotide probabilities - a survey and some new results.
Discr. Appl. Math., 88, 367-396.
Watson E., Forster P., Richards M. B., Bandelt H.-J.
(1997) Mitochondrial footprints of human expansions in Africa. American
Journal of Human Genetics 61:691-704.
|