Allan Wilson Centre for Molecular Ecology and Evolution


Theoretical Background

Index:

Compatibility matrices
Lento-Plots
Hadamard Conjugation/Spectral Analysis
Median Networks
Reduced Median Networks
References
Back to top


Compatibility Matrices

In Jakobsen et.al. 1996 compatibility matrices were introduced for the study of reticulate evolution. The underlying idea is straighforward but nonetheless powerful. For every two parsimonious characters in a pre-given alignment, record in a dot-plot whether they are incompatible (i.e. conflicting) or compatible (not conflicting). Using two distinct colors (in our case black and white) for the two situations the recording is done in a symmetric matrix with the rows and columns successively labeled by the parsimonious sites of the alignment.

The figure above shows a compatibility matrix for the plant data set that was studied in Huber et.al. 2001(a). A black square indicates that the characters labeling the row and the column that intersect in this square are incompatible and a white one that they are compatible. Resting the mouse cursor over a square highlights the site numbers of the two splits for that square.


Lento-Plots

Each split has both a support value (e.g. the number of sites which induce the split) and a conflict value (e.g. the sum of the support for every split incompatible with the chosen split). Lento et.al. 1995 introduced a useful graphical representation of these values, where splits are identified on the X-axis, with a bar above the axis representing the value of the support for the split, and the bar below the axis representing the value of the conflict. The splits are ordered by their support values, decreasing from the split with maximal value. When the number of splits is large, a threshold is used so that only those splits with support greater than the threshold are displayed. The scales above and below the X-axis are weighted so that the areas above and below are equal, but the user can adjust the scale by clicking and dragging.

The figure above shows the Lento-Plot for the provided plant data set from Huber et.al. 2001(a)

The figure above shows the Lento-Plot for the provided viral DNA data set where the collection of splits was obtained via direct encoding.

The figure above shows the Lento-Plot for the provided viral DNA data set where the collection of splits was obtained via Hadamard conjugation.

In the Lento-plots displayed above, each split is identified by dots corresponding to the sequences of the smaller of the split subsets, as identified on the right of the plot. (However note in these displays, not all sequences are identified within the display window.) As the singleton splits cannot have any conflict, they have no bar below the X-axis. If the user wishes to infer an evolutionary tree, then a greedy approach is to select splits from the Lento-plot starting from the left, and adding each successive split which is compatible with splits already selected. (This process will identify at most each of the n singletons and n-3 informative sites, where n is the number of sequences.) The selection should also be guided by the conflict values, so that in the plant data set, the splits ranked 11, 12 and 15 may be rejected in favour of the adjacent splits with smaller conflict values.


Hadamard Conjugation/Spectral Analysis

When sequences are derived from an evolutionary tree T, according to some model of nucleotide point substitution, then there can be an expected occurrence of splits in the observed sequence which are not splits of T. Apart from sampling error, there is also a consistent pattern of splits due to parallel and multiple substitutions at a given site. (These effects give rise to the "inconsistency" problem with Maximum Parsimony (Felsenstein 1981) and some other tree building algorithms.) For the symmetric 2ST and 3ST substitution models of Kimura (Kimura 1980, 1981) and the simpler one parameter model of Jukes and Cantor 1969 these effects can be corrected for using Hadamard conjugation (Hendy and Penny, 1989, Hendy et.al. 1994, Steel et.al. 1998). Hadamard conjugation estimates the historical number of substitutions from the observed difference between the sequences. It does not require knowledge of T (but this can be inferred from the data), but it does require the full sequence data, that is including the number of invariant and singleton sites. If the invariant and singleton sites are not presented then an error message (indicating negative entries for a log function call) will result.

The Hadamard conjugation is an exponential order algorithm (values must be computed for each of the potential 2n-1 splits among n sequences before inversion can occur) so it is practical only for analyses of up to around n=25 sequences.

After Hadamard conjugation, the support values for each split are estimates of expected numbers of substitutions, most splits (which probably didn't occur) will be represented by values very close to 0. The values which have support below a threshold (the default value corresponds to the expected number of 0.1 substitutions across the full sequences) do not get reported. The corrected values should in general be close to the observed values, however there can be some significant changes in some of them, which can affect the ranking of the splits in a Lento-plot. If the data has been derived from an evolutionary tree then we generally find the corrected data is more tree-like.


Median Networks

Although median networks have been known in the subjects of Discrete Mathematics and algebra for many years, they were introduced as a tool for phylogenetic analysis by H.-J. Bandelt only in 1994. Since then, they have been successfully used in the analysis of human mitochondrial data (c.f. Bandelt et.al. 1995, Saillard et.al. 2000, Watson et.al. 1997) and chloroplast data (Huber et.al. 2001(a)).

The underlying idea of a median network is to generate for every three taxa, represented as finite binary sequences, a hypothetical ancestral sequence, called their median, by applying the majority rule to every column of the three aligned sequences (once constant columns and multiple copies of columns that give rise to the same split, i.e. bipartition, of the taxa set have been removed). For example after aligning the the three sequences 011, 101, and 110 as done below, their median can be quickly determined as being the sequence 111.

                 011
                101
               110
median: 111

Clearly, the median is always a binary sequence and is always unique. Moreover it might be that one of the taxa in the triplet is the median of the triplet. For example, the median of the sequences 110, 111, and 011 is the sequence 111.

Now given a taxa set X. The median closure of X is obtained by successively associating to every three sequences in X their median m and then enlarging X by m. Repeated application of this process, i.e. taking "medians of medians" and enlarging X by the obtained sequences, yields, after a finite number of steps, the median closure of X. For example, the median closure of the four sequences 000, 110, 101, and 110 is the set consisting of all eight binary sequences of length 3.

The median network generated by a finite set X of taxa is a graph with vertex set the median closure of X and every two vertices joined by an edge precisely if they differ in just one position. (c.f. Huber et.al. 2001(a) and Bandelt et.al. 1995). For example, the median network generated by the four sequences 000, 110, 101, and 110 is the three dimensional cube.

For two extreme cases the structure of median networks is well understood. Given an alignment as described above (i.e. an alignment of binary sequences with no constant column and all multiple copies of columns which induce the same split of X removed), the associated median network is a tree if and only if only three out of the four possible combinations 11,10,01,00 occur in every pair of columns in the alignment, whereas it is a hypercube of dimension the number of columns in the alignment if and only if all four combinations occur in every pair of columns in the alignment.

Although the method described above is conceptually easy to understand, the algorithmically preferable approach for constructing median networks is via median expansion. (cf. Bandelt et.al. 1995). This construction is implemented in the package.

Note that every split of a collection of splits uniquely corresponds to a class of parallel edges in the median network associated to this collection. See Huber et.al. 2001(a) for more details.



The figure shows the median network of the provided plant data set which appeared in Huber et.al. 2001(a). All 18 splits, which were generated form the alignment using the direct encoding option, are represented by the network.


Reduced Median Networks

Although theoretically extremely attractive, median networks suffer from the fact that when applied to data sets with many conflicting signals the associated median network is a highly complex structure containing high dimensional reticulations. To cope with this problem a number of thinning techniques have been proposed by various authors (Bandelt et.al. 1995, Huber et.al. 2001(a), Huber et.al. 2001(b)).

In the first mentioned approach reticulations are resolved based on haplotype frequency and character weight arguments and this technique has been implemented in the package. In the future, the other equally attractive alternatives will be implemented.



The figure shows a reduced median network for the plan data set that appeared in Huber et.al. 2001(a). The reduction yielded 16 splits and all of them are represented by the network.


References

Bandelt, H.-J. (1994) Phylogenetic networks. Verhandl. Naturwiss. Vereins Hamburg (NF) 34:51-71.

Bandelt, H.-J., Forster, P., Sykes, B.C., and Richards, M.B. (1995) Mitochondrial portraits of human population using median networks. Genetics, 141, 743-753.

Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17: 368-376.

Hendy, M.D., and Penny, D. (1989) A framework for the quantitative study of evolutionary trees. Syst.Zool., 38, 297-309.

Hendy, M.D., and Penny, D. (1993). Spectral analysis of phylogenetic data, J. Classif., 10, 5-24.

Hendy, M.D., Penny, D., and Steel, M.A. (1994) Discrete Fourier analysis for evolutionary trees, Proc. Nat. Acad. Sci. (USA), 91, 3339-3343.

Huber, K.T., Langton, M., Penny, D., Moulton, V., Hendy, M.. Spectronet: a package for computing spectra and median networks, submitted.

Huber, K.T., Moulton, V., Lockhart, P. and Dress, A. (2001)(a) Pruned median networks: a technique for reducing the complexity of median networks. Molecular Phylogenetics and Evolution 19:302-310.

Huber, K.T., Watson E.E., Hendy, M. (2001)(b) An algorithm for constructing local regions in a phylogentic network. Molecular Phylogenetics and Evolution, 19:1-8.

Jakobsen I.B., and Easteal S. (1996) A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences, CABIOS, 12, 291-295.

Jukes T., and Cantor, C. (1969) Evolution of protein molecules. In Mammalian Protein Metabolism (Munro, H. ed) pp. 21-132, New York: Academic Press

Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 17, 111-120.

Kimura, M. (1981) Estimation of evolutionary sequences between homologous nucleotide sequences, Proc. Nat. Acad. Sci. (USA), 78, 454-458.

Lento, G.M., Hickson, R.E., Chambers G.K., and Penny, D. (1995) Use of spectral analysis to test hypotheses on the origin of pinninpeds. J. Mol. Biol. Evol., 12, 28-52.

Saillard J., Forster P., Lynnerup N., Bandelt H.-J., Nørby S. (2000) mtDNA variation among Greenland Eskimos: the edge of the Beringian expansion. Am J Hum Genet 67:718-726.

Steel, M.A., Hendy, M.D., and Penny, D. (1998) Reconstructing phylogenies from nucleotide probabilities - a survey and some new results. Discr. Appl. Math., 88, 367-396.

Watson E., Forster P., Richards M. B., Bandelt H.-J. (1997) Mitochondrial footprints of human expansions in Africa. American Journal of Human Genetics 61:691-704.


Site Map Site Search E-mail Webmaster