UNIT VI PHYLOGENETIC ANALYSIS AND TREE BUILDING


Introduction to phylogenetics

 From the time of Charles Darwin, it has been the dream of many biologists to reconstruct the evolutionary history of all organisms on Earth and express it in the form of a phylogenetic tree. Phylogeny uses evolutionary distance, or evolutionary relationship, as a way of classifying organisms (taxonomy).
Phylogenetic relationship between organisms is given by the degree and kind of evolutionary distance. To understand this concept better, let us define taxonomy. Taxonomy is the science of naming, classifying and describing organisms. Taxonomists arrange the different organisms in taxa (groups). These are then further grouped together depending on biological similarities. This grouping of taxa reflects the degree of biological similarity.
Systematics takes taxonomy one step further by elucidating new methods and theories that can be used to classify species. This classification is based on similarity traits and possible mechanisms of evolution. In the 1950s, Willi Hennig, a German biologist, proposed that systematics should reflect the known evolutionary history of lineages, an approach he called phylogenetic systematics. Phylogenetic systematics is therefore the field that deals with identifying and understanding the evolutionary relationships among the many different kinds of organisms.
Phylogenetic relationships have traditionally been studied using morphological data. Scientists examined different traits or characteristics and tried to establish the degree of relatedness between organisms. They then realized that not all shared characteristics are useful in studying relationships between organisms. This discovery led to a branch of systematics called cladistics. Cladistics is the study of phylogenetic relationships based on shared, derived characteristics. There are two types of characteristics, primitive traits and derived traits, which are described below.
Primitive traits are characteristics of organisms that were present in the ancestor of the group that is under study. They do not indicate anything about the relationships of species within a group because they are inherited from the ancestor to all of the members of the group. Derived traits are characteristics of organisms that have evolved within the group under study. These characteristics were not present in the ancestor. They are useful because they can help explain why some species have common traits. The most likely explanation for the presence of a trait that was not present in the ancestor of the whole group is that it evolved from a more recent ancestor.
Two broad groups of analyses exist to examine phylogenetic relationships: phenetic methods and cladistic methods. Phenetic methods, or numerical taxonomy, use various measures of overall similarity for the ranking of species. They can use any number or type of characters, but the data have to be converted into numerical values. The organisms are compared to each other for all of the characters and then the similarities are calculated. After this, the organisms are clustered based on the similarities. The resulting diagrams are called phenograms. They do not necessarily reflect evolutionary relatedness. The cladistic method is based on the idea that members of a group share a common evolutionary history and are more closely related to members of the same group than to any other organisms. Groups are recognized by their shared derived characteristics, which are called synapomorphies.
The introduction of two important tools has dramatically improved the study of phylogenetics. The first tool is the development of computer algorithms capable of constructing phylogenetic trees. The second tool is the use of molecular sequence data for phylogenetic studies.

Phylogenetics can use both molecular and morphological data in order to classify organisms. Molecular methods are based on studies of gene sequences. The assumption of this methodology is that the similarities between genomes of organisms will help to develop an understanding of the taxonomic relationship among these species. Morphological methods use the phenotype as the basis of phylogeny. These two methods are related since the genome strongly contributes to the phenotype of the organisms. In general, organisms with more similar genes are more closely related. The advantage of molecular methods is that they make it possible to study genes that have no morphological expression.
As previously mentioned, closely related species share a more recent common ancestor than distantly related species. The relationships between species can be represented by a phylogenetic tree. This is a graphical representation that has nodes and branches. The nodes represent taxonomic units. Branches reflect the relationships of these nodes in terms of descent. The branch length usually indicates some form of evolutionary distance. The existing species, called operational taxonomic units (OTUs), are at the tips of the branches, on the external nodes.
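The description of nodes, branches and tips above can be sketched as a small data structure. This is only an illustration: the `Node` class, the taxon names and the branch lengths are hypothetical, and the output uses the Newick text format that many tree programs read.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rooted tree (hypothetical helper, not a library class)."""
    name: Optional[str] = None      # set for tips (OTUs), None for internal nodes
    length: float = 0.0             # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)

    def is_tip(self) -> bool:
        """A tip (external node) has no descendants."""
        return not self.children

    def tips(self) -> List[str]:
        """Names of all OTUs below this node, left to right."""
        if self.is_tip():
            return [self.name]
        return [t for child in self.children for t in child.tips()]

    def newick(self) -> str:
        """Newick text for the subtree rooted here (without the final ';')."""
        if self.is_tip():
            return f"{self.name}:{self.length}"
        inner = ",".join(child.newick() for child in self.children)
        return f"({inner}):{self.length}"


# Made-up three-taxon tree: human and chimp share the most recent ancestor.
tree = Node(children=[
    Node(children=[Node("human", 0.1), Node("chimp", 0.1)], length=0.2),
    Node("mouse", 0.3),
])
print(tree.tips())
print(tree.newick() + ";")
```

Internal nodes carry no name here because they stand for inferred ancestors rather than observed species.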
Tree construction methods
Several methods have been proposed for the construction of phylogenetic trees. They can be classified into two groups: the cladistic methods (maximum parsimony and maximum likelihood) and the phenetic method (distance matrix method).

Maximum parsimony assumes that simple hypotheses are preferable to complicated ones. The tree constructed with this method is therefore the one that requires the smallest number of evolutionary changes to explain the phylogeny of the species under study. In practice, the method compares candidate trees and chooses the tree with the fewest evolutionary steps (nucleotide substitutions, in the context of DNA sequences).
Maximum likelihood evaluates the topologies of different trees and chooses the best one under a specified model. The model describes the evolutionary process that can account for the conversion of one sequence into another. The parameters estimated for each topology are the branch lengths.
Distance matrix is a phenetic approach preferred by many molecular biologists for DNA and protein work. This method estimates the mean number of changes (per site in sequence) in two taxa that have descended from a common ancestor. There is much information in the gene sequences that must be simplified in order to compare only two species at a time. The relevant measure is the number of differences in these two sequences, a measure that can be interpreted as the distance between the species in terms of relatedness.
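The pairwise comparison described above can be sketched as follows. This is a minimal illustration using the uncorrected proportion of differing sites (often called the p-distance); the function names and the three short sequences are made up, and columns containing a gap are simply skipped, which is one convention among several.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip columns where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances keyed by (name_1, name_2), two taxa at a time."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)
    }


# Made-up aligned sequences, for illustration only.
alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTTC",
    "taxon3": "ACGAACCTTC",
}
for pair, d in distance_matrix(alignment).items():
    print(pair, round(d, 2))
```

In this toy data set taxon1 and taxon2 differ at one site in ten, so they come out as the closest pair.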
Molecular phylogeny was first suggested in 1962 by Pauling and Zuckerkandl. They noted that the rates of amino acid substitution in animal hemoglobin were roughly constant over time. They described the molecules as documents of evolutionary history. The molecular method has many advantages. Genotypes can be read directly, organisms can be compared even if they are morphologically very different and this method does not depend on phenotype.
Phylogeny is currently used in many fields such as molecular biology, genetics, evolution, development, behaviour, epidemiology, ecology, systematics, conservation biology, and forensics. Biologists can infer hypotheses from the structure of phylogenetic trees and establish models of different events in evolutionary history. Phylogeny is an exceptional way to organize evolutionary information. Through these methods, scientists can analyse and elucidate different processes of life on Earth.
Today, biologists calculate that there are about 5 to 10 million species of organisms. Different lines of evidence, including gene sequencing, suggest that all organisms are genetically related and may descend from a common ancestor. This relationship can be represented by an evolutionary tree, like the Tree of Life. The Tree of Life is a project that is focused on understanding the origin of diversity among species using phylogeny.




Methods of phylogenetic analysis


Phylogenetic methods can be used for many purposes, including analysis of morphological and several kinds of molecular data. We concentrate here on the analysis of DNA and protein sequences.
Comparisons of more than two sequences
Analysis of gene families, including functional predictions
Estimation of evolutionary relationships among organisms
The basic concepts of phylogenetic analysis are quite easy to understand, but understanding what the results of the analysis mean, and avoiding errors of analysis, can be quite difficult. For detailed coursework you can take my graduate class on the topic.
COG analysis
A "quick and dirty" substitute for phylogenetic analysis
Using BLAST for multiple sequence comparisons
Emphasis is on reciprocal best hits, particularly among three genomes
This is probably an OK way to identify homologs, but it does not have the power of full phylogenetic analysis
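The reciprocal-best-hit idea can be sketched for two genomes as below. The gene names and scores are invented, the function names are hypothetical, and the input is assumed to be a table of similarity scores (for example BLAST bit scores) that has already been parsed; the three-genome comparison used for COGs would repeat this for each pair of genomes.

```python
def best_hits(scores: dict) -> dict:
    """For each query gene, the subject gene with the highest score.

    `scores` maps (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> list:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Invented scores: genome A genes searched against genome B, and the reverse.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a2"): 175.0, ("b2", "a3"): 110.0}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Gene a3 also hits b2, but b2's best hit is a2, so a3 is left out; this is the kind of case where a paralog may be involved and a full phylogenetic analysis would say more.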
Example with Everyday Objects
The basic model of phylogenetic analysis.
Nearly all methods of phylogenetic analysis share a number of fundamental assumptions. These include:
Homologous sequences are in a multiple sequence alignment.
• Note that homology is an a priori assumption of most phylogenetic methods. If homology is uncertain, then the analytical results should be interpreted with great caution.
The alignment is also referred to as a data matrix
Each column in the alignment is referred to as a character.
The specific residue (nucleotide or amino acid) present in a given sequence is referred to as the character state.
They are assumed to have been derived from a single common ancestor (this statement is actually redundant; by definition homologous sequences must be derived from a common ancestor).
In most cases ancestral sequences are not known, and the ancestral states must be inferred
The ancestral sequences are assumed to have undergone mutation
Modeling mutation accurately is one of the challenges of phylogenetic analysis
They are assumed to be related by a dichotomously branching tree
A priori assumptions include (but are not necessarily limited to):
Accuracy of sequence
That the sequence itself is correct
That it was determined from the correct organism
Violations of this assumption are more common than one might suspect. Several kinds of laboratory errors can result in incorrect annotation of an otherwise legitimate sequence.
That homology has been correctly determined. This applies to both the sequences themselves and the alignment.
Paralogy can cause tremendous confusion.
The assumptions that went into making the multiple sequence alignment are among the assumptions of the phylogenetic analysis that is based on that alignment.
That sufficient similarity remains among the sequences that there is usable phylogenetic information present.
The assumptions of phylogenetic analysis described above
Other critical considerations
The information content of the sequences
Invariant sequences
Saturated sequences
Assumptions particular to the analytical method (this will constitute much of our discussion for the next few lectures)
Markov Model
Note that even if a gene phylogeny is correctly inferred, that phylogeny may not be helpful. For example, because of paralogy, hybridization, introgression, and horizontal gene transfer, gene phylogenies do not always correspond to the phylogeny of the genome as a whole.
The data matrix
Characters
Character states
Multiple sequence alignments as data matrices
The importance of homology assessment
Phylogenetic methods can be divided into three general categories
Parsimony
Minimum Distance
Likelihood
Optimality criteria vs. tree-building algorithms
Parsimony
Part of a larger theoretical system referred to as "Cladistics"
Emphasises shared derived character states
The idea is that monophyletic groups can be recognized because they share derived character states ("synapomorphies").
Invariant, unique ("autapomorphic"), and ancestral character states are considered to be uninformative
Search for the tree that requires the smallest number of character-state changes
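As a rough illustration of which characters carry information for parsimony, the sketch below splits an alignment into columns and labels each one. It uses the common working criterion that a site is informative when at least two states each occur in at least two taxa. The function names and sequences are made up, and ancestral states cannot be recognized here because no outgroup is supplied.

```python
from collections import Counter


def columns(alignment: list) -> list:
    """Turn a list of aligned sequences into a list of characters (columns)."""
    return ["".join(col) for col in zip(*alignment)]


def classify_site(column: str) -> str:
    """Label one character as invariant, uninformative, or informative."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Informative: at least two states, each shared by two or more taxa.
    shared = [state for state, n in counts.items() if n >= 2]
    return "informative" if len(shared) >= 2 else "uninformative"


# Four made-up sequences, five characters each.
alignment = ["AAGCT", "AAGCT", "ATACT", "ATACA"]
for i, col in enumerate(columns(alignment), start=1):
    print(i, col, classify_site(col))
```

The last column has a state found in a single sequence only (an autapomorphy), so it is reported as uninformative.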

Determining the length of a tree

Minimum number of steps for a given character can be determined in one pass
We will look at a simple case with unordered characters
1. Assign a state to each terminal node.
2. Visit the next internal node (starting with the first). Is the intersection of the states of its daughter nodes non-empty?
      Yes: set the internal state to this intersection.
      Else: set the state to the smallest set containing the states of the daughter nodes, and increase the tree length by 1.
3. Are you at the root of the tree?
      No: go to 2.
      Yes: go to 4.
4. Is the state at this node the same as the outgroup state?
      Yes: proceed to the next character.
      Else: add one to the length of the tree; proceed to the next character.
This tells you the tree length, but does not map the characters onto the tree
Determining a most parsimonious reconstruction requires another pass
This reconstruction will not necessarily be unique!
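The single pass described above can be sketched in a few lines. This is an illustrative version only: the tree is assumed to be rooted and strictly binary and is written as nested tuples of tip names, the outgroup comparison of step 4 is left out (the outgroup, if any, is assumed to be one of the tips), and the function names and sequences are made up.

```python
def fitch_length(tree, states):
    """Minimum number of changes for one unordered character.

    `tree` is a nested tuple of tip names, e.g. (("A", "B"), ("C", "D")).
    `states` maps each tip name to its character state.
    Returns (state_set_at_this_node, number_of_steps_below_it).
    """
    if isinstance(tree, str):                 # terminal node
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_length(left, states)
    right_set, right_steps = fitch_length(right, states)
    steps = left_steps + right_steps
    common = left_set & right_set
    if common:                                # non-empty intersection: no cost
        return common, steps
    return left_set | right_set, steps + 1    # union: tree length grows by 1


def tree_length(tree, alignment):
    """Sum of the per-character lengths over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_length(tree, column)[1]
    return total


# Made-up data: four taxa, four characters.
alignment = {"A": "AAGC", "B": "AAGT", "C": "ATAC", "D": "ATAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))
```

For these data the first topology needs 4 steps and the second 5, so parsimony prefers the first. As noted above, this gives only the tree length; mapping the changes onto branches would need a second pass.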
The problem with uncorrected methods
Parsimony is easy to understand and can be a useful analytical method, but the method makes some assumptions that may not be immediately obvious. One of parsimony's most important assumptions is that it is relatively unusual for identical character-states to appear independently in different parts of the phylogenetic tree. In other words, it assumes that convergent evolution is a relatively rare phenomenon.
Unfortunately this is not a valid assumption for biological sequence data.
When the possible number of character states is limited, then one expects to observe convergent evolution. Because DNA has only four possible character states, two unrelated DNA sequences would be expected to have the same nucleotide present in roughly 25% of all positions. Two random aligned sequences would be expected to share somewhat more than 25% sequence identity (why?).
Because of this, under some conditions parsimony methods will be inconsistent
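The 25% figure can be checked with a short calculation. The sketch below assumes that sites are independent and that both sequences draw their nucleotides from the same frequencies; the function name and the skewed frequencies are hypothetical. It addresses only the effect of base composition, not the effect of the alignment procedure itself.

```python
def chance_identity(freqs: dict) -> float:
    """Probability that two independently drawn residues are identical."""
    total = sum(freqs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return sum(p * p for p in freqs.values())


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}   # hypothetical AT-rich case

print(chance_identity(equal))
print(round(chance_identity(skewed), 3))
```

With equal frequencies the result is exactly 0.25; any departure from equal frequencies raises it.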
Although amino acid data have more character states than DNA and are therefore probably less
Models of DNA Sequence Evolution
Jukes-Cantor (JC)
All substitutions are equally likely
All nucleotides occur with equal frequency
Kimura Two Parameter (K2P)
Transitions and transversions can occur at different rates
All nucleotides occur with equal frequency

        A              C              G              T
A       -              Transversion   Transition     Transversion
C       Transversion   -              Transversion   Transition
G       Transition     Transversion   -              Transversion
T       Transversion   Transition     Transversion   -

In the evolution of real sequences transitions are typically observed more often than transversions.
Example of a substitution probability matrix consistent with the K2P model.

        A      C      G      T
A       0.6    0.1    0.2    0.1
C       0.1    0.6    0.1    0.2
G       0.2    0.1    0.6    0.1
T       0.1    0.2    0.1    0.6
These values represent the probability of the corresponding event occurring within a unit of time, t.
The values in the diagonals are selected such that each row adds up to one. Each row has to add up to one because the substitution matrix takes into account all possible events within the model.
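A matrix of this kind can be built from just two numbers, which is the point of the K2P model. The sketch below is an illustration under the assumption of a simple discrete-time Markov chain, where applying the matrix twice gives the probabilities after two units of time; the function names are hypothetical.

```python
def k2p_matrix(transition: float, transversion: float) -> dict:
    """One-step substitution probabilities under K2P.

    `transition` is the probability of the single possible transition and
    `transversion` the probability of each of the two possible transversions.
    """
    purines = {"A", "G"}
    bases = "ACGT"
    stay = 1.0 - transition - 2.0 * transversion   # diagonal: rows sum to one
    if stay < 0:
        raise ValueError("probabilities exceed one")
    matrix = {}
    for x in bases:
        matrix[x] = {}
        for y in bases:
            if x == y:
                matrix[x][y] = stay
            elif (x in purines) == (y in purines):
                matrix[x][y] = transition          # A<->G or C<->T
            else:
                matrix[x][y] = transversion
    return matrix


def step(p: dict, q: dict) -> dict:
    """Matrix product: probabilities after applying p and then q."""
    bases = "ACGT"
    return {x: {y: sum(p[x][k] * q[k][y] for k in bases) for y in bases}
            for x in bases}


P = k2p_matrix(transition=0.2, transversion=0.1)   # reproduces the table above
P2 = step(P, P)                                    # two units of time
for x in "ACGT":
    print(x, [round(P2[x][y], 2) for y in "ACGT"])
```

After two steps the probability of still observing the original nucleotide drops from 0.6 to 0.42, and every row still sums to one.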
Felsenstein 1984 and Hasegawa, Kishino, and Yano, 1985 (F84/HKY85)
Transitions and transversions occur at different rates
The four nucleotides can occur with different frequencies
General Time Reversible
Each of the six possible substitutions occurs at a different rate, but rates are always symmetrical, i.e., the rate for A being substituted by C is equal to the rate for C being substituted by A.
Nucleotides can occur with different frequencies.
Modeling site-to-site rate variation
Invariant sites model
Gamma model
Minimum Distance
Pairwise distances can be aggregated into a phylogenetic tree
Search for the tree that minimizes discrepancies among pairwise distances
May or may not use an explicit model of sequence evolution
How the distances are calculated and how the tree is found can be mixed and matched
To know what method is being used, you have to know both how the distance matrix was constructed, and how the tree was determined
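As an example of distances calculated with an explicit model, the sketch below applies the standard Jukes-Cantor and Kimura two-parameter corrections to the observed proportions of differences. It assumes two aligned DNA sequences of equal length with no gaps or ambiguity codes; the function names and the sequences are made up.

```python
import math

PURINES = {"A", "G"}


def difference_proportions(seq_a: str, seq_b: str):
    """Return (proportion of transitions, proportion of transversions)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1        # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


def jc69_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    ts, tv = difference_proportions(seq_a, seq_b)
    p = ts + tv
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura distance: d = -(1/2) ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    ts, tv = difference_proportions(seq_a, seq_b)
    a = 1.0 - 2.0 * ts - tv
    b = 1.0 - 2.0 * tv
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a * math.sqrt(b))


# Made-up pair: 20 sites, two transitions and one transversion.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GTTTACGTACGTACGTACGT"
print(difference_proportions(seq_a, seq_b))
print(round(jc69_distance(seq_a, seq_b), 4), round(k2p_distance(seq_a, seq_b), 4))
```

Both corrected distances are larger than the observed 15% difference, because the models allow for multiple substitutions at the same site. Either set of distances could then be passed to whichever tree-finding procedure is chosen.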
Likelihood
A model of sequence evolution can be used to relate the data to a hypothesis (typically a tree topology).
Maximum likelihood
Search for the tree that maximizes the likelihood function
The idea is to find the tree under which the observed data are most probable, given the model
Bayesian analysis
Typically uses a Markov chain Monte Carlo (MCMC) algorithm
Estimates posterior probabilities for branch lengths and tree topologies
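To make the likelihood function concrete, the sketch below computes the log-likelihood of an alignment for a fixed tree under the Jukes-Cantor model, using the usual recursion over the tree (Felsenstein's pruning approach). It is an illustration only: the tree encoding, the function names, the branch lengths (all set to 0.1) and the data are hypothetical, and a real program would also optimize branch lengths and search over topologies.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """JC probability of observing y after branch length t, starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partials(node, site_states):
    """L[x] = probability of the data below `node` given state x at `node`.

    A tip is a name (str); an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == site_states[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        child_l = partials(child, site_states)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * child_l[y] for y in BASES)
    return result


def log_likelihood(tree, alignment) -> float:
    """Sum over sites of the log probability of each column."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        root = partials(tree, column)
        total += math.log(sum(0.25 * root[x] for x in BASES))  # equal frequencies
    return total


# Made-up data and two candidate topologies with identical branch lengths.
alignment = {"A": "AAGC", "B": "AAGT", "C": "ATAC", "D": "ATAT"}
tree_1 = (((("A", 0.1), ("B", 0.1)), 0.1), ((("C", 0.1), ("D", 0.1)), 0.1))
tree_2 = (((("A", 0.1), ("C", 0.1)), 0.1), ((("B", 0.1), ("D", 0.1)), 0.1))
print(log_likelihood(tree_1, alignment), log_likelihood(tree_2, alignment))
```

With these branch lengths the topology grouping A with B has the higher log-likelihood, since two of the three variable columns fit it with a single change.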


Properties of analytical methods
Consistency
A method is consistent if it converges on the correct answer as the amount of data increases.
Power
A method is powerful if it can find the correct answer with very few data.
Accuracy
A method is accurate if in multiple trials it produces answers that follow a normal distribution centered on the correct answer.
Precision
A method is precise if in multiple trials it finds answers that are very close to each other (i.e., have low variance).




Automated tools for phylogenetic analysis

TOPD/FMTS

Software to compare phylogenetic trees. Puigbò Avalos et al. of the Evolutionary Genomics Group, URV, developed this tool to evaluate similarities and differences between phylogenetic trees. TOPD/FMTS can compare trees with leaf-sets that either completely or partially overlap, and can also be used to compare two trees, one or both of which are multigene family trees. It implements the following methods to compare phylogenetic trees: Split distance, Nodal distance, Disagree, Taxa in common, Quartets, Triplets. The Perl source of TOPD/FMTS is available, as are precompiled versions for Linux, Windows and Macintosh operating systems. <pubmed>17459965</pubmed>

TreeTop Phylogenetic tree prediction online server from the A.N.Belozersky Institute at the Russian EMBnet Node.


Phylip (Pasteur) Phylogeny : Phylip programs for inferring phylogenies.


PAUP

Phylogenetic Analysis Using Parsimony. Tools for inferring and interpreting phylogenetic trees. Analyze molecular sequences, morphological data, and other data types using maximum likelihood, parsimony and distance methods. PAUP* has many options and close compatibility with MacClade. It includes parsimony, distance matrix, invariants, and maximum likelihood methods and many indices and statistical tests<pubmed argument="PAUP">12116651,12116942</pubmed>


TreeView Tree drawing software for Apple Macintosh and Windows (and now Linux and Unix). TreeView is a simple program for displaying phylogenies on Apple Macintosh and Windows PCs.



TreeGen Tree generation from distance data.



Phylodendron Phylogenetic tree drawing. A tree drawing is generated on the server by filling out the Phylodendron web form.



QuickTree Rapid reconstruction of phylogenies by the Neighbor-Joining Method.


Phylip PHYLIP is a free package of programs for inferring phylogenies, by Joe Felsenstein. It is distributed as source code, documentation files, and a number of different types of executables.



Tree-Puzzle TREE-PUZZLE reconstructs phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, allowing analysis of large data sets, automatically assigning estimations of support to each internal branch. It computes pairwise maximum likelihood distances as well as branch lengths for user specified trees. TREE-PUZZLE uses likelihood mapping to investigate the support of a hypothesized internal branch without computing an overall tree and to visualize the phylogenetic content of a sequence alignment. TREE-PUZZLE also conducts a number of statistical tests on the data set, and can use several substitution models.<pubmed argument="Tree-Puzzle">11934758</pubmed>





FastDNAml Construction of phylogenetic trees of DNA sequences using maximum likelihood (Olsen, Matsuda, Hagstrom, Overbeek).



TNT Tree Analysis Using New Technology. Analyses large data sets (e.g. 300-500 taxa) in reasonable times (minutes to find a shortest tree, hours to produce a reliable consensus).


DAMBE Data Analysis in Molecular Biology and Evolution. General-purpose package for DNA and protein sequence phylogenies, and also gene frequencies. It can read and convert a number of file formats, and has many features for descriptive statistics. It can compute a number of commonly-used distance matrix measures and infer phylogenies by parsimony, distance, or likelihood methods, including bootstrapping (by sites or by codons) and jackknifing. There are a number of kinds of statistical tests of trees available. It can also display phylogenies.<pubmed argument="DAMBE">11535656</pubmed>


TreeFinder Computes phylogenetic trees from DNA sequences. A very fast maximum likelihood program by Gangolf Jobb. TREEFINDER computes phylogenetic trees from nucleotide sequences. Using the widely accepted Maximum Likelihood method, it offers a variety of evolutionary models up to the general time reversible model (GTR) with Gamma distributed rates among sites. All model parameters including the rate heterogeneity can be estimated from the data. For protein-coding sequences, one can in addition assume and estimate individual codon position rates. A genetic tree search algorithm explores the tree space for the likeliest trees. The exhaustiveness of the search can be adjusted to the user's patience.


Phase Software package for PHylogenetics And Sequence Evolution, specifically designed for maximum-likelihood and Bayesian phylogenetic inference with RNA sequences that have a conserved secondary structure. PHASE is a package that performs molecular phylogenetic inference. The software seeks to accurately compare molecular sequences to determine the likely evolutionary relationships between a group of species. This package is designed specifically for use with RNA sequences that have a conserved secondary structure, e.g., rRNA and tRNA. Most phylogenetic programs assume that each site in a molecule evolves independently of the others, but this assumption is not valid for RNA genes. Standard Maximum Likelihood techniques and Bayesian methods (using MCMC) are implemented for inferring the optimal tree with any of the DNA or RNA evolution models. Features include: Bayesian estimation of phylogenies and substitution model parameters, standard ML search algorithms for inferring the optimal tree with optional topology constraints, 6, 7 and 16 state RNA models, standard 4 state DNA models, invariant and discrete gamma models for substitution rate heterogeneity between sites, and mixing of molecular data types in a single analysis.<pubmed argument="Phase">12878461,12200486</pubmed>


Jevtrace Jevtrace allows detailed mining and graphical visualization. Jevtrace is the Java implementation of the Evolutionary Trace (ET) method. It expands on the ET by allowing interaction with the underlying data, analysis and results. The software includes a graphical interface integrating multiple sequence alignment, phylogeny and protein structure. Interaction with protein structures is performed through WebMol by interactively mapping sequence selections onto protein structures. The software runs on Linux, Windows, Macintosh, SGI.<pubmed argument="Jevtrace ">12537566 </pubmed>


ArboDraw - creates and saves images of phylogenetic trees Build and display phylogenetic trees of protein sequences. Display dendrogram files in Newick format (generated by MUSCLE, Phylip, ClustalW). Edit, annotate, and colour various parts of the tree, and save your work.



ModelGenerator: amino acid and nucleotide substitution model selection ModelGenerator is a model selection program that generates optimal amino acid and nucleotide substitution models from Fasta or Phylip alignments. ModelGenerator supports 56 nucleotide and 80 amino acid substitution models.






Role of multiple sequence alignment algorithms in phylogenetic analysis

Construction of phylogenetic tree


( for these topics - material refer to the following URL)