As mentioned in the previous entry, one of the limitations hindering the existence of a comprehensive, publicly available tree of life is that the information is scattered across thousands of journal articles in the form of figures (cladograms, phylograms, ultrametric trees, etc.). Each of those figures represents only a small fraction of the whole Tree of Life, so it is necessary to assemble those fractions into a single tree structure.
Although this may sound as simple as assembling a jigsaw puzzle, the process is much more complex in practice. This is the first of a series of blog posts describing how the synthetic tree seen in the app was built from hundreds of smaller trees found in the scientific literature. As a reminder, all the references consulted for phylogenetic information on different parts of the tree of life are listed in the references section of this site. You will note that each reference has a comment indicating which part of the tree of life it contributes information to. You can use the good ol’ Ctrl+F find function to look for a specific group of organisms by scientific name (e.g. Mammalia, Chondrichthyes, Bacteria) and then find the articles that back up the topology (or divergence dates) of those specific fractions of the tree of life.
Tree file format:
The app draws the shape of the tree of life by reading a simple Newick tree file. This Newick file is not generated automatically as output by any of the common software packages for phylogenetic inference. Rather, it is written by hand from the tree figures printed in scientific articles, after those trees have been assembled and/or altered by the procedures that will be described below.
One of the advantages of manually creating our own tree files is that we do not depend on the tree files generated by the original studies. Because the majority of authors do not deposit their trees in databases such as TreeBASE or Dryad, most of the raw phylogenetic data is effectively lost, and other projects such as OpenTree cannot incorporate that information into their synthetic tree. Because we rely on the figures rather than the raw data, we can guarantee that we are including all the pertinent phylogenetic knowledge available in the scientific literature, regardless of whether it is stored in databases or not.
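For readers unfamiliar with the Newick format mentioned above, here is a minimal Python sketch of how such a file can be read. It is not the app's actual parser: the function names `parse_newick` and `leaf_names` are hypothetical, the branch lengths in the example string are placeholders rather than real divergence dates, and the sketch ignores quoted labels, comments, and whitespace, which full Newick allows.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples.

    Simplified sketch: no quoted labels, no comments, no whitespace handling.
    """
    text = text.strip()
    pos = 0

    def peek():
        # Current character, or "" when the input is exhausted.
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while peek() == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        # Optional node label: everything up to the next delimiter.
        start = pos
        while peek() and peek() not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":".
        length = None
        if peek() == ":":
            pos += 1
            start = pos
            while peek() and peek() not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if peek() != ";":
        raise ValueError("a Newick tree must end with ';'")
    return root


def leaf_names(node):
    """Collect the names of all terminal taxa, left to right."""
    name, _length, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


# Illustrative family-level fragment; branch lengths are placeholders.
example = "((Hominidae:1.0,Hylobatidae:1.0)Hominoidea:1.0,Cercopithecidae:2.0)Catarrhini;"
tree = parse_newick(example)
print(tree[0])           # name of the root node
print(leaf_names(tree))  # the families at the tips
```

Parentheses group sister lineages, commas separate them, an optional label follows each closing parenthesis, and an optional number after a colon gives the branch length. That is all the structure a hand-written file needs in order to encode a figure from a paper.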
INTEGRATION OF PHYLOGENETIC INFORMATION
The problem with conflicting results
Displaying a single tree that summarizes the current knowledge on evolutionary relationships among organisms is not an easy task, mainly because it is common to find more than one study reconstructing the phylogeny of the same taxonomic group with different topological results. When that happens, it is necessary to acknowledge that the different methods of phylogenetic inference and types of data (molecular or morphological) do not reconstruct phylogenies equally accurately. Because of this, priorities must be formally defined to standardize how studies are selected for the reconstruction of the tree of life presented in the app. In this case, criteria were applied independently for (a) datasets and (b) tree reconstruction method (Figure 2).
-Regarding the type of data
This tree of life is intended to summarize current knowledge on the evolutionary history of organisms as inferred by molecular phylogenetics, and as such, molecular datasets are given priority over morphological ones. The advantages of molecular characters over morphological characters have been demonstrated, and although this issue has been hotly debated for over two decades, discussing this controversial topic is beyond the scope of this entry; we may leave it for a future blog post.
Within molecular data, a further distinction was made between mitochondrial datasets and nuclear datasets. Mitochondrial sequences are widely used in phylogenetic studies mainly because they are easily and cheaply acquired, but their faster substitution rates make them unsuitable for reconstructing deep phylogenetic relationships due to saturation (Brown et al., 1979). Because the tree of life displayed in the app is family-level, phylogenetic trees reconstructed from nuclear sequences are given priority over those from mitochondrial sequences, as they are more accurate for deep phylogenetic relationships (Moore, 1995). Moreover, introgression and incomplete lineage sorting may further compromise the accuracy of mitochondrial datasets for phylogenetic reconstruction, and in such instances multi-locus nuclear datasets are preferred (Pamilo & Nei, 1988).
The size of the dataset and the taxon sampling are also considered when choosing which tree will be applied. Phylogenetic trees reconstructed from a higher number of loci and a more comprehensive taxon sampling are preferred over those reconstructed from fewer loci and poor taxon sampling.
If a taxonomic group has no published molecular phylogeny, morphological phylogenies are used to illustrate evolutionary relationships. If neither is available, taxonomy is used to reconstruct the tree as a simple taxon tree, where each taxon is assumed to be monophyletic. Such groups will remain that way until a proper cladistic analysis is published.
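The taxonomy fallback described above can be sketched in Python as follows. This is only an illustration of the idea, not the procedure actually used for the app: the function name `taxon_tree_to_newick` and the taxon names (`Order_X`, `Suborder_Y`, `Family_A`, and so on) are hypothetical. Each taxon becomes a clade containing its subordinate taxa, and taxa with more than two members become unresolved polytomies.

```python
def taxon_tree_to_newick(name, taxonomy):
    """Return a Newick subtree for `name`, treating every taxon as a clade.

    `taxonomy` maps a taxon name to the list of taxa it contains;
    names missing from the mapping are treated as terminal taxa.
    """
    children = taxonomy.get(name, [])
    if not children:
        return name
    inner = ",".join(taxon_tree_to_newick(child, taxonomy) for child in children)
    return f"({inner}){name}"


# Hypothetical classification with no published phylogeny.
taxonomy = {
    "Order_X": ["Suborder_Y", "Family_C", "Family_D"],
    "Suborder_Y": ["Family_A", "Family_B"],
}

newick = taxon_tree_to_newick("Order_X", taxonomy) + ";"
print(newick)
```

Note that the resulting tree says nothing about the branching order within each taxon; it only asserts that every named taxon is monophyletic, which is exactly the assumption stated above.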
-Regarding the inference method
Model-based methods (Bayesian inference and maximum likelihood) were given priority over maximum parsimony and neighbor joining. Maximum parsimony is considered inferior to model-based approaches because of its susceptibility to long-branch attraction artifacts (Bergsten, 2005). Neighbor joining (Saitou & Nei, 1987) has the lowest priority of all methods because it is purely phenetic. Although it has rarely been used for phylogenetic inference in recent years, it deserves mention here because it is still common in research papers on prokaryote phylogenetics.
There are still many issues to cover regarding the assembly of the tree, such as applying a consensus over different tree topologies, the problem of incomplete taxon sampling in studies with higher priority, and the placement of fossil taxa in a mainly molecular tree of life. Those issues will be discussed in upcoming posts, so be sure to check back on our blog.
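To make the priorities above concrete, here is a small Python sketch of how competing studies for the same group could be ranked. Several things in it are assumptions rather than part of the procedure described in this post: the names `Study`, `priority_key`, and `pick_study` are hypothetical, the candidate studies are invented, and the order in which the criteria are combined (data type first, then inference method, then number of loci, then number of taxa) is only one possible choice, since the dataset and method criteria are described here as being applied independently.

```python
from dataclasses import dataclass

# Lower rank means higher priority.
DATA_RANK = {"nuclear": 0, "mitochondrial": 1, "morphological": 2}
METHOD_RANK = {
    "bayesian": 0,            # model-based
    "maximum_likelihood": 0,  # model-based
    "parsimony": 1,
    "neighbor_joining": 2,
}


@dataclass(frozen=True)
class Study:
    label: str
    data_type: str  # key of DATA_RANK
    method: str     # key of METHOD_RANK
    n_loci: int
    n_taxa: int


def priority_key(study):
    """Sort key: smaller tuples are preferred (assumed lexicographic order)."""
    return (
        DATA_RANK[study.data_type],
        METHOD_RANK[study.method],
        -study.n_loci,  # more loci preferred
        -study.n_taxa,  # denser taxon sampling preferred
    )


def pick_study(studies):
    """Return the study with the highest priority under priority_key."""
    return min(studies, key=priority_key)


# Invented candidates for a single taxonomic group.
candidates = [
    Study("A", "mitochondrial", "bayesian", n_loci=2, n_taxa=120),
    Study("B", "nuclear", "maximum_likelihood", n_loci=5, n_taxa=80),
    Study("C", "nuclear", "parsimony", n_loci=12, n_taxa=90),
    Study("D", "morphological", "parsimony", n_loci=0, n_taxa=150),
]

best = pick_study(candidates)
print(best.label)
```

With this ordering, a nuclear dataset analyzed with a model-based method wins even against studies with more loci or more taxa. A different weighting of the same criteria would be equally easy to express by reordering the tuple in `priority_key`.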
Bergsten, J. (2005) A review of long-branch attraction. Cladistics 21, 163-193.
Brown, W.M., George, M.Jr., Wilson, A.C. (1979) Rapid evolution of animal mitochondrial DNA. Proceedings of the National Academy of Sciences 76, 1967-1971.
Moore, W.S. (1995) Inferring phylogenies from mtDNA variation: mitochondrial-gene trees versus nuclear-gene trees. Evolution 49, 718-726.
Pamilo, P., Nei, M. (1988) Relationships between gene trees and species trees. Molecular Biology and Evolution 5, 568-583.
Saitou, N., Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406-425.
Teeling, E.C., Springer, M.S., Madsen, O., Bates, P., O’Brien, S.J., Murphy, W.J. (2005) A molecular phylogeny for bats illuminates biogeography and the fossil record. Science 307, 580-584.