Phylogenetic analysis of molecular sequence data used to be confined to researchers in the field of evolutionary biology. However, research involving molecular evolution is now very much an interdisciplinary field, bridging many areas from computer science, ecology, and evolutionary biology to population genetics, molecular biology, biochemistry, and physical chemistry. The field is also moving fast, with an overwhelming variety of ever more advanced models and methods available to analyse data. As a result, it is challenging for researchers to make sufficiently educated choices about which models and methods are most suitable for their data and purposes.
In response to this situation Maria Anisimova from ETH Zurich, Switzerland, and her fellow Section Editors on BMC Evolutionary Biology published a recent Editorial setting out some general guidelines to help with methodology choices at different stages of a typical phylogenetic ‘pipeline’. Anisimova explains more about the reasons behind this and related issues in evolutionary biology.
Anisimova is a senior research fellow and lecturer in the Computer Science department of ETH Zurich, where her main interests lie in the fields of bioinformatics, evolutionary genomics and computational molecular evolution. She has a wide range of interests spanning the theoretical aspects of modeling molecular evolution as well as data-driven applications of new methodologies.
What spurred you and your fellow Section Editors at BMC Evolutionary Biology to write an Editorial on how to conduct a rigorous phylogenetic study?
Serving as Editors, we receive numerous submissions with clear flaws in their phylogenetic analyses. Today, even for researchers with experience in phylogenetics, it is becoming increasingly hard to find their way through the many sophisticated methodological developments available, especially given the interdisciplinary nature of the field. Appreciating this difficulty, and motivated to improve the quality of phylogenetic analyses in submitted manuscripts, we decided to write an Editorial discussing some basic principles necessary for a successful phylogenetic study.
Will it be relevant to researchers who don’t class themselves as traditional ‘evolutionary biologists’?
Certainly! Evolutionary (and phylogenetic) analyses are increasingly taking the role of important building blocks in traditionally non-evolutionary fields, including molecular biology and functional genomics. This is because people have gradually started realizing their value, particularly when working with new high-throughput data. New fields like evolutionary medicine and evolutionary ‘omics’ are emerging. Indeed, looking at changes through time gives a study a powerful new dimension.
The time scales people look into may vary from hundreds of millions of years (e.g. for ancient speciation or duplication events) to hours and days, as for example in studies of serial viral samples from infected patients. The molecular data we observe today is due to events of the past. So by studying the events of the past, we can uncover some intricate molecular rules, impossible to discover without a glance into the molecular history. We strongly believe in the value of an evolutionary perspective and that this will be an important part of the future trajectory of molecular biosciences. Note that among the books recommended in our Editorial are two edited volumes ‘Evolutionary Genomics: Statistical and Computational Methods’ published by Springer in 2012. These books provide a solid and up-to-date overview of current research directions (albeit not exhaustive), where the theme of ‘evolution’ creeps in and helps to make serious advances in genomics and omics.
What are the common mistakes and pitfalls of the phylogenetic tree-building process?
Typically these are due to poor choices of tree-building methods and models, not dictated by objective scientific reasons. Often the use of outdated methods and software becomes problematic. Other, more intricate issues include the distinction between the concepts of species trees and gene trees, as well as the lack of evidence for tree signal in data affected by ‘non-tree-like’ processes such as recombination, lateral gene transfer and gene conversion. These scenarios require more careful consideration and the use of non-standard methodology. Yet, even when ‘non-tree-like’ processes affect the evolution of molecules, the binary phylogeny model can still be successfully used (e.g. see part I of volume 2 of the Evolutionary Genomics book), without needing to resort to more mathematically complex methods that rely on the inference of a network structure rather than a tree.
Other issues include tree rooting and the interpretation of rooted versus unrooted trees. Lastly, a tree can be built for any set of sequences, but it does not necessarily reflect homologous relationships. Ultimately, research today, including the models used and their justification, will be subject to continuous revision that may call into question some current best practices.
How important is data sharing to the progress of phylogenetic research?
Generally in science, data sharing is crucial. It ensures transparency and reproducibility, and facilitates faster progress, so that researchers build new projects upon key results and inventions rather than reinventing the wheel multiple times. Phylogenetic research is no exception to this rule.
What do you think the main obstacles to data sharing are for the phylogenetics community?
A cornerstone of data sharing is the existence of unifying and widely accepted databases and nomenclatures for various biological and bioinformatics data. The challenge is to maintain a framework that allows for easy, efficient access to up-to-date data (including embedded complex relationships), convenient data representation, and ideally related standard services. Bio-code sharing is part of the data sharing issue, but this has been gradually addressed by freelance communities that create, maintain and error-check specialized bioinformatics/phylogenetic open-source libraries and software.
One serious obstacle to realizing efficient data sharing solutions is the absence of consistent funding schemes from various international governmental sources. As the quantity and the complexity of biological data grow, data infrastructure solutions (storage, representation, availability) require substantial time investment, but cannot be funded by standard research grants, as this work is not classified as research. Limited possibilities for funding the development and maintenance of databases and software already exist, but are not yet sufficient to address the growing needs of this booming field.
How do you see data management developing in the future in response to the need to share data?
The future of data management lies with collaborative consortia that aim towards standardization of data formats and nomenclature, as well as towards more structured approaches to data handling, such as the Semantic Web project. For bioinformatics and genomics specifically, large international projects will facilitate progress towards wider data integration and systems approaches if they can provide a platform for bringing together heterogeneous, complex data structures, incorporating the dependencies and other specifics of biological data. A successful framework needs to rely on close interdisciplinary international co-operation. The challenge, however, is for such large projects to remain dynamic – ready to replace outdated methodology or data structures with state-of-the-art versions – or otherwise risk becoming hostages to their own shortcomings.
Questions from Emilie Aime, Executive Editor for the BMC Series, and Elizabeth Moylan, Biology Editor for BioMed Central.