Conifers are known to have large and highly complex genomes in the range of 20 to 40 Gbp. One member of the group, the loblolly pine (Pinus taeda), is the second most common tree species in the USA, making it vital to American forestry, and is also a feedstock for the generation of biofuels. With over 1.5 billion loblolly pine seeds planted each year, a large majority of which have been genetically bred for improvement, this pine tree was an ideal candidate for the generation of a reference genome for conifers. In a recent study in Genome Biology, Charles Langley and David Neale from the University of California, Davis, USA, Jill Wegrzyn from the University of Connecticut, USA, Steven Salzberg from Johns Hopkins University, USA, and colleagues describe how they sequenced and assembled the first full-length genome of the loblolly pine, making this the longest genome sequenced to date at 22.18 Gbp. Here Langley, Salzberg, Neale and Wegrzyn discuss how they overcame the challenges associated with sequencing such a large genome.
Why is loblolly pine an important species to study and what led you to sequence its genome?
SS: Loblolly pine is the number one commercial tree species in the USA, used for a wide range of products, especially paper and construction timber.
DN: Loblolly pine has been used extensively in genetic studies because of the availability of multi-generation pedigrees developed by the breeding cooperatives. Thus, all kinds of useful genetic resources were available in loblolly pine that would not be found in other pine/conifer species.
CL: Like a number of other reference genome sequences, the loblolly genome serves as a solid and fertile foundation for investigations at many levels, from pathogen resistance and efficient breeding to the comparative genomics of terrestrial plants. From a technical perspective, this sequencing project moves the scale and integration of technologies involved in next-generation whole genome sequencing (NG-WGS) up a level. Also noteworthy is the fact that this genome sequence was created in a collaboration among a few modest laboratories rather than at a large sequencing center.
My own motivation for contributing to this project derives from its value in the study of population genomics. Natural populations of loblolly pine are large and well-studied for many interesting traits. This makes them ideal for testing population genetics theories. Studies to understand the origin, maintenance and divergence of the underlying genomic variation depend on this high quality reference sequence.
What challenges did you encounter when sequencing and assembling the loblolly pine genome, and what strategies did you take to overcome these challenges?
CL: While the increasing cost efficiency of present-day next-generation sequencing (NGS) made the direct cost of sequencing such a large genome manageable, the complexity and heterozygosity of the available DNA made the assembly daunting. By choosing to conduct most of the sequencing in the haploid genome of a single gamete (pine nut) of the target tree and by very effectively error-correcting and pre-assembling the mountain of reads, we were able to present the state-of-the-art assembler with a manageable scale of input data.
As mentioned above this project was conducted in several small labs. Creative and effective planning, open exchange and strong, focused collaborative commitment were each necessary but not always easy to achieve among fiercely independent scientists.
DN: This is a very key point and the credit goes to Chuck Langley for understanding the importance of open and constant dialogue among team members. This led to a very creative process that would not have been achieved otherwise.
SS: The enormous size of the genome was the main challenge. At the time we started, no existing software could assemble a genome of this size – it would simply exceed the memory capacity of any available computer and then crash. The assembly team, at the University of Maryland and Johns Hopkins University, USA, developed a new algorithm that could reduce most of the data by about 100-fold, which was critical to getting the genome put together.
We also began developing a new method to use fosmids – small genomic chunks about 38 kilobases in length – as an aid to assembly. We found that we can pool together as many as 5,000 fosmids and then disentangle them computationally. This approach is still in development, but we’ve already used it for part of the loblolly assembly.
The use of a haploid genome was also key: it’s rare to be able to get haploid (rather than diploid) DNA for a multi-cellular organism. The biology of the pine tree helped us out here: a pine nut contains a significant quantity of haploid DNA.
What is the importance of generating a high quality genome assembly, and how does the quality of the loblolly pine genome assembly compare with other sequenced plant species?
CL: It is widely recognized that a full high quality reference sequence can drive rapid advances. It is less well appreciated that an incomplete reference genome rife with errors can waste precious talent and effort, ultimately slowing and diverting science.
While the present loblolly pine sequence is incompletely assembled, it is a solid foundation. The error rate is low. But version 2.0 is ‘baking in the oven’.
SS: A high quality assembly provides the basis for a great variety of downstream research. Once we have the assembly in hand, we can identify all the genes and then begin to link genes to phenotype, as we have been doing for more than a decade now with the human genome. It all starts with the genome itself.
DN: The quality and open access approach used with the loblolly genome means that it will serve as the reference for about 400 conifer genomes that will be sequenced in the years ahead.
How did the high quality of the loblolly pine genome assembly affect gene annotation and the insights gained into gene family evolution?
JW: The combination of a high quality genome assembly with long scaffolds and a comprehensive transcriptome generated from multiple tissue types provided evidence to describe over 50,000 genes. Several conifer genes have long introns that exceed 20 kb in length and these would have been difficult to identify with shorter scaffolds. The full-length genes allowed us to perform comparisons with protein sequences from several fully sequenced plant genomes and further investigate those specific to pine.
How were you able to utilize the genome assembly to identify genes underpinning important traits, such as disease resistance?
JW: John Davis at the University of Florida, USA, and his colleagues identified a single nucleotide polymorphism (SNP) associated with fusiform rust resistance in loblolly pine. This genetically mapped SNP was originally identified in a partial expressed sequence tag (EST). Availability of the genome and transcriptome positively identified the partial EST as a Toll-Interleukin Receptor / Nucleotide Binding / Leucine-Rich Repeat (TNL) gene. Analysis of orthologous proteins from several plant species indicated that this gene belongs to a class of TNLs that have expanded in conifers.
How do you think the availability of the loblolly genome sequence and assembly will aid future research?
CL: It will enable functional genomics in conifers and genomic selection (modern breeding). It will be an essential component of plant comparative genomics and will also serve as the essential reagent in population genomics investigations and genome-wide association studies.
DN: It will provide a genetic resource for ecological genomics research that will facilitate better management of forests under changing climate conditions.