The honey bee is of great economic and ecological importance due to its role as a major pollinator. It also serves as a model organism for studies into human health, including fields such as allergy and immunity, and is a focus of research into eusociality and group behaviour. All of these traits made the honey bee an attractive candidate for genome sequencing, leading to the generation of the first draft of its genome in 2006. However, this annotation was found wanting, as the number of genes discovered appeared low when compared to other social insects. Christine Elsik from the University of Missouri, USA, Kim Worley from the Baylor College of Medicine, USA, and colleagues present an upgraded annotation of the honey bee genome in their recent study in BMC Genomics, revealing around 5000 more protein-coding genes than the previous annotation. Elsik and Worley explain more about what their results revealed, the issues around the first draft of the honey bee genome, and what lessons can be learned for future annotations.
What was the inspiration for this project?
People who work with unfinished genome sequences find gaps in the sequence that can cause errors in the translated proteins and problems for studies of non-coding sequences. These issues occur regardless of the sequencing technology used and are found in most sequenced genomes including the original honey bee genome. Only a handful of genomes have been ‘finished’ to a quality of one error in 10,000 base pairs, the standard of the human reference genome. The honey bee genome seemed particularly problematic, because parts of the original genome that were AT-rich were missing in early sequence data and were targeted for improvement, and the number of gene annotations seemed low compared to Drosophila and later sequenced Hymenoptera.
Why did people think the old honey bee genome assembly was poor?
In addition to the issues with draft assemblies we noted above, the low number of gene annotations suggested parts of the assembly were missing or assembled incorrectly. However, annotation also depends upon an accumulation of data from expressed sequences (RNAseq) and genomes from other species for comparison. Both types of data were limited for the original honey bee genome project. It predated the current sequencing technologies, whose falling costs have fostered more RNA sequencing, and it was the first Hymenoptera genome project. Only Dipterans (Drosophila and mosquito) and silkworm were available for comparative studies.
Why did it take so long to re-annotate the genome and what advances have now facilitated this re-annotation?
In an ideal world re-annotation would have been faster. The work was much more than running gene prediction software. The annotation process itself was a research project, and combining gene sets contributed by different sources required extra effort to reformat datasets. We tested several approaches and many different parameters, and adapted our methods when new datasets became available. We extensively evaluated alternatives before selecting a final gene set, and then further evaluated the selected set. We anticipate that re-annotation will be faster in the future because tools for RNAseq-based annotation are improving and methods described in each genome re-annotation publication will provide guidance for future projects.
Total gene number was used to infer the problems with the original annotation, and the final gene number after re-annotation was close to that originally predicted. Are we better at predicting gene number or do we simply have more species for comparison?
Both. The evidence (RNAseq data and comparative species) is very important for improving the annotation, but the tools available to make use of these data have also evolved and improved. We think that there is still potential for improvement. We wonder if tuning the gene prediction algorithms to the GC content domains would improve the annotation further. Perhaps genes in AT-richer domains have different features from genes in GC-richer domains, and using different prediction parameters in the different domains would identify genes that are missed when the predictors are tuned to the genome average.
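To make the idea of GC content domains concrete, here is a minimal Python sketch that splits a sequence into fixed windows and labels each one as AT-rich or GC-rich. It is an illustration only, not the method used in the study: the function names, the window size, the 0.5 threshold and the toy sequence are all assumptions chosen for readability.

```python
def gc_fraction(seq):
    """Return the fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Avoid division by zero for empty or fully ambiguous windows.
    return gc / total if total else 0.0


def gc_domains(seq, window=10, threshold=0.5):
    """Label non-overlapping windows of seq by GC content.

    window and threshold are illustrative assumptions, not values
    taken from the honey bee annotation.
    """
    domains = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        frac = gc_fraction(chunk)
        label = "GC-rich" if frac >= threshold else "AT-rich"
        domains.append((start, start + len(chunk), round(frac, 2), label))
    return domains


# Toy sequence (hypothetical): an AT-rich block, a GC-rich block, an AT-rich block.
toy = "ATATATTAAT" + "GCGCGGCCGC" + "ATTTAAGCAT"
for domain in gc_domains(toy):
    print(domain)
```

In a domain-aware annotation, each labelled window could then be passed to a gene predictor configured for that GC class, rather than to one predictor tuned to the genome average.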
You ruled out the presence of significant amounts of repetitive sequences in the honey bee genome. Do you think next generation sequencing techniques have resolved problems with repetitive sequences encountered in early genome sequencing attempts?
Next generation sequencing technologies are much less expensive per base than earlier Sanger sequencing, so projects can have much deeper raw sequence representation and unique sequences are of good quality. However, next generation short read technologies are less capable of dealing with repetitive sequences. Sequence reads need to be long enough, and of high enough quality, to be uniquely placed. To step through longer repeat sequences, reads or pairs of reads need to be long enough, or widely enough spaced with a reliable inter-pair distance. Short reads are often too short to do this. Longer read sequencing technologies are very helpful in this context.
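The point about unique placement can be illustrated with a small Python sketch. It uses exact string matching on a toy genome containing two copies of a repeat. Real aligners tolerate mismatches and use paired-read distances, so this is a simplification, and the function name and sequences are hypothetical.

```python
def placements(genome, read):
    """Return every start position where read matches genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping matches are also found.
        start = genome.find(read, start + 1)
    return positions


# Toy genome (hypothetical): two copies of a repeat separated by unique sequence.
repeat = "ACGTACGTAC"
genome = "TTGCA" + repeat + "GGATC" + repeat + "CCTAG"

short_read = repeat[2:8]            # lies entirely inside the repeat
long_read = "GCA" + repeat + "GGA"  # spans the repeat and both unique flanks

print("short read placements:", placements(genome, short_read))
print("long read placements:", placements(genome, long_read))
```

The short read matches inside both repeat copies and so cannot be placed unambiguously, while the longer read reaches into the unique flanking sequence and matches only once.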
Why do you think repetitive sequences are so rare in honey bees?
The paucity of repetitive DNA in the honey bee genome remains a puzzle. The Honey Bee Genome Sequencing Consortium postulated the genome was low in retrotransposons due to haplodiploidy. Haploid drone genomes exposed to selection every generation would not tolerate disruption by retrotransposons (Nature. 2006, 443, 931–949). However, more recent Hymenoptera genome projects have reported larger numbers of retrotransposons in other organisms with similar haplodiploid lifestyles. Our analysis and the previous analysis (Nature. 2006, 443, 931–949) suggest that transposable elements were active and present in higher numbers in the past.
Apis mellifera has a few other unusual genome characteristics, including a high recombination rate and low and heterogeneous GC content with genes biased to lower than average GC content regions of the genome. Understanding evolutionary processes that have contributed to these characteristics may provide insight into the low repetitive DNA content.
Is the annotation of the honey bee genome an isolated case, or are there likely to be other poorly annotated genomes that could benefit from the same treatment? How wary should end users be of genome data?
Users of any data should be wary of the data quality, and genome sequences and annotations are no exception. Trust but verify. Often people view the data in a genome browser and take the view as fact rather than drilling into the particulars of the underlying data to see which regions are more reliable and which are less reliable (having gaps and low quality sequence). We have ongoing efforts to improve the contiguity of existing genome sequences with PacBio sequence and the PBJelly tool (PLoS One. 2012, 7 (11) e47768) and we have a number of low coverage Sanger genomes that have been improved and are being prepared for publication.
What are the lessons to be learned for future genome annotation projects?
Efforts to improve genomes and genome annotations are useful exercises that depend upon better underlying data (RNA sequence and comparisons with other high quality genome sequences) as well as improved automated annotation methods. An ongoing challenge is that with new types of data there is always a need to evaluate and revise computational approaches. For example, genome-guided reconstruction of transcripts from RNAseq has been improving over the last couple of years, and future genome annotation projects need to leverage the most recent advances.