Research moves at the speed of availability

October 31, 2012

Researchers sequence hundreds of genomes across the broad tree of life to explore diversity. They also use sequence data to examine the differences between closely related microbial strains to understand evolving variation and adaptation. There are several important outcomes from such work, including finding ways to prevent, diagnose and fight threats such as SARS, the H1N1 flu virus and deadly E. coli outbreaks. In combating such quickly emerging hazards, speed in identifying the culprit and how best to attack it is critical for both scientists and the general public.

To benefit from sequence data, researchers must perform several steps. To begin, they assemble the sequenced fragments of DNA as completely and as accurately as possible into a genome, and then annotate the reconstructed genome. Genome annotation is the process of attaching meaningful information to a genome's sequence. It involves first identifying candidate regions most likely to be genes, and then determining what the code in those regions probably does. Annotation assigns several kinds of information to each region: biological and biochemical function, regulation and interactions, and expression. An assembled genome without annotations is as meaningful as a completed jigsaw puzzle with no images stamped onto the pieces.
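As a rough illustration of the first annotation step, finding candidate regions likely to be genes, the following Python sketch scans the forward strand of a DNA string for open reading frames: stretches that begin with a start codon and run, in the same reading frame, to a stop codon. This is a toy example under simplifying assumptions (only ATG as a start codon, forward strand only, a hypothetical `find_orfs` helper and `min_codons` threshold); it is not how RAST or any production gene caller works.

```python
# Toy illustration only: real annotation pipelines use far more
# sophisticated gene callers than this simple scan.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the ATG, end is the index just past the stop
    codon, and frame is 0, 1 or 2. min_codons counts codons from the
    ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open start codon, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close this ORF and look for the next
    return orfs
```

Calling `find_orfs("CCATGAAACCCGGGTAACC")` reports one candidate region, from the ATG at index 2 through the TAA stop. Deciding what such a region probably does is the second, much harder annotation step.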

A suite of Argonne-supported tools has been widely adopted by researchers to store and analyze microbial data. The Rapid Annotation using Subsystem Technology (RAST) server is a fully automated service providing high-quality genome annotations for a diverse assortment of prokaryotes (bacteria and archaea). It delivers an annotation in approximately 6 to 12 hours using data from another resource called The SEED. The SEED environment and data structures (most prominently sets of similar proteins that share a role) are used to compute the automatic annotations. The resulting genome annotation includes a mapping of genes to subsystems (such as metabolic pathways) and a metabolic reconstruction of those pathways. A third resource, Model SEED, continues the analysis by offering high-throughput generation, optimization and analysis of genome-scale metabolic models. Metabolic models help researchers understand what microbes can do, how they grow, and how they obtain energy from nutrients.

Because time is of the essence in submitting and retrieving critical data and results, researchers need access to large amounts of data to be as easy as possible. In the past, data could be submitted only through a web interface that accepted a single genome at a time. Further, researchers have had difficulty maintaining current copies of the SEED data and code at remote locations because downloads were cumbersome and time-consuming. Several users have been limited in their use of the RAST tools by low throughput, long processing times, and bandwidth constraints.

Now, a redesign of the SEED's servers has solved most of these issues, opening up high-performance remote access to the SEED database and offering programmatic access to the RAST server. Through programmatic access, users can now quickly submit genomes in bulk and retrieve them the same way. Data is always up to date, and users can request a particular subset of data and retrieve it easily. In addition to faster processing, the redesign allows investigators to build their own SEED-based tools and gives users more security and privacy, which is especially important for scientists working with sensitive information.

The RAST annotation server now supports the annotation of several hundred genomes per month and has been used to annotate more than 50,000 viral and prokaryotic genomes since 2007. The SEED family of resources now collectively houses almost 5,000 distinct prokaryotic genomes associated with 30,000 annotations, 11,000 metabolic models, 178,000 protein families, 10,250 functional roles, and 1,060 subsystems. In other words, it is a well-equipped arsenal for identifying and understanding microbial life, not only to battle biothreats but also to repurpose beneficial microbes to solve humanity's environmental and energy-production problems.

Access to the servers, the underlying data, and the code is free to all users. The redesign and new functionality are described in full in a recent PLOS ONE publication titled "SEED Servers: High-Performance Access to the SEED Genomes, Annotations, and Metabolic Models," which was written by a team of scientists affiliated with Argonne, The University of Chicago, and the Fellowship for Interpretation of Genomes (FIG).

This work was supported in part with Federal funds from the National Institutes of Health, the Department of Health and Human Services, the U.S. Department of Energy Office of Science, and the National Science Foundation.