Some Thoughts on the Creation of a Phenotypic Ontology of Prokaryotes
Predictive models depend on high-quality input data, but not all data are of similar quality, nor are all data amenable to computational analysis without extensive cleaning, interpretation, and normalization. Phenotypic data are a prime example. They are more complex than sequence data, occur in a wide variety of forms, often use non-uniform descriptors that change over time, and are scattered, mainly across the scientific and technical literature or in specialized databases. Incorporating these data into repositories such as the DOE KBase requires not only expertise in harvesting and modeling the data, but also the knowledge to interpret them in the correct biological context. While it is generally agreed that access to such data would be invaluable for genome and metagenome analysis, capturing them is a non-trivial undertaking.
We are approaching this problem in a stepwise fashion, first creating a standardized terminology of phenotypes for Bacteria and Archaea derived from the taxonomic literature. To date, we have developed tokenizers that work well on a variety of document and data types, and these have allowed us to compile a list of approximately 40,000 terms used in the published descriptions of 5,750 type strains of Bacteria and Archaea. These terms are being placed into a phenotypic ontology that will be incorporated into the NamesforLife database. From there, the ontology will become available for transclusion into community resources such as the KBase and into machine-generated descriptions for publication, and it will be served on top of the published literature through annotation services that were initially developed for biological names.
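As a rough illustration of the term-harvesting step, the following minimal sketch tokenizes a few strain descriptions and counts how many descriptions each candidate term appears in. It is not the tokenizer described above: the regular expression, the stop-word list, the function names, and the sample descriptions are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop words; a real list would be curated by domain experts.
STOP_WORDS = {"and", "or", "are", "is", "the", "of", "in", "on", "with", "to", "a", "an"}

# Keep hyphenated descriptors such as "gram-negative" together as one token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(description):
    """Lowercase a description and return its candidate terms in order."""
    tokens = TOKEN_RE.findall(description.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def compile_term_list(descriptions):
    """Count the number of strain descriptions in which each term occurs."""
    counts = Counter()
    for text in descriptions.values():
        # Use a set so a term is counted once per description.
        counts.update(set(tokenize(text)))
    return counts


# Invented sample descriptions, for illustration only.
descriptions = {
    "strain_a": "Gram-negative, motile rods. Catalase-positive and oxidase-negative.",
    "strain_b": "Gram-positive cocci, non-motile. Catalase-positive.",
}
term_counts = compile_term_list(descriptions)
```

Counting descriptions rather than raw occurrences is one possible design choice: it highlights terms shared across many strains, which are natural first candidates for placement in an ontology.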
Terminological services can also be integrated into end-user applications, allowing easier capture of phenotypic data that are normalized and persistently linked to the appropriate bio-sample at the earliest possible point in the discovery process. As with the names and taxonomic concepts applied to the organisms themselves, it becomes feasible to provide this information for other uses as a semantic service via DOIs, automatically resolving any semantic ambiguities that arise over time. It also becomes feasible to index the literature and other digital resources by broad phenotypic concepts rather than by individual descriptive terms.
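The idea of indexing by concept rather than by descriptive term can be sketched as follows. This is a toy example under stated assumptions: the term-to-concept mapping, the concept labels, the document identifiers, and the function name are hypothetical and do not reflect the NamesforLife ontology or its services.

```python
from collections import defaultdict

# Hypothetical mapping from descriptive terms to broader phenotypic concepts.
TERM_TO_CONCEPT = {
    "motile": "motility",
    "non-motile": "motility",
    "flagellated": "motility",
    "catalase-positive": "catalase activity",
    "catalase-negative": "catalase activity",
}


def index_by_concept(doc_terms, term_to_concept=TERM_TO_CONCEPT):
    """Map each concept to the set of documents that use any of its terms."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            concept = term_to_concept.get(term)
            if concept is not None:  # skip terms not yet placed in the mapping
                index[concept].add(doc_id)
    return dict(index)


# Invented documents and extracted terms, for illustration only.
doc_terms = {
    "doc1": ["motile", "rods"],
    "doc2": ["non-motile", "catalase-positive"],
}
concept_index = index_by_concept(doc_terms)
```

In this sketch a search for the concept "motility" retrieves both documents, even though one uses "motile" and the other "non-motile", which is the practical benefit of indexing at the concept level.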