It's All in the Data
From High-throughput Data Retrieval Towards Biomolecular Network Analysis in Health and Disease
Biological databases of high-throughput experimental results provide vast and growing resources for medical and bioinformatic research. Open questions remain in how best to maintain such resources, access them computationally, meta-analyze their contents across hundreds of experiments, and do so reproducibly while following computational best practices. We present ARepA, an extensible, modular Automated Repository Acquisition system for reproducible biological data acquisition and processing. ARepA allows configurable data access for any organism(s) from the GEO, IntAct, BioGRID, RegulonDB, STRING, Bacteriome, and MPIDB databases.
A user can retrieve raw data and metadata from these repositories, normalize data files, and automatically process them in standardized ways (e.g., for network analysis). When retrieving data from six model organisms, ARepA currently produces more than 2M interactions (600K physical interactions, 4K regulatory interactions, 1.5M functional associations) and 2.7K gene expression datasets covering approximately 800K samples, accompanied by corresponding metadata and derived network data. We include biological examples demonstrating the utility of ARepA for integrative analyses. When focusing on human data, ARepA's metadata database allowed us to identify and standardize 12 human prostate cancer gene expression datasets from GEO, which were subsequently meta-analyzed across six different platforms.
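As a rough illustration of how expression datasets from different platforms might be meta-analyzed into a single co-expression network, the sketch below combines per-dataset Pearson correlations with Fisher's z-transform, weighting each dataset by its sample count. This is a generic technique chosen for illustration and not necessarily the procedure used in this work. The function name `coexpression_meta`, the edge threshold, and the toy matrices are all hypothetical, and the sketch assumes every dataset has already been mapped to the same gene rows in the same order and has more than three samples.

```python
import numpy as np

def coexpression_meta(datasets, threshold=0.8):
    """Combine gene-gene Pearson correlations across datasets via Fisher's z.

    datasets: list of (genes x samples) arrays sharing the same gene order
              (an assumption of this sketch).
    Returns the combined correlation matrix and a list of (i, j) edges
    whose absolute combined correlation reaches the threshold.
    """
    z_sum = None
    weight_sum = 0.0
    for expr in datasets:
        n_samples = expr.shape[1]
        r = np.corrcoef(expr)                    # genes x genes correlations
        r = np.clip(r, -0.999999, 0.999999)      # keep arctanh finite (diagonal is 1)
        z = np.arctanh(r)                        # Fisher z-transform
        w = n_samples - 3                        # inverse variance of Fisher z
        z_sum = w * z if z_sum is None else z_sum + w * z
        weight_sum += w
    r_meta = np.tanh(z_sum / weight_sum)         # back-transform the weighted mean
    np.fill_diagonal(r_meta, 0.0)                # ignore self-correlation
    strong = np.triu(np.abs(r_meta) >= threshold, k=1)
    edges = [(int(i), int(j)) for i, j in np.argwhere(strong)]
    return r_meta, edges

# Toy data (hypothetical): three genes measured in two datasets.
toy_a = np.array([[1, 2, 3, 4, 5, 6],
                  [2, 4, 6, 8, 10, 12.5],
                  [1, -1, 1, -1, 1, -1]], dtype=float)
toy_b = np.array([[5, 3, 8, 1, 9],
                  [10.2, 6.1, 15.9, 2.0, 18.1],
                  [1, 2, 2, 1, 1.5]], dtype=float)
r_meta, edges = coexpression_meta([toy_a, toy_b])
print(edges)
```

Weighting by sample count lets larger datasets contribute more, while the z-transform keeps the average from being dominated by correlations near the extremes.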
A subsequent co-expression network analysis correctly recovered the NFκB signaling pathway along with new candidate genes with roles in prostate cancer. A similar example in mouse integrates 11 gene expression datasets selected by querying ARepA for metadata indicating germ-free and intestinal tissue conditions. Finally, we provide the first steps toward computational recovery of mechanistic pathway components specific to the NFκB pathway as perturbed in prostate cancer. We leveraged recent advances in Bayesian data integration to simultaneously provide information specific to biological contexts and individual biomolecular mechanisms. We applied this method to identify mechanisms of interaction surrounding NFκB during its activity in cell death, inflammation, adhesion, and differentiation. Using ARepA, we integrated 18 prostate cancer-specific expression datasets and 860 non-disease expression and protein interaction datasets.
Prior knowledge from PathwayCommons was further included to infer genome-wide networks for 442 biological processes (Gene Ontology), covering 7 mechanisms of interaction ranging from general functional relationships through specific physical and regulatory activities. Among all inferred networks, we focused on the 11 biological context networks most informative in prostate cancer, as summarized above. The cell death network so far includes several of the highest-confidence links between NFκB1 and genes such as CCL2, HDAC1, TNF, and IKBKB. We are currently extending the inference process to encompass additional genes, data, and experimental follow-up. This computational method scales readily to integrate thousands of experimental results and to identify the data most informative about specific putative mechanisms of interaction in pathways surrounding genes of interest in cancer.
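To make the idea of Bayesian data integration more concrete, the following minimal sketch uses a naive Bayes classifier over gene pairs. It is one common way to combine heterogeneous evidence and is offered as an assumption, not as the exact model used here. The function names `train` and `posterior`, the binary evidence encoding, the gene labels, and the toy gold standard are hypothetical. Each dataset contributes a log-likelihood ratio for a pair, and pairs a dataset does not cover are simply skipped.

```python
import math

def train(gold, datasets, n_bins=2):
    """Learn a prior and per-dataset conditional probability tables.

    gold:     {gene_pair: 0 or 1} known unrelated/related pairs
              (must contain both labels so the prior is not 0 or 1).
    datasets: {name: {gene_pair: bin index in range(n_bins)}}.
    """
    prior = sum(gold.values()) / len(gold)
    cpts = {}
    for name, evidence in datasets.items():
        # counts[label][bin], initialized to 1 for add-one smoothing
        counts = [[1.0] * n_bins for _ in range(2)]
        for pair, label in gold.items():
            if pair in evidence:
                counts[label][evidence[pair]] += 1.0
        cpts[name] = [[c / sum(row) for c in row] for row in counts]
    return prior, cpts

def posterior(pair, prior, cpts, datasets):
    """Posterior probability that a gene pair is functionally related."""
    log_odds = math.log(prior / (1.0 - prior))
    for name, evidence in datasets.items():
        if pair in evidence:                      # missing data adds no evidence
            b = evidence[pair]
            log_odds += math.log(cpts[name][1][b] / cpts[name][0][b])
    return 1.0 / (1.0 + math.exp(-log_odds))

# Toy example with placeholder gene labels (not real results).
gold = {("g1", "g2"): 1, ("g1", "g3"): 1, ("g2", "g3"): 1,
        ("g1", "g4"): 0, ("g2", "g4"): 0, ("g3", "g4"): 0}
datasets = {
    "coexpression": {("g1", "g2"): 1, ("g1", "g3"): 1, ("g2", "g3"): 0,
                     ("g1", "g4"): 0, ("g2", "g4"): 0, ("g3", "g4"): 1,
                     ("g4", "g5"): 1},
    "physical": {("g1", "g2"): 1, ("g2", "g3"): 1, ("g2", "g4"): 0,
                 ("g4", "g5"): 1, ("g3", "g5"): 0},
}
prior, cpts = train(gold, datasets)
p_supported = posterior(("g4", "g5"), prior, cpts, datasets)
p_unsupported = posterior(("g3", "g5"), prior, cpts, datasets)
print(round(p_supported, 3), round(p_unsupported, 3))
```

Because each dataset enters only through its own log-likelihood ratio, adding further datasets costs one extra table each, which is one reason this family of models scales to large compendia. The magnitude of a dataset's log ratios also gives a simple indication of how informative it is for a given context.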