Big data has enabled researchers in life science to tackle complex problems in medicine, identify the drivers of diseases, and recommend potential treatments. But the huge amount of biomedical data also presents challenges: the data must be easily findable, accessible, interoperable, and reusable, or FAIR.
Achieving FAIRness is not an easy task. Analysis methods seldom scale to big data. Formats for referencing data differ. Data is often distributed across institutions, requiring authentication with diverse institutional credentials. And common interfaces for accessing data are lacking.
But a team of researchers led by Argonne National Laboratory has now demonstrated that these difficulties can be overcome by leveraging a set of existing tools that facilitate data access, analysis, and sharing regardless of scale or location.
“We wanted to use existing tools rather than requiring scientists to master yet another set of tools,” said Ravi Madduri, a computational scientist in Argonne’s Data Science and Learning division.
The tools selected included BDBag and Minid, for referencing and locating data and exchanging names; Globus, for data management, authentication and authorization; Globus Genomics, for parallel cloud-based computation; and Docker, for capturing a complete software stack in a “container” form that can be run on many platforms.
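BDBag builds on the BagIt packaging convention, in which a dataset is bundled as a "bag": a payload directory plus a declaration file and a checksum manifest that lets any recipient verify the contents. As a rough illustration of that underlying structure (a minimal standard-library sketch, not the BDBag API itself), a BagIt-style bag can be created and verified like this:

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(bag_dir: Path, files: dict) -> None:
    """Create a minimal BagIt-style bag: a data/ payload directory,
    a bagit.txt declaration, and an MD5 payload manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in files.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload file's checksum and compare it
    against the manifest entry."""
    for line in (bag_dir / "manifest-md5.txt").read_text().splitlines():
        digest, rel_path = line.split("  ", 1)
        if hashlib.md5((bag_dir / rel_path).read_bytes()).hexdigest() != digest:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        make_minimal_bag(bag, {"reads.txt": b"ACGT\n"})
        print(verify_bag(bag))  # True
```

Because the checksums travel with the data, a bag can be copied across systems and re-verified anywhere; BDBag adds to this the ability to reference remote files by identifier (such as a Minid) rather than including them in the payload.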
“Together these tools form a kind of data commons, providing a shared virtual space for handling big data computations easily and reliably,” Madduri said.
To demonstrate what can be achieved in this space, the team collaborated with researchers from the University of Chicago, the University of Southern California, and the Institute for Systems Biology in Seattle on a case study in which they retrieved large datasets from the ENCODE public repository and used parallel cloud and workstation computation to identify candidate transcription factor binding sites. These sites are important in understanding the mechanisms involved in gene regulatory networks.
In addition to using quantitative methods to measure FAIRness, the researchers asked 11 independent researchers to reproduce an analysis using the tools and a set of instructions. Of the 11 participants, 10 were able to successfully complete the analysis; one participant had trouble installing the programming language R needed to verify the results.
“The fact that most of these participants were able to reproduce our results is encouraging,” Madduri said. He added that “the feedback we received from the participants will be invaluable in improving the tools.”
For a full description of the work, see the paper “Reproducible big data science: A case study in continuous FAIRness,” by Ravi Madduri, Kyle Chard, Mike D’Arcy, Segun C. Jung, Alexis Rodriguez, Dinanath Sulakhe, Eric W. Deutsch, Cory Funk, Ben Heavner, Matthew Richards, Paul Shannon, Gustavo Glusman, Nathan Price, Carl Kesselman, and Ian Foster, PLOS ONE, April 11, 2019. https://doi.org/10.1371/journal.pone.0213013.
For an interesting article about the work, see the University of Chicago article.