To a computational biologist, a string is a DNA sequence made up of the four basic nucleobases. Scientists are interested in the set of all possible subsequences of length k contained in such a string – a term they call k-mer.
Scientists count the number of k-mers for many reasons. For example, k-mer counting can help identify the similarities in genomic samples or validate a newly processed genome. But as genomic datasets become increasingly larger, k-mer counting faces two limitations: the amount of memory used in running a program and the amount of time taken in loading genomic data into memory efficiently.
To address these limitations, an international team of researchers from Argonne National Laboratory, four universities, and two supercomputing centers have developed Bloomfish – a highly scalable, memory-efficient k-mer counting framework.
“Bloomfish may seem like a strange name, but it comes from a popular technology that we leverage, called Jellyfish,” said Yanfei Guo, an assistant computer scientist in Argonne’s Mathematics and Computer Science Division. “We coined the name Bloomfish to reflect the fact that a bloom of jellyfish is a large collection that coexist with each other. Whereas the Jellyfish approach is based on a single-node counting framework, our approach works on a large collection of nodes and parallelizes them.”
For solutions that can run efficiently on distributed-memory supercomputers, the research team had to deal with another issue: performance loss due to I/O. In the traditional discrete I/O model, files are partitioned to different processes, often resulting in processes getting different amounts of data and making the total processing time limited to that of the slowest process. Instead, the researchers developed a stream I/O model for Bloomfish, in which files are viewed as segments of a continuous stream.
But even when the data size for each process is the same, the read time can vary greatly. The solution in this case was work stealing. “Again, the name may seem a bit odd, but the idea is really quite an old one in parallel programming,” Guo said. “If one process finishes the work, it will steal from other processes. This can reduce the impact of processing time skew caused by I/O performance variability.”
The success of these and other changes was demonstrated on a series of tests on the Tianhe-2 supercomputer in China. The results, using the 1000 Genomes dataset as well as a synthetic dataset, show that Bloomfish achieves unprecedented scalability: it can count 22-mers of a 24-terabyte dataset with 3,072 processes in about one hour.
“To the best of our knowledge, these are the largest datasets ever reported for k-mer counting problems,” said Guo.
For further information, see the paper: T. Gao, Y. Guo, B. Wang, Y. Lu, P. Cicotti, P. Balaji, and M. Taufer, “Bloomfish: A highly scalable distributed k-mer counting framework,” in the IEEE International Conference on Parallel and Distributed Systems, Shenzhen, China, 2017.