The Matérn kernel is one of the most widely used covariance kernels in Gaussian process modeling; however, large-scale computations have long been limited by the cost of dense covariance matrix calculations. As a sequel to our recent paper [Chen et al. 2012], which designed a tree code algorithm for efficiently performing matrix-vector multiplications with the Matérn kernel, this paper documents the parallel design and the software implementation of that algorithm. The parallelization focuses on data and workload balancing and uses MPI passive one-sided protocols for communication. The software, implemented in C++, provides a flexible interface with rich functionality, together with examples that demonstrate how to extract performance diagnostics. The code is intended to be used as a building block for statistical calculations in which matrix-vector multiplication is among the most expensive computational components.
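
For orientation, the dense baseline that motivates this work can be sketched as follows. This is only an illustrative sketch, not the tree code algorithm and not the interface of the software described here: the names `Point`, `matern32`, and `dense_matvec` are hypothetical, the smoothness is fixed at 3/2 (where the Matérn kernel has a simple closed form), and the variance is taken to be 1. The double loop makes the quadratic cost in the number of points explicit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical point type for this sketch: one coordinate vector per point.
using Point = std::vector<double>;

// Matern kernel with smoothness 3/2 and unit variance (an assumption of
// this sketch): k(r) = (1 + sqrt(3) r / ell) * exp(-sqrt(3) r / ell),
// where r is the Euclidean distance between a and b.
double matern32(const Point& a, const Point& b, double ell) {
    assert(a.size() == b.size());
    double r2 = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double d = a[k] - b[k];
        r2 += d * d;
    }
    const double s = std::sqrt(3.0 * r2) / ell;
    return (1.0 + s) * std::exp(-s);
}

// Dense matrix-vector product y = K x with K(i, j) = k(pts[i], pts[j]).
// Entries are computed on the fly, so memory stays linear in n, but the
// work is quadratic in n: one kernel evaluation per pair of points.
std::vector<double> dense_matvec(const std::vector<Point>& pts,
                                 const std::vector<double>& x,
                                 double ell) {
    assert(pts.size() == x.size());
    const std::size_t n = pts.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            y[i] += matern32(pts[i], pts[j], ell) * x[j];
        }
    }
    return y;
}
```

Calling `dense_matvec(pts, x, ell)` with n points performs n squared kernel evaluations, which is the cost that a fast matrix-vector multiplication scheme aims to avoid.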

%G eng %1 http://www.mcs.anl.gov/papers/P5015-0913.pdf