An algorithm for decomposing large tally data in Monte Carlo particle simulations is proposed, analyzed, implemented, and tested in a production Monte Carlo code, OpenMC. The algorithm relies on disjoint sets of compute processes and tally servers: the former simulate particles moving through the geometry, and the latter run in a continuous loop, receiving scores from the compute processes and incrementing tallies. A performance model is developed and shows that, for a range of parameters relevant to light-water reactor (LWR) analysis, the tally server algorithm should perform with minimal overhead on contemporary supercomputers. An implementation of the algorithm in OpenMC was tested on the Intrepid and Titan supercomputers and was demonstrated to perform well over a wide range of these parameters. The tally server algorithm can thus be used to analyze LWR models at a level of fidelity that was previously not possible because of the need to replicate tally memory across all processors.
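The division of labor described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the OpenMC implementation: Python threads and queues stand in for MPI processes and messages, the round-robin `owner` mapping of bins to servers is hypothetical, and the "physics" in `compute_loop` is a placeholder that scores 1.0 per particle. All names (`N_SERVERS`, `server_loop`, `run`, and so on) are invented for illustration.

```python
import queue
import threading

N_SERVERS = 2   # hypothetical number of tally servers
N_COMPUTE = 3   # hypothetical number of compute processes
N_BINS = 8      # hypothetical number of global tally bins
STOP = None     # sentinel telling a server to leave its receive loop


def owner(bin_index):
    """Map a global tally bin to the server storing it (assumed round-robin)."""
    return bin_index % N_SERVERS


def server_loop(inbox, local_tally):
    """Receive (bin, score) messages in a loop and accumulate them until STOP."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        bin_index, score = msg
        local_tally[bin_index] = local_tally.get(bin_index, 0.0) + score


def compute_loop(rank, n_particles, inboxes):
    """Stand-in for particle tracking: each particle scores 1.0 to one bin."""
    for p in range(n_particles):
        bin_index = (rank + p) % N_BINS  # placeholder for real transport physics
        inboxes[owner(bin_index)].put((bin_index, 1.0))


def run(n_particles=100):
    """Run the toy simulation and return each server's local tally dictionary."""
    inboxes = [queue.Queue() for _ in range(N_SERVERS)]
    tallies = [{} for _ in range(N_SERVERS)]
    servers = [
        threading.Thread(target=server_loop, args=(inboxes[s], tallies[s]))
        for s in range(N_SERVERS)
    ]
    workers = [
        threading.Thread(target=compute_loop, args=(r, n_particles, inboxes))
        for r in range(N_COMPUTE)
    ]
    for t in servers + workers:
        t.start()
    for t in workers:   # wait until every compute process has sent its scores
        t.join()
    for q in inboxes:   # then tell each server to stop receiving
        q.put(STOP)
    for t in servers:
        t.join()
    return tallies
```

Calling `run(100)` returns one dictionary per server. Each bin lives on exactly one server, so no tally memory is replicated across the compute side, which is the property the abstract highlights.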

%B Journal of Computational Physics
%V 252
%P 20-36
%8 11/2013
%G eng
%1 http://www.mcs.anl.gov/papers/P4044-0313.pdf