Optimizations on GPU
Abstract: Accelerator-based data-parallel scientific computing has led to new application- and architecture-specific optimizations. One such proposed optimization is "kernel coalescing," which optimizes concurrent kernel execution on the NVIDIA Fermi GPU. The GPU consists of streaming multiprocessors, each of which has a fixed amount of resources in terms of thread blocks, number of threads, and registers. Each kernel is launched as a grid, and each grid is executed in units of thread blocks. If one grid occupies all the resources, another grid cannot execute, leading to serialization of kernel execution. Kernel coalescing is proposed to prevent this serialization due to lack of resources. Thread-level coalescing partitions the resources among kernels by modifying their grid and thread-block dimensions to enable concurrent execution. Multi-clock-cycle coalescing allows sharing of the resources across kernels. Warp-interleaving-based coalescing slices the resources to enable concurrent kernel execution. Further, GPUs are inefficient at indirect memory accesses, yet most applications process data in a sparse matrix format that relies on indirect addressing. A new format named bit-level shift indexing (BLSI) is proposed to reduce the memory footprint and the number of memory accesses per FLOP on the GPU. A framework named Sparse Matrix AnalyzeR Tool (SMART) is proposed to predict an optimal sparse matrix format for sparse matrix-vector multiplication (SpMV) computations on the GPU, based on statistics of the input sparse matrix and the given architecture.
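To make the indirect-addressing cost concrete, the following is a minimal CPU sketch of SpMV in the standard CSR format (not the proposed BLSI format, which is not detailed here). The gather `x[col_idx[j]]` is the indirect access that formats such as BLSI aim to reduce; all names in the sketch are illustrative.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    CSR uses three arrays: row_ptr (row boundaries), col_idx
    (column index of each nonzero), and vals (the nonzeros).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access: the address into x is itself loaded
            # from memory (col_idx[j]) -- two memory reads per FLOP pair.
            acc += vals[j] * x[col_idx[j]]
        y[i] = acc
    return y

# 3x3 example matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

On a GPU, these data-dependent gathers into `x` defeat coalesced memory access, which is why format choice (the problem SMART addresses) matters for SpMV performance.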
Bio: Neelima Bayyapu is head of the Department of Information Science and Engineering at NMAM Institute of Technology, India. She completed her Ph.D. at the National Institute of Technology Karnataka, India, in the area of high-performance computing.