A Machine Learning-based Approach to Improve the Performance of Transactional Memory Applications and Runtimes on Multicores
Multicore processors are now the mainstream way to deliver higher performance to parallel applications. To develop efficient parallel applications for these platforms, developers must consider several aspects, ranging from the architecture level to the application level. In this context, Transactional Memory (TM) has emerged as a programmer-friendly alternative to traditional lock-based concurrency.
It allows programmers to write parallel code as transactions, which are guaranteed to execute atomically and in isolation regardless of potential data races. At runtime, transactions are executed speculatively, and conflicts are resolved by re-executing the conflicting transactions. Although TM aims to simplify concurrent programming, the best performance can only be obtained if the underlying runtime system matches the characteristics of the application and the platform.
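To make the speculative execution model concrete, here is a minimal Python sketch of optimistic transactions with commit-time validation: each read records the version it saw, writes are buffered privately, and a transaction whose reads are stale at commit time is aborted and re-executed. The names `TVar`, `Transaction`, `atomic` and `run_counter_demo` are hypothetical, and this toy (which serializes all commits behind one global lock) is only an illustration of the idea, not how the STM systems studied in this work are implemented.

```python
import threading

class TVar:
    """A transactional variable: a value plus a version bumped on every commit."""
    def __init__(self, value):
        self.value = value
        self.version = 0

class Conflict(Exception):
    """Raised when commit-time validation finds a read that is no longer current."""

# One global lock serializes commits in this toy sketch (real STMs use finer-grained schemes).
_commit_lock = threading.Lock()

class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version observed at first read
        self.writes = {}  # TVar -> buffered value, invisible to other threads until commit

    def read(self, tv):
        if tv in self.writes:            # read-your-own-writes
            return self.writes[tv]
        if tv not in self.reads:
            self.reads[tv] = tv.version  # record the version before reading the value
        return tv.value

    def write(self, tv, value):
        self.writes[tv] = value

    def commit(self):
        with _commit_lock:
            # Validate: every variable read must still be at the version we observed.
            for tv, seen in self.reads.items():
                if tv.version != seen:
                    raise Conflict()
            # Publish buffered writes: value first, then version.
            for tv, value in self.writes.items():
                tv.value = value
                tv.version += 1

def atomic(body):
    """Run body(tx) speculatively, re-executing it until it commits; return (result, aborts)."""
    aborts = 0
    while True:
        tx = Transaction()
        result = body(tx)
        try:
            tx.commit()
            return result, aborts
        except Conflict:
            aborts += 1  # conflict detected: drop the buffered writes and retry

def run_counter_demo(n_threads, increments_per_thread):
    """Several threads increment a shared counter transactionally; returns the final value."""
    counter = TVar(0)

    def increment(tx):
        tx.write(counter, tx.read(counter) + 1)

    def worker():
        for _ in range(increments_per_thread):
            atomic(increment)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Because a conflicting increment is re-executed rather than lost, `run_counter_demo(4, 1000)` returns 4000; the abort count returned by `atomic` is exactly the kind of re-execution cost that makes TM performance hard to reason about.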
The contributions of this work concern the analysis and improvement of the performance of TM applications based on Software Transactional Memory (STM) on multicore platforms. First, we show that the TM model makes the performance analysis of TM applications a daunting task. To tackle this problem, we propose a generic and portable tracing mechanism that gathers specific TM events, allowing us to better understand the performance obtained. The traced data can be used, for instance, to discover whether the TM application presents points of contention or whether the contention is spread out over the whole execution. Our tracing mechanism can be used with different TM applications and STM systems without any changes to their original source code.
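As an illustration of how traced TM events can distinguish localized contention from contention spread over the whole execution, the sketch below computes the abort ratio per time window and flags windows above a threshold. The event record `TMEvent`, the window size and both thresholds are assumptions chosen for illustration; they are not the trace format or the heuristics of the tracing mechanism described here.

```python
from collections import Counter, namedtuple

# Hypothetical trace record; the actual event format of the tracing mechanism is not given here.
TMEvent = namedtuple("TMEvent", ["time", "thread", "kind"])  # kind: "commit" or "abort"

def abort_ratio_per_window(events, window):
    """Map each time-window index to aborts / (aborts + commits) observed in that window."""
    commits, aborts = Counter(), Counter()
    for ev in events:
        w = int(ev.time // window)
        if ev.kind == "commit":
            commits[w] += 1
        elif ev.kind == "abort":
            aborts[w] += 1
    windows = sorted(set(commits) | set(aborts))  # every listed window has at least one event
    return {w: aborts[w] / (aborts[w] + commits[w]) for w in windows}

def classify_contention(ratios, threshold=0.5, spread_fraction=0.5):
    """Return ("none" | "localized" | "spread", hot windows) from per-window abort ratios."""
    hot = [w for w, r in ratios.items() if r >= threshold]
    if not hot:
        return "none", hot
    if len(hot) / len(ratios) >= spread_fraction:
        return "spread", hot
    return "localized", hot
```

With a trace whose aborts cluster in a single window, for example three aborts and one commit between t=1.0 and t=2.0 and only commits elsewhere, `classify_contention` reports `("localized", [1])`, pointing at a contention point rather than at contention spread across the run.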
Second, we address the performance improvement of TM applications on multicores. We show that thread mapping is very important for TM applications and can considerably improve their overall performance. To deal with the large diversity of TM applications, STM systems and multicore platforms, we propose an approach based on Machine Learning to automatically predict suitable thread mapping strategies for TM applications. During a prior learning phase, we profile several TM applications running on different STM systems to construct a predictor.
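The learn-then-predict workflow can be sketched with a deliberately simple learner. The example below uses a 1-nearest-neighbour classifier as a stand-in, since the abstract does not say which learning algorithm the predictor uses; the features (`abort_ratio`, `tx_time_ratio`, `llc_miss_ratio`), the strategy names and the training values are all hypothetical and made up for illustration, not measurements from this work.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    # Hypothetical features gathered while profiling a TM application.
    abort_ratio: float     # aborts / (aborts + commits)
    tx_time_ratio: float   # fraction of execution time spent inside transactions
    llc_miss_ratio: float  # last-level cache miss ratio

    def vector(self):
        return (self.abort_ratio, self.tx_time_ratio, self.llc_miss_ratio)

class NearestNeighbourPredictor:
    """1-NN stand-in for a trained predictor: returns the strategy of the closest profiled run."""
    def __init__(self):
        self.samples = []  # list of (Profile, best thread mapping strategy)

    def learn(self, profile, best_strategy):
        # Learning phase: remember which mapping performed best for this profile.
        self.samples.append((profile, best_strategy))

    def predict(self, profile):
        if not self.samples:
            raise ValueError("predictor has not been trained")
        _, strategy = min(
            self.samples,
            key=lambda s: math.dist(s[0].vector(), profile.vector()),
        )
        return strategy

def build_toy_predictor():
    # Made-up training points for illustration only.
    p = NearestNeighbourPredictor()
    p.learn(Profile(0.05, 0.20, 0.30), "scatter")
    p.learn(Profile(0.60, 0.90, 0.05), "compact")
    p.learn(Profile(0.30, 0.50, 0.15), "round_robin")
    return p
```

Once trained, a new application is profiled once and `predict` returns a strategy without exploring every mapping; for instance, a profile close to the high-abort training point is assigned `"compact"`.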
We then use the predictor to perform static or dynamic thread mapping in a state-of-the-art STM system, transparently to users. Finally, we perform an experimental evaluation and show that the static approach is fairly accurate and can improve the performance of a set of TM applications by up to 18%. Concerning the dynamic approach, we show that it can detect phase changes during the execution of TM applications composed of diverse workloads and predict a thread mapping adapted to each phase. On these applications, we achieve performance improvements of up to 31% compared to the best static strategy.
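A dynamic approach has to notice when the workload enters a new phase and re-run the prediction. A minimal sketch, assuming per-window abort ratios are already available (for example from a trace) and using a hypothetical threshold rule in place of a trained predictor: a new prediction is made only when the abort ratio moves far enough from the value recorded at the last phase change, and the mapping is switched only when the predicted strategy differs from the current one.

```python
def hypothetical_predict(abort_ratio):
    # Placeholder rule standing in for a trained predictor (not the method of this work).
    return "compact" if abort_ratio >= 0.4 else "scatter"

def dynamic_mapping(window_abort_ratios, predict, change_threshold=0.2):
    """Return (window index, strategy) pairs at which the thread mapping would be switched."""
    decisions = []
    phase_ratio = None  # abort ratio observed when the current phase was detected
    current = None      # mapping strategy currently applied
    for i, ratio in enumerate(window_abort_ratios):
        if phase_ratio is None or abs(ratio - phase_ratio) >= change_threshold:
            phase_ratio = ratio        # phase change detected: re-predict
            strategy = predict(ratio)
            if strategy != current:    # remap only when the prediction changes
                decisions.append((i, strategy))
                current = strategy
    return decisions
```

For a run whose abort ratio is low, then high, then low again, e.g. `[0.05, 0.07, 0.06, 0.55, 0.6, 0.58, 0.1]`, the sketch switches mappings at windows 0, 3 and 6. The threshold trades reactivity against the cost of migrating threads on small fluctuations.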
Jean-François Méhaut has been a Professor of Computer Science at the Université Joseph Fourier (UJF) since 2003. He currently holds a research position at CEA, on secondment from UJF. His current research covers embedded systems as well as all aspects of high-performance computing, including runtime systems, multithreading and memory management in NUMA multiprocessors, and multicore and hybrid programming.
He is participating in the European Mont-Blanc projects (http://www.montblanc-project.eu/), which aim at a scalable and power-efficient HPC platform based on low-power embedded technologies. Jean-François Méhaut is involved in scientific collaborations with several Brazilian universities (Porto Alegre, Sao Paulo, Belo Horizonte). He has supervised more than 20 PhD students, with strong interactions with European industry and companies such as Bull, ST Microelectronics and Kalray.