Optimizing Energy Efficiency in Large Language Model Inference Through Phase-Aware GPU Scheduling
Abstract: Current large language models (LLMs) achieve higher accuracy, few-shot learning capabilities, and even human-like emergent abilities on a wide range of language tasks. However, these models are expensive to train and can be memory-bound and compute-intensive during inference. Hardware vendors have developed numerous artificial intelligence (AI) accelerators, including both general-purpose GPUs and dataflow architectures, to accelerate LLM inference.
In this work, we use LLM inference benchmarks to evaluate the performance and energy efficiency of various AI accelerators, including Nvidia GPUs, AMD GPUs, Intel GPUs, SambaNova SN40, Cerebras CS-3, and Groq GroqRack. To provide an in-depth analysis, we run our experiments on these AI accelerator systems with models of different sizes (from 7B to 670B parameters) and architectures (dense and mixture of experts (MoE)), with different parallel configurations (data parallelism, tensor parallelism, and expert parallelism), precisions (bfloat16 and fp8), and batch sizes.
For the single-device experiments, we compare single-GPU performance on models ranging from 7B to 32B parameters. For the node-wise experiments, we compare the performance and energy efficiency of multi-GPU nodes on models ranging from 7B to 235B parameters, testing multiple combinations of tensor and data parallelism. We also compare expert and tensor parallelism for MoE models with different numbers of experts (from 8 to 128) and sizes (from 56B to 670B parameters).
Bio: Yiheng Tao is a Ph.D. student in Computer Science at the University of Illinois Chicago. He is currently serving as a research aide–technical at Argonne, hosted by Dr. Xingfu Wu.
Bio: Giacomo Brunetta is a Double MSc student at the University of Illinois Chicago and Politecnico di Milano.