TRIP: An Ultra-Low Latency, TeraOps/s Reconfigurable Inference Processor for Multi-Layer Perceptrons
Abstract: The multi-layer perceptron (MLP) is one of the most commonly deployed deep neural networks, representing 61% of the workload in Google data centers. MLPs have low arithmetic intensity, which results in memory bottlenecks. To the best of our knowledge, the Google Tensor Processing Unit (TPU) is the state-of-the-art implementation of MLP inference. The TPU addresses the memory bottleneck by processing multiple test vectors simultaneously, increasing the number of operations performed per weight byte loaded from DRAM. However, inference typically has hard response-time deadlines and so prioritizes latency over throughput. As a result, waiting to accumulate enough input vectors to achieve good performance is not feasible.
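To make the batching trade-off concrete, the following sketch (not from the TRIP or TPU papers; the dimensions and byte width are illustrative assumptions) computes the arithmetic intensity of a fully connected MLP layer, i.e., operations per weight byte loaded:

```python
# Hypothetical illustration: arithmetic intensity of one MLP layer.
# Each weight is reused once per input vector, so intensity grows
# linearly with batch size -- which is why the TPU batches inputs,
# and why latency-bound inference (batch ~1) stays memory bound.
def ops_per_weight_byte(in_dim, out_dim, batch, bytes_per_weight=1):
    ops = 2 * in_dim * out_dim * batch        # multiply + add per weight, per vector
    weight_bytes = in_dim * out_dim * bytes_per_weight
    return ops / weight_bytes

print(ops_per_weight_byte(1024, 1024, 1))     # batch of 1: only 2 ops per byte
print(ops_per_weight_byte(1024, 1024, 64))    # batching raises intensity 64x
```

Keeping all weights on chip, as TRIP does, removes the DRAM traffic from this ratio entirely, so performance no longer depends on batch size.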
In this work, we designed a TeraOps/s Reconfigurable Inference Processor (TRIP) for MLPs on FPGAs that alleviates the memory bottleneck by storing all weights on chip, making performance invariant to input batch size. For networks whose weights cannot fit directly on chip, deep compression reduces the memory footprint with no loss of accuracy. TRIP can be deployed as a stand-alone device directly connected to data acquisition devices, as a co-processor where input vectors are supplied through OpenCL wrappers from the host machine, or in a cluster configuration where on-chip transceivers communicate between FPGAs. By comparison, the TPU can only be used in a co-processor configuration. Our design achieved 3.0 TeraOps/s and 1.49 TeraOps/s on an Altera Arria 10 in the stand-alone/cluster and co-processor configurations, respectively, making it the fastest real-time inference processor for MLPs.
Bios: Chen Yang is a Ph.D. student in electrical and computer engineering at Boston University. His research focuses on FPGA-based reconfigurable architecture design for a variety of applications. He received his M.S. degree in computer engineering and B.E. degree in electrical engineering from the University of Florida and Wuhan University, respectively.
Ahmed Sanaullah is pursuing his Ph.D. in computer engineering at Boston University. His area of specialization is scalable and reconfigurable high-performance computing in clouds and clusters. He received his master's degree in electrical and electronic engineering from the University of Nottingham.