SmartFuse: reconfigurable smart switches to accelerate fused collectives in HPC applications
Files
Published version
Date
2024-06-03
Authors
Guo, Anqi
Herbordt, Martin Christopher
Version
OA Version
Citation
Pouya Haghi, Cheng Tan, Anqi Guo, Chunshu Wu, Dongfang Liu, Ang Li, Anthony Skjellum, Tong Geng, and Martin Herbordt. 2024. SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS '24). Association for Computing Machinery, New York, NY, USA, 413–425. https://doi.org/10.1145/3650200.3656616
Abstract
Communication switches have sometimes been augmented to process collectives, e.g., in the IBM BlueGene and Mellanox SHArP switches. In this work, we find a substantial acceleration opportunity in further augmenting switches to accelerate more complex functions that combine communication with computation. We consider three types of such functions. The first is fully-fused collectives, built by fusing multiple existing collectives, such as Allreduce with Alltoall. The second is semi-fused collectives, built by combining a collective with another computation. The third is higher-order collectives, built by combining multiple computations and communications, such as performing a parallel matrix-matrix multiply (PGEMM).
In this work, we propose a framework called SmartFuse to accelerate fused collective functions. The core of SmartFuse is a reconfigurable smart switch that supports these operations. Semi- and fully-fused collectives are implemented with a CGRA-like architecture, while higher-order collectives are implemented with a more specialized computational unit that can also schedule communication. Supporting our framework is software that evaluates and translates relevant parts of the input program, compiles them into a control data flow graph, and then maps this graph to the switch hardware. The proposed framework, once deployed, has strong potential to accelerate existing HPC applications transparently through encapsulation within an MPI implementation. Experimental results show that this approach improves the performance of the PGEMM kernel, miniFE, and AMG by, on average, 94%, 15%, and 13%, respectively.
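To illustrate the fully-fused case the abstract mentions (Allreduce fused with Alltoall), the sketch below is a minimal single-process simulation, not the SmartFuse implementation or an MPI API: it models each rank's buffer as a Python list and shows one plausible reading of why fusion helps. When an Alltoall directly follows an Allreduce, every rank holds the same reduced buffer before the exchange, so the fused result at rank j is simply P copies of chunk j of the reduced buffer, which is essentially a reduce-scatter plus local replication and moves less data than running the two collectives separately.

```python
def allreduce(buffers):
    """Element-wise sum across all simulated ranks; every rank gets the full result."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def alltoall(buffers):
    """Rank i's chunk j goes to rank j (each buffer split into P equal chunks)."""
    p = len(buffers)
    k = len(buffers[0]) // p
    chunks = [[buf[j * k:(j + 1) * k] for j in range(p)] for buf in buffers]
    return [[x for i in range(p) for x in chunks[i][j]] for j in range(p)]

def fused_allreduce_alltoall(buffers):
    """Fused form: because the Alltoall input is identical on every rank,
    rank j's output is P copies of chunk j of the reduced buffer,
    i.e., a reduce-scatter followed by local replication."""
    p = len(buffers)
    k = len(buffers[0]) // p
    total = [sum(vals) for vals in zip(*buffers)]
    return [total[j * k:(j + 1) * k] * p for j in range(p)]

# 4 simulated ranks, 8 elements each (2-element chunks).
bufs = [[r * 10 + c for c in range(8)] for r in range(4)]
assert alltoall(allreduce(bufs)) == fused_allreduce_alltoall(bufs)
```

The equivalence check at the bottom verifies that the fused path produces the same per-rank result as the unfused Allreduce-then-Alltoall sequence; the fusion opportunity comes from skipping the full-buffer broadcast that the standalone Allreduce performs.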
Description
License
© 2024 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License. This article has been published under a Read & Publish Transformative Open Access (OA) Agreement with ACM.