SmartFuse: reconfigurable smart switches to accelerate fused collectives in HPC applications
Files
Published version
Date
2024-06-03
Authors
Guo, Anqi
Herbordt, Martin Christopher
Version
OA Version
Citation
Pouya Haghi, Cheng Tan, Anqi Guo, Chunshu Wu, Dongfang Liu, Ang Li, Anthony Skjellum, Tong Geng, and Martin Herbordt. 2024. SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS '24). Association for Computing Machinery, New York, NY, USA, 413–425. https://doi.org/10.1145/3650200.3656616
Abstract
Communication switches have sometimes been augmented to process collectives, e.g., in the IBM BlueGene and Mellanox SHArP switches. In this work, we find a substantial acceleration opportunity in further augmenting switches to accelerate more complex functions that combine communication with computation. We consider three types of such functions. The first is fully-fused collectives, built by fusing multiple existing collectives, such as Allreduce with Alltoall. The second is semi-fused collectives, built by combining a collective with another computation. The third is higher-order collectives, built by combining multiple computations and communications, such as performing a parallel matrix-matrix multiply (PGEMM).
In this work, we propose a framework called SmartFuse to accelerate fused collective functions. The core of SmartFuse is a reconfigurable smart switch that supports these operations. Semi- and fully-fused collectives are implemented with a CGRA-like architecture, while higher-order collectives are implemented with a more specialized computational unit that can also schedule communication. Supporting our framework is software that evaluates and translates relevant parts of the input program, compiles them into a control data flow graph, and then maps this graph to the switch hardware. The proposed framework, once deployed, has strong potential to accelerate existing HPC applications transparently through encapsulation within an MPI implementation. Experimental results show that this approach improves the performance of the PGEMM kernel, miniFE, and AMG by, on average, 94%, 15%, and 13%, respectively.
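To illustrate the fully-fused case the abstract mentions (Allreduce fused with Alltoall), the sketch below is a minimal single-process simulation, not the SmartFuse implementation or an MPI API: it models each rank's buffer as a Python list and shows one plausible reading of why fusion helps. When an Alltoall directly follows an Allreduce, every rank holds the same reduced buffer before the exchange, so the fused result at rank j is simply P copies of chunk j of the reduced buffer, which is essentially a reduce-scatter plus local replication and moves less data than running the two collectives separately.

```python
def allreduce(buffers):
    """Element-wise sum across all simulated ranks; every rank gets the full result."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]

def alltoall(buffers):
    """Rank i's chunk j goes to rank j (each buffer split into P equal chunks)."""
    p = len(buffers)
    k = len(buffers[0]) // p
    chunks = [[buf[j * k:(j + 1) * k] for j in range(p)] for buf in buffers]
    return [[x for i in range(p) for x in chunks[i][j]] for j in range(p)]

def fused_allreduce_alltoall(buffers):
    """Fused form: because the Alltoall input is identical on every rank,
    rank j's output is P copies of chunk j of the reduced buffer,
    i.e., a reduce-scatter followed by local replication."""
    p = len(buffers)
    k = len(buffers[0]) // p
    total = [sum(vals) for vals in zip(*buffers)]
    return [total[j * k:(j + 1) * k] * p for j in range(p)]

# 4 simulated ranks, 8 elements each (2-element chunks).
bufs = [[r * 10 + c for c in range(8)] for r in range(4)]
assert alltoall(allreduce(bufs)) == fused_allreduce_alltoall(bufs)
```

The equivalence check at the bottom verifies that the fused path produces the same per-rank result as the unfused Allreduce-then-Alltoall sequence; the fusion opportunity comes from skipping the full-buffer broadcast that the standalone Allreduce performs.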
Description
License
© 2024 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License. This article has been published under a Read & Publish Transformative Open Access (OA) Agreement with ACM.