The AC2 Lab studies sparse, irregular, and communication-sensitive AI/HPC systems. We are interested in workloads whose efficiency is shaped not only by raw computation but also by irregular memory access, dynamic data dependences, data movement, synchronization, and system-level communication. Our long-term goal is to establish systems and runtime principles that allow such workloads to run continuously on accelerators. This direction is motivated by future AI for Science and high-fidelity digital twins, where models must be updated and re-executed as observation, learning, and intervention proceed.
Wafer-Scale Systems for GNN Training
One of our current projects studies GNN training on wafer-scale systems, using the Cerebras CS-3 as a first case study in sparse, irregular, and communication-sensitive AI workloads. GNN training on GPU systems often suffers from irregular memory access and distributed-memory overhead, which makes efficiency sensitive to graph structure, partitioning, and synchronization. We investigate when a wafer-scale system, with a simpler execution model and a large, fast memory system, can improve training efficiency compared with conventional multi-GPU systems. Our immediate goal is not to claim a universal replacement for GPUs, but to clarify where wafer-scale systems become beneficial for graph learning workloads.
Communication Compression and Offload for Multi-GPU LLM Training
A second project focuses on multi-GPU LLM training, where communication is often a central scaling bottleneck. We investigate whether compressing communication data on NVIDIA BlueField DPUs can reduce collective-communication costs and improve end-to-end training throughput. This work studies co-design across GPUs, DPUs, and the communication software stack to determine when data should stay on the main training path and when it should be compressed and offloaded. Although LLM training is itself a dense workload, we use it as a practical testbed for communication-substrate research that will also matter in future sparse and irregular accelerator workloads.
Emerging Direction: Runtime Support for Dynamic Sparse and Irregular Workloads
Looking ahead, we are interested in a broader systems question: how should sparse, irregular, and communication-sensitive workloads be mapped across multi-GPU clusters, GPU+DPU systems, and wafer-scale machines? We aim to develop runtime principles that convert dynamic dependence patterns into information for data placement, communication, and execution-order control. This direction connects our current GNN/wafer-scale and LLM/DPU projects into a unified agenda on accelerator systems for future AI for Science and digital twins.
Long-Term Perspective
We view today’s GNN and large-scale training workloads as practical testbeds for a broader problem: sustaining sparse and irregular computation on accelerators. By studying irregular memory behavior, communication bottlenecks, and heterogeneous execution on real systems, we aim to establish principles that remain useful as future platforms grow larger, more specialized, and more tightly coupled to real-world observation, learning, and intervention.
Representative publications and artifacts will be listed on the Publications page as they become available.