Unclogging the AI Data Pipeline—Three Critical Bottlenecks Constraining AI Clusters
It’s not just GPUs, but memory, storage, and networking capabilities that determine AI infrastructure performance and efficiency.
THE OMNIPRESENT BATTLE WITH BOTTLENECKS
Let’s face it: Scaling data centers and computing infrastructure is a daunting task, fraught with pitfalls that can impair efficiency, utilization, user experience, reliability, security, and serviceability. Scaling AI infrastructure takes these challenges to a new level due to the massive computing capabilities of today’s GPUs, along with the tremendous capital costs and power required to deploy them.
Highlighting the potential for lost efficiency is Huawei’s April 2025 unveiling of its CloudMatrix 384—a massive AI cluster boasting 384 Ascend 910C processors and delivering 300 PFLOPs of BF16 compute. While the system surpasses NVIDIA’s GB200 NVL72 in raw performance, it consumes approximately 559 kW of power, making it 2.3 times less power-efficient than its competitor. Looking back at the Open Compute Project Global Summit in 2023, Meta’s keynote highlighted the infrastructure characteristics most critical to different types of AI workloads: compute capacity, memory capacity, memory bandwidth, network latency, and network bandwidth. Fast-forward to 2025, and not only have GPU advances in compute capacity outpaced advances in the memory and networking domains, but storage has emerged as a top concern in AI infrastructure deployments.
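To make the power-efficiency comparison concrete, the quick calculation below works out performance per watt for both systems. The CloudMatrix figures come from the paragraph above; the GB200 NVL72 figures (roughly 180 PFLOPs of dense BF16 compute at about 145 kW) are assumptions drawn from public reporting rather than numbers stated here, so treat the result as a rough sketch.

```python
# Back-of-the-envelope performance-per-watt comparison. The GB200 NVL72
# figures are assumptions from public reporting, not authoritative specs.

cloudmatrix_pflops = 300      # dense BF16 PFLOPs, as cited above
cloudmatrix_kw = 559          # approximate system power, as cited above

gb200_nvl72_pflops = 180      # assumed dense BF16 PFLOPs for the GB200 NVL72
gb200_nvl72_kw = 145          # assumed rack power draw

cm_eff = cloudmatrix_pflops / cloudmatrix_kw    # ~0.54 PFLOPs per kW
nv_eff = gb200_nvl72_pflops / gb200_nvl72_kw    # ~1.24 PFLOPs per kW

print(f"CloudMatrix 384: {cm_eff:.2f} PFLOPs/kW")
print(f"GB200 NVL72:     {nv_eff:.2f} PFLOPs/kW")
print(f"Efficiency gap:  {nv_eff / cm_eff:.1f}x")  # ~2.3x, matching the claim above
```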
THE MEMORY WALL: AI’S MOST PERSISTENT BOTTLENECK
As AI models expand in complexity and size, memory bandwidth and capacity have emerged as critical chokepoints.
The “memory wall” refers to the growing disparity between CPU/GPU processing speeds and the ability of memory subsystems to supply data fast enough. From 2003 to 2023, peak compute performance improved by roughly 60,000x, while DRAM bandwidth improved by only about 100x, and memory latency barely changed at all.
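As a rough sketch of why that disparity compounds, the implied annual growth rates can be derived directly from the 20-year figures above; the snippet below is illustrative arithmetic only.

```python
# Compound annual growth implied by the 2003-2023 figures cited above.

years = 20
compute_growth = 60_000    # peak compute improvement over the period
dram_bw_growth = 100       # DRAM bandwidth improvement over the period

compute_cagr = compute_growth ** (1 / years) - 1   # ~73% per year
dram_cagr = dram_bw_growth ** (1 / years) - 1      # ~26% per year

print(f"Compute:        ~{compute_cagr:.0%} per year")
print(f"DRAM bandwidth: ~{dram_cagr:.0%} per year")
# The two curves diverge a little more every year; that widening gap is the memory wall.
```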
This gap becomes more problematic with deep learning models such as GPT-4 and Llama 2, which process massive datasets in real time. In traditional system architectures, memory is tightly coupled to processors: each processor has a limited number of memory channels, constraining the memory capacity and bandwidth available per processor. As the number of cores per processor grows, the memory capacity and bandwidth per core can decline, creating a performance bottleneck. Expanding memory has required adding more CPU sockets or entire servers—a scaling method that is inefficient in cost, power consumption, and physical space. This architectural rigidity is a key contributor to the underutilization of accelerators: GPUs sit idle while waiting for data to arrive.
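A simple back-of-the-envelope model shows how a fixed number of memory channels throttles per-core bandwidth as core counts rise. The channel count, per-channel bandwidth, and core counts below are hypothetical round numbers chosen only to illustrate the trend, not the specifications of any particular processor.

```python
# Illustrative only: per-core memory bandwidth when channel count stays fixed
# while core counts grow. All figures are hypothetical round numbers.

channels = 8                   # memory channels per socket (assumed)
bw_per_channel_gbs = 40        # GB/s per channel (assumed, DDR5-class)
socket_bw = channels * bw_per_channel_gbs   # 320 GB/s per socket

for cores in (32, 64, 96, 128):
    per_core = socket_bw / cores
    print(f"{cores:3d} cores -> {per_core:5.1f} GB/s of memory bandwidth per core")

# Doubling the core count without adding channels halves the bandwidth each
# core can draw on, which is why memory-bound workloads leave compute idle.
```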
NETWORKING AND STORAGE STRAIN: WHERE AI FALTERS
Beyond memory, another critical bottleneck lies in storage and network performance. AI workloads, especially during training, are highly data-intensive—involving not only heavy compute but also repeated read/write operations, often in random patterns. Whether a job is loading training datasets, accessing embeddings, or writing checkpoints, fast and consistent storage access is key.
Unfortunately, traditional network-attached storage and HDD-based systems struggle under this demand. Latency spikes can ripple through the entire compute stack. For instance, during checkpointing (saving the state of a model mid-training), all GPUs attempt to flush large amounts of data simultaneously. If the underlying storage can’t absorb the I/O (input/output) quickly, training stalls, wasting GPU cycles and driving up costs.
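A rough estimate shows why checkpointing stresses storage so badly. The model size, per-parameter byte count, and storage write bandwidths below are hypothetical assumptions rather than measurements from any specific deployment, but they illustrate how checkpoint stalls scale with storage throughput.

```python
# Rough estimate of checkpoint flush time. All inputs are hypothetical.

params = 70e9                        # assumed 70B-parameter model
bytes_per_param = 2 + 4 + 4 + 4      # bf16 weights + fp32 master weights
                                     # + two fp32 Adam moments (assumed layout)
checkpoint_gb = params * bytes_per_param / 1e9   # ~980 GB per full checkpoint

for tier, write_gbs in (("HDD-based NAS", 5), ("NVMe flash tier", 100)):
    stall_s = checkpoint_gb / write_gbs
    print(f"{tier:16s}: ~{stall_s:6.0f} s (~{stall_s / 60:.1f} min) of stalled GPUs per checkpoint")
```

At an aggregate write bandwidth of a few GB/s, every checkpoint can idle the entire cluster for minutes; a flash tier that absorbs the burst two orders of magnitude faster shrinks the stall to seconds.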