GPU-Accelerated Storage Fundamentals #1 — Why AI Needs a New Kind of Storage: Extremely High-IOPS Storage
Modern AI workloads have fundamentally changed how storage must deliver data. Instead of relying on large, sequential transfers, applications such as vector databases, recommendation systems, and generative AI inference depend on millions of small, random I/O operations—typically at granularities between 512B and 8KB. To keep GPUs fully utilized, storage must sustain extremely high IOPS and deep concurrency, delivering fine-grained data at microsecond latency rather than simply maximizing sequential bandwidth.
This shift coincides with the rapid rise of GPU-centric computing. Since the public release of ChatGPT in late 2022, GPU-driven accelerated computing has rapidly become the dominant model for modern data centers, surpassing traditional CPU-centric processing in 2024 and projected to account for nearly 90% of infrastructure by 2030. GPUs can process data at extraordinary rates, but their performance depends entirely on the system’s ability to feed them continuously.
However, most storage architectures were designed for a different era. Traditional systems optimized for capacity and throughput, with CPUs orchestrating data movement between storage and compute. This CPU-mediated model cannot sustain the massive parallelism and fine-grained access patterns of AI workloads. As a result, the primary bottleneck in modern AI infrastructure is no longer compute capability—it is storage’s ability to deliver data at the scale and concurrency GPUs require.
In this article, we’ll explore in detail why AI requires a new kind of storage architecture and what has fundamentally changed in the relationship between compute and data.
- The rise of the data-hungry GPU
- AI workloads are inherently parallel and data-intensive
- The real bottleneck: IOPS, not bandwidth
- Traditional storage was built for capacity, not concurrency
- FAQ
1. The rise of the data-hungry GPU
Modern GPUs operate on a fundamentally different execution model from traditional CPUs. CPUs are optimized for sequential control and centralized coordination, but GPUs are designed for massive parallelism—executing thousands of threads simultaneously across highly optimized compute units. In today’s AI-driven data centers, GPUs are no longer supplemental accelerators; they are the primary engines of computation.
This architectural shift introduces a new reality: GPUs are inherently data-hungry by design. With terabytes-per-second HBM bandwidth and high-speed PCIe Gen6 interconnects delivering over 100 GB/s per link, GPUs can process data at extraordinary rates. But this compute capability only translates into performance if data arrives continuously and at scale. When thousands of GPU threads run in parallel, even small inefficiencies in the data path multiply rapidly.
As a result, modern AI systems are increasingly feed-bound. When storage cannot sustain the required concurrency, GPUs stall while waiting for data. The bottleneck in AI infrastructure has shifted from computation to data delivery, redefining how the entire storage stack must be designed.
2. AI workloads are inherently parallel and data-intensive
The shift toward generative AI has also changed the nature of data access patterns.
Traditional enterprise workloads often relied on large sequential transfers and moderate I/O concurrency. AI workloads, particularly in inference and predictive systems, behave very differently. Vector databases, recommendation systems, retrieval-augmented generation (RAG), and graph neural networks (GNN) depend heavily on fine-grained data retrieval.
Typical I/O sizes for AI workloads range from 512B to 8KB. Instead of streaming large files, AI systems perform millions of small, random lookups—fetching embeddings, traversing graph edges, or retrieving sparse feature vectors.
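As a rough illustration (not from the article), this access pattern can be modeled as small, aligned reads at random offsets into a large file; the function, file layout, and sizes below are invented for the sketch:

```python
import os
import random

IO_SIZE = 4096        # one lookup in the 512 B - 8 KB range
NUM_LOOKUPS = 1000    # real workloads issue millions of these

def random_lookups(path: str, file_size: int) -> int:
    """Issue small, random, aligned reads and return total bytes read."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        for _ in range(NUM_LOOKUPS):
            # Pick an IO_SIZE-aligned offset anywhere in the file,
            # mimicking an embedding fetch or graph-edge lookup
            offset = random.randrange(file_size // IO_SIZE) * IO_SIZE
            total += len(os.pread(fd, IO_SIZE, offset))
    finally:
        os.close(fd)
    return total
```

Each call touches a tiny, unpredictable slice of the dataset, which is exactly the pattern that defeats sequential-bandwidth optimizations.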
At the same time, datasets are enormous. A production-scale vector database may require 40 TB of memory. Graph adjacency structures and embedding tables can exceed host memory capacity by multiples. In many cases, the working set is far larger than available DRAM.
This creates a dual pressure:
- Extremely high concurrency
- Extremely large datasets
Storage must therefore act not just as capacity, but as an active, high-performance data tier capable of sustaining fine-grained, parallel access.
3. The real bottleneck: IOPS, not bandwidth
It is tempting to estimate storage performance in terms of bandwidth. PCIe Gen6 offers more than 100 GB/s per x16 link. SSDs advertise impressive throughput numbers. Network fabrics scale to hundreds of gigabits per second.
But for small-block AI workloads, bandwidth is not the limiting factor.
The true constraint is the system’s ability to sustain extremely high IOPS under deep concurrency.
This relationship can be understood through a simple queueing principle, a form of Little’s Law:
Qd = T × L
Where:
- Qd is the required queue depth (in-flight requests)
- T is throughput (requests per second, i.e., IOPS)
- L is average latency (seconds)
For example, consider a PCIe x16 Gen6 link providing 104 GB/s of bandwidth.
T = 104 GB/s ÷ 512 B ≈ 203 M IOPS
For 4KB accesses:
T = 104 GB/s ÷ 4 KB ≈ 25 M IOPS
Assuming an average SSD latency of 100 microseconds (100 µs), the required number of in-flight requests at each granularity is:
- Qd (512B) = 203 M × 100 µs ≈ 20,300
- Qd (4KB) = 25 M × 100 µs ≈ 2,500
This means that more than 20,000 concurrent in-flight requests must be sustained to fully utilize the link at 512B granularity. Without maintaining this level of parallelism, theoretical bandwidth cannot be reached.
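The arithmetic above can be checked in a few lines of Python; the 104 GB/s link rate and 100 µs latency are the example figures, not fixed properties of any device:

```python
# Queue-depth estimate from Qd = T x L, where T is throughput in IOPS
# and L is average latency in seconds.
def required_queue_depth(bandwidth_gbps, io_size_bytes, latency_s):
    iops = bandwidth_gbps * 1e9 / io_size_bytes  # T: requests per second
    return iops * latency_s                       # Qd: in-flight requests

qd_512 = required_queue_depth(104, 512, 100e-6)   # ~20,000 at 512 B
qd_4k = required_queue_depth(104, 4096, 100e-6)   # ~2,500 at 4 KB
```

Halving the I/O size doubles the IOPS needed to fill the same link, which is why fine-grained workloads demand such deep queues.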
Traditional CPU-driven storage stacks struggle to sustain this level of parallelism. Even if the underlying NVMe devices support multiple queues, software orchestration and interrupt handling often limit effective concurrency. As a result, SSD hardware is underutilized—not because it lacks speed, but because the control plane cannot scale.
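To make the control-plane problem concrete, here is a deliberately simplified sketch that keeps many small reads in flight using a thread pool. A real GPU-storage data path would use NVMe queue pairs or io_uring rather than Python threads; the file name, sizes, and queue depth are illustrative only:

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

IO_SIZE = 512      # smallest granularity discussed above
IN_FLIGHT = 64     # toy queue depth; saturating a Gen6 link needs ~20,000

def run_parallel_reads(path: str, file_size: int, num_requests: int = 1024) -> int:
    """Keep up to IN_FLIGHT small reads outstanding; return total bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = [random.randrange(file_size // IO_SIZE) * IO_SIZE
                   for _ in range(num_requests)]
        with ThreadPoolExecutor(max_workers=IN_FLIGHT) as pool:
            # Each worker issues an independent pread, approximating
            # multiple in-flight NVMe commands
            chunks = pool.map(lambda off: os.pread(fd, IO_SIZE, off), offsets)
            return sum(len(c) for c in chunks)
    finally:
        os.close(fd)
```

Even this sketch hints at the problem: per-request software overhead (thread scheduling, system calls, interrupts) caps effective concurrency long before the SSD hardware does.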
In AI environments, performance is governed by IOPS scalability and queue depth—not raw sequential throughput.
4. Traditional storage was built for capacity, not concurrency
For decades, storage systems were optimized around capacity and throughput. Key performance indicators focused on terabytes per total cost of ownership (TB/TCO) and sequential bandwidth. This model worked well for archival systems, databases, and streaming workloads.
AI changes the metric.
When workloads are dominated by small, random I/O at extreme concurrency, the meaningful measure becomes IOPS per total cost of ownership (IOPS/TCO). Storage must scale in parallel with GPUs, maintaining deep in-flight queues and microsecond-level responsiveness.
Traditional architectures—where storage I/O flows through a centralized CPU control plane—were never designed for accelerator-scale parallelism. As long as I/O orchestration remains serialized, scaling compute will only amplify bottlenecks downstream.
AI does not simply require faster storage devices. It requires storage designed for concurrency first.
This structural shift sets the stage for a new generation of storage architectures—ones aligned with the execution model of modern GPUs. In the next article, we will explore what GPU-accelerated storage is, and how it restructures the data plane to meet these demands.
5. FAQ
1. Why can’t traditional storage architectures support modern AI workloads?
Traditional storage systems were designed for sequential throughput and moderate concurrency, optimized around CPU-mediated control paths. AI workloads generate millions of small, random I/O requests that require deep parallelism. CPU-centric orchestration becomes a bottleneck when concurrency scales to accelerator levels.
2. What does “GPU data-hungry” mean in practical terms?
GPUs execute thousands of threads simultaneously and require a continuous stream of data to maintain full utilization. If data is delayed, GPU cores stall and performance drops. As a result, AI systems are increasingly constrained by data delivery rather than compute capability.
3. Why are small-block I/O patterns important for generative AI?
Workloads such as vector search, recommendation systems, and graph neural networks rely on 512B–8KB random reads. These fine-grained lookups occur at extremely high concurrency levels. Performance therefore depends more on IOPS scalability than on large sequential bandwidth.
4. What does shifting from TB/TCO to IOPS/TCO mean?
Traditional storage metrics emphasized capacity efficiency measured in terabytes per cost. AI workloads prioritize how many I/O operations can be sustained per dollar invested. This shift reflects the move from capacity-driven design to concurrency-driven architecture.
