Artificial intelligence has entered a new phase. It’s no longer about experimenting with models; it’s about operationalizing AI at scale. But there’s a hard truth many organizations are now confronting:
Your AI system is only as fast as the data feeding it.
Even the most advanced GPUs can’t deliver results if they’re waiting on storage. That’s exactly the challenge Salesforce faced as it scaled large language model (LLM) training in production.
The Hidden Bottleneck in AI Infrastructure
Salesforce was training a Llama 3.1 (8B parameter) model using high-performance GPU clusters. On paper, the infrastructure was world-class. In practice, performance told a different story:
- GPUs were underutilized, at times dropping to 40% utilization
- Training jobs were slowed by high I/O latency
- Teams were constantly tuning storage just to keep systems running
This is the defining issue of modern AI infrastructure:
The data layer—not compute—is the primary constraint on AI performance.
A Different Approach: Data Intelligence at the Core
To solve this, Salesforce turned to Google Cloud Managed Lustre, a fully managed parallel file system powered by DDN EXAScaler. With it, organizations gain:
- Proven high-performance infrastructure for AI and HPC
- Fully managed deployment with enterprise-grade scalability
- Optimized data pipelines that maximize GPU efficiency
This combination enables teams to:
- Train faster
- Reduce cost per token
- Move from pilot to production without re-architecting
This wasn’t just a storage upgrade; it was a shift in architecture. Instead of treating storage as a passive system, Salesforce deployed a data intelligence platform engineered to feed GPUs at line rate. Integrated with their Google Vertex training cluster, the solution delivered:
- Massively parallel throughput at scale
- Seamless deployment with minimal operational overhead
- A fully managed environment optimized for AI workloads
Once the data bottleneck was removed, the results were immediate and measurable:
- 75% reduction in I/O latency
- 1.5× faster model training
- 70% increase in GPU utilization
- 42% reduction in total training costs
These aren’t incremental gains; they redefine what’s possible.
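A simple serial model makes the relationship between these numbers intuitive. This is an illustrative sketch only, not Salesforce's measured methodology: real training pipelines overlap I/O with compute, so actual gains differ from what this toy model predicts.

```python
# Toy model: how cutting I/O latency lifts GPU utilization and training
# speed. Assumes each step waits on I/O, then computes (no overlap).

def step_time(t_compute: float, t_io: float) -> float:
    """Serial model: total step time is I/O wait plus compute."""
    return t_compute + t_io

def utilization(t_compute: float, t_io: float) -> float:
    """Fraction of each step the GPU spends computing."""
    return t_compute / step_time(t_compute, t_io)

# Baseline: 40% utilization implies I/O wait is 1.5x the compute time.
t_compute = 1.0
t_io = 1.5
base_util = utilization(t_compute, t_io)

# Apply a 75% reduction in I/O latency.
t_io_fast = t_io * (1 - 0.75)
new_util = utilization(t_compute, t_io_fast)
speedup = step_time(t_compute, t_io) / step_time(t_compute, t_io_fast)

print(f"utilization: {base_util:.0%} -> {new_util:.0%}, "
      f"speedup: {speedup:.2f}x")
```

Even this crude model shows why the reported metrics move together: shrinking I/O wait simultaneously raises utilization and shortens every training step, which is what drives cost per token down.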
DDN POV: AI Is a Data Problem First
At DDN, we’ve been consistent on one point:
AI infrastructure is not a compute problem. It’s a data problem.
The industry has spent the last several years optimizing GPUs. But as models scale and workloads become more complex (training, inference, RAG, agentic pipelines), a new reality has emerged:
- Data pipelines are more complex than model architectures
- Latency, not FLOPS, is the gating factor for performance
- GPU utilization, not GPU count, determines ROI
Traditional storage systems were never designed for AI. They struggle with:
- Massive concurrency
- Small file access patterns
- Continuous streaming for training and inference
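The small-file problem is easy to reproduce. The sketch below reads the same 4 MiB first as 1,024 tiny files and then as one large file; the file counts and sizes are illustrative, and on a local disk with a warm page cache the gap will be far smaller than on networked storage, where each open adds a metadata round trip.

```python
# Demonstration: per-file open/close overhead makes many small reads
# slower than one large sequential read of the same total bytes.
import os
import tempfile
import time

def write_files(dirpath: str, n_files: int, size: int) -> None:
    """Write n_files files of `size` random bytes each."""
    for i in range(n_files):
        with open(os.path.join(dirpath, f"shard_{i}.bin"), "wb") as f:
            f.write(os.urandom(size))

def read_all(dirpath: str) -> int:
    """Read every file in the directory; return total bytes read."""
    total = 0
    for name in sorted(os.listdir(dirpath)):
        with open(os.path.join(dirpath, name), "rb") as f:
            total += len(f.read())
    return total

with tempfile.TemporaryDirectory() as small, \
     tempfile.TemporaryDirectory() as big:
    write_files(small, n_files=1024, size=4 * 1024)    # 1024 x 4 KiB
    write_files(big, n_files=1, size=4 * 1024 * 1024)  # 1 x 4 MiB

    t0 = time.perf_counter()
    small_bytes = read_all(small)
    t1 = time.perf_counter()
    big_bytes = read_all(big)
    t2 = time.perf_counter()

    print(f"1024 small files: {t1 - t0:.4f}s, one large file: {t2 - t1:.4f}s")
```

Training datasets often consist of millions of such small samples, which is why parallel file systems that amortize metadata operations matter for keeping GPUs fed.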
This is why DDN defines a new category: Data Intelligence for AI. Because in real-world deployments, you don’t win by having more GPUs. You win by keeping them fully utilized, all the time.
DDN EXAScaler, the engine behind Google Managed Lustre, was purpose-built for this reality:
- Parallel file architecture at extreme scale
- Deterministic performance under load
- Throughput that matches GPU demand curves
This is the difference between supporting AI workloads and powering AI factories.
The Bigger Shift: The Rise of AI Factories
This isn’t just a single deployment story; it reflects a broader transformation.
Enterprises are moving toward AI factories, where:
- Data and compute are tightly integrated
- Workloads span training, inference, and RAG
- Systems are designed for continuous, production-scale output
And in this model: The data platform is the factory floor. If it’s slow, everything slows. If it scales, everything accelerates.
Final Takeaway
One of the most important outcomes for Salesforce wasn’t just performance. It was focus.
With Google Managed Lustre powered by DDN:
- Infrastructure tuning and troubleshooting disappeared
- Engineering teams shifted from ops work to innovation
- AI pipelines scaled seamlessly from training to inference
AI success is no longer defined by model size or GPU count. It’s defined by how efficiently you move and manage data. Salesforce’s experience reinforces what we see across the market:
When you fix the data layer, everything else follows.
Salesforce moved from managing systems to building the future of AI.
To learn more, please view the case study at https://www.ddn.com/partners/googlecloud/.