Blog

Why AI Performance Starts with Data: How Salesforce Eliminated GPU Bottlenecks with DDN and Google Cloud

Artificial intelligence has entered a new phase. It’s no longer about experimenting with models; it’s about operationalizing AI at scale. But there’s a hard truth many organizations are now confronting:

Your AI system is only as fast as the data feeding it.

Even the most advanced GPUs can’t deliver results if they’re waiting on storage. That’s exactly the challenge Salesforce faced as it scaled large language model (LLM) training in production.

The Hidden Bottleneck in AI Infrastructure

Salesforce was training a Llama 3.1 (8B parameter) model using high-performance GPU clusters. On paper, the infrastructure was world-class. In practice, performance told a different story:

  • GPUs were underutilized, at times dropping to 40% utilization
  • Training jobs were slowed by high I/O latency
  • Teams were constantly tuning storage just to keep systems running
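This kind of bottleneck is easy to see once you time each training step in two parts: how long the step waits on data versus how long it actually computes. The sketch below is illustrative only, with simulated stand-ins (a `slow_load` that mimics high I/O latency and a trivially cheap `fast_compute`) rather than a real training loop:

```python
import time

def profile_step(load_fn, compute_fn):
    """Time one training step: (seconds waiting on data, seconds computing)."""
    t0 = time.perf_counter()
    batch = load_fn()              # stand-in for a storage read
    t1 = time.perf_counter()
    compute_fn(batch)              # stand-in for the GPU forward/backward pass
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

def slow_load():
    time.sleep(0.03)               # simulate high I/O latency
    return [0.0] * 1024

def fast_compute(batch):
    return sum(batch)              # simulate comparatively cheap compute

wait_s, work_s = profile_step(slow_load, fast_compute)
utilization = work_s / (wait_s + work_s)   # fraction of the step spent computing
print(f"effective utilization: {utilization:.0%}")
```

When the data-wait fraction dominates, adding more or faster GPUs changes nothing; only a faster data path does.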

This is the defining issue of modern AI infrastructure:

The data layer—not compute—is the primary constraint on AI performance.

A Different Approach: Data Intelligence at the Core

To solve this, Salesforce turned to Google Cloud Managed Lustre, a fully managed parallel file system powered by DDN EXAScaler. With it, organizations gain:

  • Proven high-performance infrastructure for AI and HPC
  • Fully managed deployment with enterprise-grade scalability
  • Optimized data pipelines that maximize GPU efficiency

This combination enables teams to:

  • Train faster
  • Reduce cost per token
  • Move from pilot to production without re-architecting

This wasn’t just a storage upgrade; it was a shift in architecture. Instead of treating storage as a passive system, Salesforce deployed a data intelligence platform engineered to feed GPUs at line rate. Integrated with their Google Cloud Vertex AI training cluster, the solution delivered:

  • Massively parallel throughput at scale
  • Seamless deployment with minimal operational overhead
  • A fully managed environment optimized for AI workloads

Once the data bottleneck was removed, the results were immediate and measurable:

  • 75% reduction in I/O latency
  • 1.5× faster model training
  • 70% increase in GPU utilization
  • 42% reduction in total training costs
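These figures are also internally consistent: at a fixed hourly GPU price, cost per token scales inversely with utilization, so a 70% utilization gain alone implies roughly the reported cost reduction. A back-of-envelope sketch, where the 40% and 70% figures come from above and the dollar and throughput numbers are purely illustrative assumptions:

```python
# Back-of-envelope: cost per token scales inversely with GPU utilization.
# The price and throughput below are assumed for illustration, not taken
# from the case study.
gpu_hour_cost = 2.50            # assumed $/GPU-hour
peak_tokens_per_hour = 1.0e9    # assumed tokens/hour at 100% utilization

def cost_per_million_tokens(utilization):
    """Effective $ per 1M tokens at a given GPU utilization."""
    effective_rate = peak_tokens_per_hour * utilization
    return gpu_hour_cost / effective_rate * 1.0e6

before = cost_per_million_tokens(0.40)         # the ~40% low point cited above
after = cost_per_million_tokens(0.40 * 1.70)   # after a 70% utilization gain
savings = (before - after) / before
print(f"${before:.4f} -> ${after:.4f} per 1M tokens ({savings:.0%} lower)")
```

Note that the ~41% savings falls out of the utilization gain alone (1 − 1/1.7), independent of the assumed dollar figures.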

These aren’t incremental gains; they redefine what’s possible.

DDN POV: AI Is a Data Problem First

At DDN, we’ve been consistent on one point:

AI infrastructure is not a compute problem. It’s a data problem.

The industry has spent the last several years optimizing GPUs. But as models scale and workloads become more complex (training, inference, RAG, agentic pipelines), a new reality has emerged:

  • Data pipelines are more complex than model architectures
  • Latency, not FLOPS, is the gating factor for performance
  • GPU utilization, not GPU count, determines ROI

Traditional storage systems were never designed for AI. They struggle with:

  • Massive concurrency
  • Small file access patterns
  • Continuous streaming for training and inference

This is why DDN defines a new category: Data Intelligence for AI. Because in real-world deployments, you don’t win by having more GPUs. You win by keeping them fully utilized, all the time.

DDN EXAScaler, the engine behind Google Cloud Managed Lustre, was purpose-built for this reality:

  • Parallel file architecture at extreme scale
  • Deterministic performance under load
  • Throughput that matches GPU demand curves

This is the difference between supporting AI workloads and powering AI factories.

The Bigger Shift: The Rise of AI Factories

This isn’t just a single deployment story; it reflects a broader transformation.

Enterprises are moving toward AI factories, where:

  • Data and compute are tightly integrated
  • Workloads span training, inference, and RAG
  • Systems are designed for continuous, production-scale output

And in this model: The data platform is the factory floor. If it’s slow, everything slows. If it scales, everything accelerates.

Final Takeaway

One of the most important outcomes for Salesforce wasn’t just performance. It was focus.

With Google Cloud Managed Lustre powered by DDN:

  • Infrastructure tuning and troubleshooting disappeared
  • Engineering teams shifted from ops work to innovation
  • AI pipelines scaled seamlessly from training to inference

AI success is no longer defined by model size or GPU count. It’s defined by how efficiently you move and manage data. Salesforce’s experience reinforces what we see across the market:

When you fix the data layer, everything else follows.

Salesforce moved from managing systems to building the future of AI.

To learn more, please view the case study at https://www.ddn.com/partners/googlecloud/.