
Empowering the AI Cloud: DDN’s New Reference Designs for NVIDIA HGX B300 and NVIDIA GB300 NVL72

In the relentless race of AI innovation, NVIDIA Cloud Partners (NCPs), the specialized cloud service providers powering enterprise AI at scale, are at the forefront, delivering high-performance GPU resources to customers worldwide. Companies like Vultr, CoreWeave, Nebius, Core42, Lambda, and Crusoe are transforming how businesses use generative AI, large language models, and scientific simulations through flexible, on-demand cloud services. As these NCPs scale their offerings, their infrastructure must match the explosive demands of next-gen workloads. Enter NVIDIA’s Blackwell architecture, a quantum leap in GPU acceleration embodied in the NVIDIA HGX B300 and NVIDIA GB300 NVL72 systems. These powerhouses, built on Blackwell Ultra GPUs, promise unprecedented performance: up to 144 PFLOPS FP4 and 72 PFLOPS FP8 in HGX B300 systems, a massive 2.1 TB of HBM3e memory per HGX system, and exascale capabilities in GB300 NVL72 racks boasting 72 Blackwell Ultra GPUs, 37 TB of fast memory, and 130 TB/s of NVLink bandwidth, all while delivering superior energy efficiency.

But compute alone isn’t enough for NCPs serving diverse customer bases. To unleash Blackwell’s full potential and ensure seamless service delivery, you need storage that can feed Blackwell GPUs at full speed, maximizing GPU utilization and delivering petabytes of throughput without bottlenecks, while supporting multi-tenant isolation, rapid provisioning, and cost-efficient scaling.

Today, we’re thrilled to announce two new DDN AI Factory Solutions (A3I) Reference Designs for NVIDIA Cloud Partners (NCPs), covering NVIDIA HGX H200, B200, and B300 platforms as well as NVIDIA GB200 NVL72 and GB300 NVL72 systems. These fully validated and optimized High-Performance Storage (HPS) blueprints are tailored specifically for Blackwell Ultra deployments in cloud environments. Drawing from our battle-tested integrations, these designs pair DDN’s AI400X3 appliances and Insight software stack with NVIDIA B300 and GB300 NVL72 systems. They’re not just theoretical; they’re proven at scale, powering over 1,000,000 GPUs worldwide and enabling NCPs like CoreWeave and Nebius to offer reliable, high-availability GPU clouds to their customers.

Whether you’re an NCP like Vultr building a 1,000-GPU starter cluster for emerging AI startups, or one like Core42 scaling a massive 41,000-GPU cloud region for hyperscale enterprise clients, these reference designs ensure predictable performance, seamless scalability, and rock-solid reliability, tailored to the unique needs of cloud service providers delivering AI-as-a-Service. Let’s dive into why they’re a game-changer for NCP customers.

Why Blackwell Ultra GPUs Redefine Cloud AI and Why Storage Must Keep Pace

NVIDIA’s Blackwell isn’t incremental; it’s revolutionary, perfectly suited for NCPs offering differentiated GPU cloud services. The B300 GPU, integrated into HGX B300 8-GPU platforms, delivers blistering performance with 279 GB of HBM3e memory per GPU, roughly double the bandwidth of its predecessors at ~8 TB/s, and enhanced FP4/INT8 throughput for demanding AI tasks. Meanwhile, the GB300 NVL72 racks, packing 72 Blackwell Ultra GPUs and 36 Grace CPUs, crank out staggering AI performance in a liquid-cooled, NVLink-connected behemoth designed for the largest cloud-scale AI, with up to 20 TB of GPU memory and 576 TB/s of bandwidth.

For NCP customers, this means:

  • Massive Throughput Gains: Higher FP4 throughput and greater INT8 performance, enabling real-time handling of trillion-parameter models to attract premium AI customers.
  • Energy and Space Efficiency: Superior efficiency over prior generations, reducing your cloud’s carbon footprint while fitting more compute into denser racks, potentially ~150 kW per GB300 rack, to maximize revenue per data center square foot.
  • Unified Fabric: NVLink 5.0 interconnects GPUs at 1.8 TB/s bidirectional bandwidth per GPU, demanding storage that can feed data across this fabric without latency spikes, ensuring SLAs for customer workloads.

Yet, as NVIDIA’s own Enterprise Reference Designs emphasize, success for NCPs hinges on validated, end-to-end designs that support multi-customer isolation and elastic scaling. Without optimized storage, Blackwell GPUs sit idle, wasting cycles on I/O waits that can slash productivity by 50% or more, a classic case of GPU starvation that erodes customer trust and margins. DDN AI400X3 flips the script, architected from the ground up for Accelerated, Any-Scale AI in cloud environments. Our shared parallel architecture ensures every layer, from NVMe drives to containerized apps on Blackwell GPUs, operates in lockstep, delivering high-throughput, low-latency data at massive concurrency, with built-in multi-tenancy support.
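
To make the math behind that claim concrete, here’s a minimal back-of-the-envelope sketch in Python. The step times are purely illustrative, not measured figures:

```python
# Illustrative model: effective GPU utilization when training steps
# stall on I/O. All numbers are hypothetical, chosen only to show the
# arithmetic behind the "50% or more" productivity loss noted above.

def effective_utilization(compute_s: float, io_stall_s: float) -> float:
    """Fraction of wall-clock time the GPU spends computing per step."""
    return compute_s / (compute_s + io_stall_s)

step_compute = 1.0              # seconds of pure GPU work per step
for stall in (0.0, 0.25, 1.0):  # seconds spent waiting on storage per step
    util = effective_utilization(step_compute, stall)
    print(f"I/O stall {stall:4.2f}s -> GPU utilization {util:.0%}")

# A step that waits as long as it computes (1.0s stall) halves
# utilization, which is why storage throughput directly sets the
# ceiling on how much revenue-generating work each GPU delivers.
```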

Fueling Blackwell Ultra GPU Acceleration at Exascale: The DDN Advantage

DDN’s new reference designs aren’t off-the-shelf guesses; they’re co-engineered with NVIDIA, certified for HGX compatibility, and validated across cloud topologies to help NCPs like Vultr and Core42 deploy faster and serve more customers. At the core is the DDN AI400X3 appliance, a beast delivering over 1 TB/s read throughput per unit with linear scaling across racks. Paired with DDN Insight software, it provides intuitive management, predictive analytics, and seamless integration with NVIDIA AI Enterprise, NVIDIA Spectrum-X Ethernet, and NVIDIA Quantum-2 InfiniBand, ideal for provisioning isolated storage pools per customer.

Proven Certification and Deployment Blueprints

These designs cover flexible scales for NCP growth, from proof-of-concepts to production clouds:

  • 1,152-GPU (GB300 NVL72) or 1,024-GPU (B300 HGX) Clusters: Ideal for emerging providers testing the Blackwell waters, with multirail networking via LACP bonding to handle bursty customer demands.
  • 16,128-GPU (GB300) or 16,384-GPU (B300) Regions: Enterprise-grade for players like Nebius, balancing capacity (up to 100 PB raw) with redundancy for 99.999% uptime and dynamic resource allocation across tenants.
  • 41,472-GPU Mega-Deployments: Hyperscale ready for CoreWeave-scale operations, incorporating Hot Nodes for read caching and NUMA-aware optimizations to handle exabyte-scale datasets from global customers.

Every blueprint includes detailed storage sizing guidelines and networking topologies, ensuring plug-and-play deployment in your multi-tenant cloud environment.
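
As a first-pass illustration of that sizing exercise, the sketch below estimates appliance counts for the deployment tiers named above, using the >1 TB/s per-appliance read figure cited earlier for the AI400X3 and a hypothetical per-GPU bandwidth target; real sizing should follow the detailed guidelines in the blueprints:

```python
# First-pass storage sizing sketch for the cluster tiers above.
# APPLIANCE_READ_TBPS reflects the >1 TB/s read per AI400X3 cited in
# this post; PER_GPU_GBPS is an assumed planning target, not a spec --
# substitute your own workload profile.
import math

APPLIANCE_READ_TBPS = 1.0   # ~1 TB/s sustained read per AI400X3
PER_GPU_GBPS = 2.0          # hypothetical read-bandwidth target per GPU

for gpus in (1_024, 16_384, 41_472):   # tiers named in the blueprints
    demand_tbps = gpus * PER_GPU_GBPS / 1_000
    appliances = math.ceil(demand_tbps / APPLIANCE_READ_TBPS)
    print(f"{gpus:>6} GPUs -> {demand_tbps:7.1f} TB/s aggregate read "
          f"-> >= {appliances} AI400X3 appliances")
```

Even crude arithmetic like this shows why, at these scales, aggregate throughput rather than raw capacity tends to drive the appliance count.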

End-to-End Optimizations That Maximize Blackwell Ultra GPU Utilization for Cloud Services

What sets the DDN AI400X3 apart for NCPs? It’s the deep, NVIDIA-tuned integrations that eliminate workflow friction and enable efficient customer onboarding:

  • Shared Parallel Architecture: True end-to-end parallelism from drives to GPUs, with redundancy and automatic failover for bulletproof reliability. Data flows at high throughput and low latency, keeping all Blackwell cycles productive; no more GPU starvation during peak customer usage.
  • Streamlined Deep Learning Pipelines: Concurrent execution across phases (ingest, train, checkpoint) via a unified file interface. Parallel training of neural network variants accelerates discovery by 5x, with zero data-movement overhead, perfect for NCPs offering managed AI services.
  • Multirail Networking & LACP Bonding: Group multiple HGX/GB300 interfaces for aggregate bandwidth exceeding 800 Gb/s per node. Dynamic load balancing and health monitoring make high-performance fabrics simple to deploy and manage, supporting SDN for tenant isolation.
  • NUMA-Aware Client & Hot Nodes: Automatically localize I/O to minimize latencies, plus local NVMe caching for repeated reads such as training datasets (see the sketch after this list). This slashes network traffic, speeds checkpointing by 15x, and frees shared storage for critical ops, all transparent to apps and users, reducing costs for your customers.
  • Insight-Driven Management: Rich metrics for cache utilization, predictive scaling, and scheduler integration, empowering NCPs to optimize data loading, forecast demand, and maximize ROI in real time across multi-customer environments.
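
To give a feel for the Hot Nodes caching pattern referenced above, here’s a minimal read-through cache sketch in Python. The mount points are hypothetical and the logic only mimics the general idea; DDN’s actual client performs this caching transparently, with no application changes:

```python
# Minimal read-through cache in the spirit of Hot Nodes: the first read
# of a file comes from shared storage over the network, repeat reads are
# served from local NVMe. Paths are hypothetical placeholders.
import shutil
from pathlib import Path

SHARED = Path("/mnt/ddn")        # shared parallel filesystem (assumed mount)
LOCAL = Path("/nvme/hot-cache")  # local NVMe cache directory (assumed)

def read_through(relpath: str) -> bytes:
    cached = LOCAL / relpath
    if not cached.exists():                  # cold: fetch once over the network
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(SHARED / relpath, cached)
    return cached.read_bytes()               # hot: local-NVMe speed, no network

# Repeated epochs re-reading the same shards hit local NVMe, which is
# what cuts network traffic and frees shared storage for checkpoints.
data = read_through("datasets/shard-0001.bin")
```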

In benchmarks, these features deliver 15x faster recovery cycles and up to 99% GPU utilization, turning your Blackwell investment into a revenue engine for services like those offered by Core42’s sovereign AI clouds.

Blackwell + DDN AI400X3: Scalable, Resilient, and Future-Proof for NCP Clouds

Upgrading to NVIDIA Blackwell isn’t just about speed for NCPs; it’s about building clouds that scale without limits, serving more customers with lower TCO. DDN AI Factory Solutions reference designs flex from rack-scale proofs to multi-region behemoths, with seamless capacity expansion and no replatforming required. You get lower TCO through energy-efficient designs, simplified ops via turnkey validation, and maximized ROI by ensuring every Blackwell GPU hums at peak efficiency for your end users.

Trusted in NVIDIA’s flagship deployments and now tailored for NCPs like Vultr, CoreWeave, Nebius, Core42, and beyond, these blueprints bring supercomputing-grade tech to your cloud: the same stack powering the world’s top AI factories, now optimized for your Blackwell journey and customer-centric services.

Ready to supercharge your NVIDIA Cloud Partner infrastructure? Whether you’re plotting a B300 HGX refresh or a GB300 NVL72 rollout to expand your offerings, DDN’s experts are here to guide you. Contact us today to download the full reference designs, schedule a deep dive, or explore custom sizing for your deployment. Let’s build the AI cloud of tomorrow, together.

How do DDN’s reference designs improve GPU utilization in Blackwell deployments?

DDN’s AI400X3 architecture delivers high-throughput, low-latency storage to keep Blackwell GPUs fed, eliminating I/O bottlenecks and reducing GPU starvation. This ensures up to 99% GPU utilization, even under multi-tenant cloud workloads.

What role do Hot Nodes play in scaling GPU acceleration?

Hot Nodes provide intelligent NVMe caching close to compute, reducing network traffic and accelerating repeated reads, which speeds checkpointing by up to 15× for Blackwell GPU clusters.

How do these designs address GPU starvation in large-scale clouds?

By combining parallel file access, NUMA-aware clients, and multirail networking, DDN ensures Blackwell GPUs receive data at line rate, preventing starvation and maintaining predictable performance for all tenants.
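
For a flavor of the link-health monitoring that multirail bonding relies on, here’s a minimal sketch that sums the bandwidth of healthy rails in a Linux LACP bond. It assumes a bond named bond0 and is illustrative only; the DDN client’s monitoring and dynamic load balancing go well beyond this:

```python
# Minimal LACP bond health check: sum the speeds of member links whose
# MII status is up. Assumes a Linux bond named "bond0"; the bonding
# driver exposes per-slave state in /proc/net/bonding/<name>.
from pathlib import Path

def healthy_bandwidth_gbps(bond: str = "bond0") -> float:
    total, in_slave, up = 0.0, False, False
    for line in Path(f"/proc/net/bonding/{bond}").read_text().splitlines():
        line = line.strip()
        if line.startswith("Slave Interface:"):
            in_slave, up = True, False       # new member section begins
        elif in_slave and line.startswith("MII Status:"):
            up = line.endswith("up")         # link health of this member
        elif in_slave and up and line.startswith("Speed:"):
            total += float(line.split()[1]) / 1000.0  # "Speed: 100000 Mbps"
    return total

print(f"usable bond bandwidth: {healthy_bandwidth_gbps():.0f} Gb/s")
```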

Are the DDN reference designs validated for NVIDIA HGX platforms?

Yes. These designs are co-engineered with NVIDIA, certified for HGX compatibility, and validated across cloud topologies including B300 and GB300 deployments.

Can these designs scale from small GPU clusters to hyperscale deployments?

Absolutely. DDN’s blueprints support everything from 1,000-GPU starter clusters to 40,000+ GPU mega-deployments, allowing NCPs to scale GPU acceleration seamlessly without rearchitecting storage.

Are the DDN reference designs validated for NVIDIA DGX SuperPODs? 

Yes. These designs are co-engineered with NVIDIA, certified for DGX SuperPOD compatibility, and validated across cloud topologies including B300 and GB300 deployments. 
