There is a lot more to unleashing the power of Artificial Intelligence (AI) and Deep Learning than throwing a bunch of GPUs at the problem. Building the right machine learning environment takes foresight and planning, so that future requirements won’t slow down your mission and you can move from data science to actionable business impact.

Storage Isn’t Just About Holding Data

Applying the correct storage platform to your AI project is essential to maximizing its business value. Ultimate value is only achieved when the entire infrastructure (applications, compute, containers, networks and storage) works in harmony to exploit the available data, at whatever scale. A major limiting factor can be I/O bottlenecks. Legacy storage solutions were not designed for the low-latency, highly parallel, mixed-workload requirements found in many stages of the AI data lifecycle. During early development, with limited training data sets, these bottlenecks may not be immediately apparent, but the true promise of business impact comes when these projects are applied to vast amounts of data. Customers have found that short-term solutions, like local or general-purpose shared storage, are major limiters when it comes to scaling. They hit a point where either the environment is no longer manageable or the GPU-based systems sit essentially idle for lack of I/O performance, and the project requires a re-architecture that is costly in both time and capital.

So what are the storage considerations to ensure rapid scaling and greatest value?

  • Saturate your AI platform: Many projects initially address the up-front enabling power of GPU-based compute systems, but neglect the storage side. The correct storage platform will ensure that GPU cycles don’t remain idle due to applications waiting for the storage to respond.
  • Ingest capabilities to match future needs: Gathering data into a central repository will be a critical factor in creating a source that the Deep Learning model can run against once it is ready for production. Collecting data into this repository will require the ability to quickly ingest information from a wide variety of sources. For storage systems, ingest means write performance and coping with large concurrent streams from distributed sources at huge scale, in addition to the ability to scale capacity.
  • Flexible and fast access to data: As an AI-enabled data center moves from initial prototyping and testing towards production and scale, a flexible data platform should provide the means to scale in any one of multiple areas: performance, capacity, ingest capability, Flash-HDD ratio and responsiveness for data scientists. Such flexibility also implies expansion of a namespace without disruption, eliminating data copies and complexity during growth phases.
  • Start small, but scale simply and economically: Scalability is measurable in terms of not only performance, but also manageability and economics. A successful AI program can be designed to start with a few TBs (terabytes) of data, but must be able to easily ramp to multiple PBs (petabytes) without re-architecting the environment. One way to scale economically is to optimize the use of storage media depending on workload. AI platform architects should consider tightly integrated, scale-out hybrid architectures designed specifically for AI.
  • Partner with a vendor who understands the whole environment, not just storage: Delivering performance to the AI application is what matters, not how fast the storage can push out data. The chosen storage platform vendor must recognize that integration and support services span the whole environment, beyond just storage, in order to deliver results faster.
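
To make the first consideration concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from any vendor tool; the function names and every number in it (GPU count, samples per second, sample size, storage throughput) are hypothetical assumptions chosen only for illustration. It compares the read rate a set of GPUs would demand with what the storage can deliver.

```python
def required_throughput_gbps(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read rate (GB/s) needed to keep every GPU fed.

    Simplified model: each GPU consumes a fixed number of samples per
    second, and each sample is read from storage exactly once.
    """
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def gpu_utilization(storage_gbps, num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Fraction of GPU demand the storage can satisfy, capped at 1.0."""
    demand = required_throughput_gbps(num_gpus, samples_per_sec_per_gpu, bytes_per_sample)
    return min(1.0, storage_gbps / demand)


if __name__ == "__main__":
    # All values below are illustrative assumptions, not measurements.
    gpus = 8
    samples_per_sec = 2000      # per GPU
    sample_bytes = 150_000      # roughly a 150 KB image

    need = required_throughput_gbps(gpus, samples_per_sec, sample_bytes)
    util = gpu_utilization(1.2, gpus, samples_per_sec, sample_bytes)
    print(f"Storage must sustain about {need:.1f} GB/s")
    print(f"With 1.2 GB/s of storage, GPUs are busy about {util:.0%} of the time")
```

Under this simple model, storage that delivers half of the demanded read rate leaves the GPUs idle about half of the time. Real pipelines also depend on caching, prefetching and data format, which the sketch ignores.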

Attending NVIDIA GTC 2018?

Meet with the DDN team at this year’s NVIDIA GTC conference to discover how DDN can help you support your machine learning and artificial intelligence initiatives and address your data growth in the most adaptable and cost-efficient manner possible.

  • George Vacek
  • Global Business Director - Life Sciences, Healthcare and Machine Learning
  • Date: March 20, 2018