The issue is clear: the adoption of machine learning is taxing underlying data access and management infrastructure.

Prototypes and first-generation infrastructure are typically built on whatever enterprise storage the organization already uses, or the team building it decides to roll its own with a white box and a mix of open source, homegrown, and commercial tools and applications. A successful machine learning program, however, usually runs into a problem of scale.

In general, the more data machine learning can incorporate, the better the results, which pushes a successful project to keep growing. Unfortunately, when this happens, the first-generation infrastructure begins to strain, and scaling failures show up in a variety of ways, from the inability to deliver data at the required speed to the inability to scale storage in a footprint that is cost-effective and easy to manage. Any one of these failures can derail the overall program, because if you can't grow your inputs or increase the depth of your deep learning network, you can't scale your outputs.

At this point, the project is in trouble, and a complete re-tooling is needed. For commercial projects, the larger expense is usually on the opportunity side, where time lost is a gift to the competition, rather than on the capital and staffing side. To avoid re-tooling, some teams try side-by-side silos: they copy the non-scaling architecture and point half their computation at the new gear. Now they have two environments to manage, and any inputs or outputs that need to be shared require double the storage space and double the management. This approach also adds a lot of latency to job completion when results from the first silo must be used as inputs to the second. It only gets worse when the program needs to add a third silo.
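
To make the duplication cost concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a deliberately simple model in which every silo keeps its own full copy of the shared data; the function names and the terabyte figures are hypothetical and chosen only for illustration.

```python
def silo_storage_tb(private_tb_per_silo: float, shared_tb: float, silos: int) -> float:
    """Total capacity when every silo holds its own copy of the shared data."""
    if silos < 1:
        raise ValueError("silos must be at least 1")
    return silos * (private_tb_per_silo + shared_tb)


def unified_storage_tb(private_tb_per_silo: float, shared_tb: float, silos: int) -> float:
    """Total capacity when one scalable system holds a single shared copy."""
    if silos < 1:
        raise ValueError("silos must be at least 1")
    return silos * private_tb_per_silo + shared_tb


def duplication_overhead_tb(private_tb_per_silo: float, shared_tb: float, silos: int) -> float:
    """Extra capacity consumed purely by duplicated shared data."""
    return (silo_storage_tb(private_tb_per_silo, shared_tb, silos)
            - unified_storage_tb(private_tb_per_silo, shared_tb, silos))


if __name__ == "__main__":
    # Hypothetical figures: 100 TB of private data per silo, 50 TB shared.
    for n in (1, 2, 3):
        print(n, "silo(s):", duplication_overhead_tb(100, 50, n), "TB duplicated")
```

Under this model the duplicated capacity grows linearly with each added silo, and that is before counting the management effort and the copy latency between silos.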

The most successful projects we have seen avoid this re-tooling period because infrastructure owners take the time, usually near the end of the prototyping phase, to think through potential data scenarios: low, expected, and high requirements for data access, computation, retention, and protection. A relatively small planning effort can save a lot of time and money, and it simply makes sense to plan for success.
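
As one possible way to run this kind of planning exercise, the sketch below projects storage capacity for low, expected, and high scenarios using simple compound growth. The `Scenario` class, the starting size, the growth rates, and the three-year horizon are all hypothetical placeholders; a real plan would substitute measured figures and would also cover access speed, computation, retention, and protection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    start_tb: float        # capacity in use at the end of prototyping (assumed)
    annual_growth: float   # 0.25 means 25% growth per year (assumed)


def projected_capacity_tb(scenario: Scenario, years: int) -> float:
    """Compound the starting capacity over the planning horizon."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return scenario.start_tb * (1 + scenario.annual_growth) ** years


def plan(scenarios: list[Scenario], years: int) -> dict[str, float]:
    """Return projected capacity in TB per scenario, rounded to 0.1 TB."""
    return {s.name: round(projected_capacity_tb(s, years), 1) for s in scenarios}


# Hypothetical inputs for illustration only.
SCENARIOS = [
    Scenario("low", start_tb=100, annual_growth=0.25),
    Scenario("expected", start_tb=100, annual_growth=1.0),
    Scenario("high", start_tb=100, annual_growth=3.0),
]

if __name__ == "__main__":
    for name, tb in plan(SCENARIOS, years=3).items():
        print(f"{name:>8}: {tb} TB after 3 years")
```

Even with made-up numbers, the spread between the low and high cases shows why an architecture sized only for the expected case can be forced into re-tooling.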

The larger the project, the more likely it is to need a storage system that can scale in capacity, along with high-density building blocks that reduce the number of devices and the data-center overhead as the project grows. The system should be able to deliver performance at scale, meaning it needs to show it can work well at the projected scale with technologies like parallel file systems and flash. For a future-proof infrastructure, the system also needs to handle older or colder data simply and cost-effectively, and to have a clear path to future technologies like new flash formats and flash-native tools that maximize flash performance while avoiding flash-specific performance and longevity hurdles.

So, look for scalable, high-performance storage that has the intelligence to manage flash and active archive. Also look for something that can handle your medium- and high-potential outcomes in a system that can grow performance and capacity without needing side-by-side silos. Finally, look for suppliers that can offer you visionary technology for the flash era—after all, your machine learning program is going to be hugely successful, so why not plan for it?

  • Laura Shepard