
The issue is clear: the adoption of machine learning is taxing underlying data access and management infrastructure.

Prototypes and first-generation infrastructure are typically built either on whatever enterprise storage the organization already uses, or by a team that rolls its own with white-box hardware and a mix of open source, home-grown, and commercial tools and applications. A successful machine learning program, however, usually runs into a problem of scale. In general, the more data it can incorporate, the better the results will be, and that pushes the successful machine learning project to grow and grow.

When this happens, the generation-one infrastructure starts to show stress. Scaling failures appear in a variety of ways: the inability to deliver data access at the required speed, the inability to scale the amount of transformation performed on the data to improve findings, the inability to scale data storage in a footprint that is easy or cost-effective to manage, and so on. Any one of these failures can derail the overall program, because if you can't grow your inputs or increase the depth of your deep learning network, you can't scale your outputs.
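To make the data-access failure mode concrete, here is a back-of-envelope sketch of the read bandwidth a modest training fleet can demand. Every number below is an illustrative assumption, not a benchmark or a DDN figure:

```python
# Rough estimate of sustained read bandwidth needed to keep a training
# fleet busy. All values are assumed for illustration only.

gpus = 64                      # accelerators training concurrently (assumed)
samples_per_sec_per_gpu = 900  # per-GPU training throughput (assumed)
bytes_per_sample = 150_000     # average size of one training record (assumed)

required_read_bw = gpus * samples_per_sec_per_gpu * bytes_per_sample  # bytes/s
print(f"Sustained read bandwidth needed: {required_read_bw / 1e9:.1f} GB/s")
# ~8.6 GB/s with these assumptions -- roughly where a single NAS head or a
# hand-rolled white-box server stops keeping the accelerators fed.
```

Scale the fleet or the sample size up and the required bandwidth grows linearly, which is exactly the pressure that exposes a generation-one design.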

At this point, the project is in trouble, and a complete re-tooling is needed. As you can imagine, for commercial projects the expense is usually bigger on the opportunity side, where time lost is a gift to the competition, than on the capital and manpower side. To try to avoid re-tooling, some programs turn to side-by-side silos: they copy the non-scaling architecture and point half their computation at the new gear. But now they have two environments to manage, any inputs or outputs that need to be shared consume double the storage space and double the management effort, and jobs take longer to complete when results from silo 1 have to be fed back as inputs to silo 2. This, of course, only gets worse when the program needs to add silo 3.

The most successful projects we have seen that avoid this re-tooling period are the ones where infrastructure owners take the time – usually near the end of the prototyping phase – to think about various potential data scenarios: low, expected, and high requirements for data access, computation, retention, and protection.  A relatively small planning effort can save a lot of time and money, and it just makes sense to plan for success.
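A minimal planning sketch of that exercise might look like the following. The scenario names mirror the low, expected, and high cases above; the ingest rates, bandwidth targets, retention windows, and protection overhead are placeholders you would replace with your own projections:

```python
# Compare low / expected / high data scenarios before committing to
# generation-two infrastructure. All figures are placeholder assumptions.

scenarios = {
    #            ingest TB/mo, read GB/s, retention (months)
    "low":      (       20,         2,        12),
    "expected": (      100,         8,        24),
    "high":     (      500,        25,        36),
}

for name, (ingest_tb_per_month, read_gbps, retention_months) in scenarios.items():
    raw_capacity_tb = ingest_tb_per_month * retention_months
    protected_tb = raw_capacity_tb * 1.3   # assumed overhead for data protection
    print(f"{name:>8}: keep ~{protected_tb:,.0f} TB online, "
          f"sustain ~{read_gbps} GB/s to the training cluster")
```

Even a table this simple forces the conversation about whether the gear you are buying for the low case has a credible path to the high case.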

The larger the project, the more likely it is to need a storage system that can scale to large capacity, with high-density building blocks that reduce the number of devices and the data-center overhead as the project grows. The system should deliver performance at scale, meaning it must demonstrate that it works well at the projected scale, using technologies like parallel file systems and flash. For future-proofed infrastructure, it also needs to handle older, colder data simply and cost-effectively, and to have a clear path to future technologies such as new flash formats and flash-native tools that maximize flash performance while avoiding flash-specific performance and longevity pitfalls.
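To make the "older and colder" point concrete, here is an illustrative sketch of identifying cold data by last-access time. Real platforms do this with policy engines built into the file system; this stand-in only shows the idea, and the threshold and dataset path are assumptions:

```python
# Find data that has gone unread long enough to be a candidate for a
# cheaper archive tier. Threshold and path are illustrative assumptions.

import time
from pathlib import Path

COLD_AFTER_DAYS = 90                    # assumed threshold for "cold" data
DATASET_ROOT = Path("/data/training")   # hypothetical dataset location

now = time.time()
cold_bytes = 0
for path in DATASET_ROOT.rglob("*"):
    if path.is_file():
        stat = path.stat()
        idle_days = (now - stat.st_atime) / 86400
        if idle_days > COLD_AFTER_DAYS:
            cold_bytes += stat.st_size

print(f"~{cold_bytes / 1e12:.1f} TB eligible for the archive tier")
```

The point is not the script itself but that a scalable system should make this kind of tiering automatic and transparent rather than a manual chore.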

So, look for scalable, high performance storage that has intelligence built in to manage flash and active archive. Look for something that can handle your medium- and high-potential outcomes in a system that can grow performance and capacity without needing side-by-side silos. And look for suppliers that can offer you visionary technology for the flash era – after all, your machine learning program is going to be hugely successful, so plan for it.

Laura Shepard