Avoiding AI Bottlenecks
By Bob O’Donnell
While nearly everyone has heard, at a general level, about the impressive new capabilities that artificial intelligence (AI) and machine learning (ML) are enabling, few people dig into the specifics of what these technologies can do and how those results are achieved. Of course, part of the reason is that the inner workings of AI/ML can get incredibly complex very quickly, but there are some basic concepts that anyone can understand.
As discussed in the first segment of this blog series (“Making AI More Efficient”), some of the “magic” of AI really boils down to the ability to recognize patterns in a given set of data, convert those patterns into algorithms, and then use those trained algorithms to infer predictions or other worthwhile observations from a new set of input data. For example, one of the more interesting AI-powered applications is Natural Language Processing, or NLP, a technology designed to understand the meaning of human language. To make it work, documents (or spoken audio) are broken down into individual words, and the relationships between those words and phrases are analyzed in a multi-step process that eventually leads to meaning and context.
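To make that first step concrete, here is a deliberately tiny sketch of what “breaking a document into words and analyzing relationships” can look like. The function names and the co-occurrence counting are illustrative assumptions, not how any particular NLP system is implemented; real pipelines use far more sophisticated tokenizers and models.

```python
from collections import Counter

def tokenize(text):
    """Break a document into lowercase word tokens, stripping punctuation."""
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

def cooccurrence(tokens, window=2):
    """Count how often pairs of words appear within `window` tokens of each
    other -- a crude stand-in for the relationship analysis an NLP pipeline
    performs on its way to meaning and context."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            pairs[tuple(sorted((word, other)))] += 1
    return pairs

tokens = tokenize("Earnings rose sharply. Analysts expect earnings to rise again.")
pairs = cooccurrence(tokens)
```

Even this toy version hints at the data appetite involved: meaningful word-relationship statistics only emerge after counting pairs across very large document collections.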
The rules by which that analysis is done form the basis for a trained algorithm that, when it is presented with a new document or segment of spoken audio, can determine what is meant and, in some cases, respond accordingly. NLP technology can be used in things like digital assistants and online chatbots as well as sophisticated document processing applications used to generate stock market predictions from things like analyst reports, financial earnings, investor calls, government data and much more.
As you might imagine, an application like this financial prediction model requires an enormous amount of input data in order to start deriving meaningful conclusions. The same is true of many AI- or ML-powered applications. It also requires that many of these calculations be performed in near real time, which places enormous demands on the computing systems designed to run these workloads. One of the keys to meeting these demands is running millions of operations in parallel, a task for which GPUs and specialized AI processors are very well suited.
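The idea of splitting a workload into independent pieces that run side by side can be sketched in a few lines. This is only an analogy: the worker pool, batch sizes, and the hypothetical `score_batch` model below are illustrative stand-ins, and a real system would run these operations on GPU cores rather than CPU threads.

```python
from concurrent.futures import ThreadPoolExecutor

def score_batch(batch):
    """Stand-in for one inference step: score every record in a batch.
    (The linear formula here is a hypothetical toy model.)"""
    return [x * 0.5 + 1.0 for x in batch]

# Split the input data into independent batches so they can be
# processed concurrently rather than one after another.
data = list(range(8))
batches = [data[i:i + 2] for i in range(0, len(data), 2)]

# Fan the batches out across parallel workers, then gather the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score_batch, batches))

scores = [s for batch in results for s in batch]
```

The key point is that each batch is independent, so adding more workers (or GPU cores) increases throughput, provided every worker can actually be kept supplied with data.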
But to make these GPUs and other computing engines run as quickly and efficiently as possible, they need to be fed a continuous stream of data. It turns out this can be a real problem for many traditional network storage solutions — even those with what appear to be very fast throughput and data transfer rates. As a result, storage systems often become a serious bottleneck, preventing AI models from running as quickly as they could.
Building on the previous blog’s car analogy, it’s not just about providing the “fuel” (or data for AI models) en masse to the computing engine in order to achieve the best performance. You have to provide multiple streams of fuel to each of the different sub-sections of the engine, such as to each cylinder, in order to maximize the potential of the system’s operation. Now, for general purpose automobiles this level of refinement isn’t necessary, and such is the case for certain types of less complex and less time-sensitive AI models as well. But when it comes to high-performance computing (HPC) applications, details like this not only matter, they can make a huge difference.
The key “trick” of a parallel storage system is the ability to not only forward along large amounts of data very quickly, but to do so in discrete simultaneous channels. That way parallel computing-based systems can work as efficiently as possible and not get stalled waiting for data to be delivered to other parts of the system. In conjunction with GPUs, for example, a properly tuned and optimized parallel storage system, such as storage appliances made by DDN, can deliver multiple separate streams of data that can each be acted upon by different GPUs.
A traditional storage system reads multiple data objects from across a rack of appliances, multiplexes them into a single stream over a single physical connection, and then demultiplexes that stream out to the GPUs. That single connection can become a significant bottleneck in AI systems.
By contrast, a true parallel system such as a DDN storage appliance reads the data from multiple devices across the rack of appliances, then distributes the data streams simultaneously across multiple parallel network connections, each running at the full speed of the available network. This allows the AI-based, GPU-powered computing systems to run as efficiently as possible.
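The contrast between the two approaches can be sketched schematically. Everything here is a simplified assumption: the “devices,” shard names, and thread-based readers below are illustrative, not a model of any actual storage product.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical setup: four storage "devices", each holding a shard of
# the training data that a GPU worker needs to consume.
devices = {
    f"device{i}": [f"device{i}-chunk{j}" for j in range(3)]
    for i in range(4)
}

def serial_read(devices):
    """Traditional approach: every shard is multiplexed into one stream
    over one connection, which every consumer must then wait on in turn."""
    stream = []
    for shard in devices.values():
        stream.extend(shard)
    return stream  # one stream = one potential bottleneck

def parallel_read(devices):
    """Parallel approach: each device delivers its shard over its own
    connection, so all consumers are fed simultaneously."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = {name: pool.submit(list, shard)
                   for name, shard in devices.items()}
    return {name: f.result() for name, f in futures.items()}

streams = parallel_read(devices)  # one independent stream per device
```

In the serial case the aggregate transfer rate is capped by the single connection; in the parallel case each stream can run at full speed, which is what keeps every GPU busy.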
Many people who take on the challenge of putting together sophisticated computing systems for processing large AI models focus on the core engine. And it’s true that large numbers of superfast GPUs are great for AI model performance, but if they can’t get simultaneous access to the data they need to analyze, the process of training models will be slow and inefficient. Only by leveraging parallel storage systems can you avoid the bottlenecks that often plague systems that try to use normal network storage for AI applications.
In the next chapter of this blog series, I’ll be looking at how to achieve both performance and scale (the Best of Both Worlds) when it comes to architecting AI-focused systems.