Best of Both Worlds: Achieving Performance and Scale for AI Workloads
Achieving system performance and scale for AI workloads has its challenges – it’s easy to get one or the other, but achieving both is hard.
As we enter 2023, it’s becoming increasingly clear that AI and AI-powered products and services are quickly becoming mainstream. The incredible interest in generative AI tools like ChatGPT, Dall-E and others has thrust AI-powered computing efforts into a very bright spotlight.
As a result, there’s a great deal of new interest in AI. Organizations want to leverage the kinds of amazing capabilities that these generative AI tools have now shown are possible with today’s technology. Along the way, however, many are running into the familiar roadblocks that AI specialists have been aware of for many years: it’s hard to get the right combination of compute resources, storage systems and datasets working together in a productive, efficient way. That’s why DDN provides an AI Success Guide – to help AI newcomers navigate these challenges.
In the blog “Avoiding AI Bottlenecks”, some of the key issues that impact AI workloads – notably the critical need for parallel file systems and storage devices – were introduced. But to really achieve the kinds of impressive results that generative AI-inspired data scientists and business professionals are now considering, it’s important not only to overcome basic limitations, but also to optimize for the best possible scenarios.
Today, that means designing algorithms and systems that not only meet the performance demands of the latest AI models but can also scale up to handle the enormous data set sizes that they require. Achieving cutting-edge performance and scale for modern AI workloads is a challenging task, to be sure, but there are infrastructure configurations that make it possible.
The process starts with understanding the basic principles of how these systems operate, as well as why many existing hardware and software components aren’t particularly well-suited to those needs. The basic idea behind AI algorithms involves two key steps that can be thought of as pattern recognition and pattern matching. First, in the pattern recognition part of the process, a large set of labeled data (meaning a human being has defined what each item is) is fed into a mathematical model that calculates what the unique characteristics of that data are – a process called “training a model”. Those learnings are then codified into the model’s parameters. Second, in the pattern matching phase, a set of unlabeled data can be processed by this newly trained model and identified as having characteristics similar to what it discovered in step one. This second step is called inference.
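To make the two steps concrete, here is a minimal sketch in Python. It uses a nearest-centroid classifier, which is far simpler than the neural networks discussed in this post and was chosen only because it fits in a few lines. The function names `train` and `infer` and the toy data are illustrative assumptions, not part of any particular framework.

```python
from collections import defaultdict
from math import dist


def train(labeled_points):
    """Pattern recognition: learn one average point (centroid) per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in labeled_points:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    # The "model" is just the learned centroid for each label.
    return {
        label: tuple(total / counts[label] for total in totals)
        for label, totals in sums.items()
    }


def infer(model, features):
    """Pattern matching: return the label whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], features))


# Toy, made-up data: two features per example, labeled by a human.
training_data = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.8), "cat"),
    ((5.0, 5.0), "dog"),
    ((4.8, 5.2), "dog"),
]
model = train(training_data)
print(infer(model, (0.9, 1.1)))  # an unlabeled point, classified by the model
```

Real training jobs repeat this kind of pass over vastly larger datasets, which is where the storage demands discussed later come from.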
Most of the latest generation of AI efforts are being enabled by what are called Transformer or Foundation models. They are somewhat similar to other AI-based algorithms in that they are machine learning-based neural networks that generate an output based on an input request. What makes them different is their ability to learn the context and meaning of what they’re processing through a mathematical process called attention or self-attention, which weighs how strongly each element of an input relates to every other element. Basically, it means that instead of every data point used to train the model having to be labeled in advance by a human being, the algorithm can start to generate some of this context on its own. As a result, the training process can be faster and, potentially, more comprehensive. To be clear, it’s not actual “intelligence” or awareness of what it’s doing, but if it’s done well, it can be very human-like – hence the impressive capabilities of something like ChatGPT.
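For readers who want to see what self-attention looks like in practice, the following is a rough sketch of scaled dot-product self-attention in plain Python. It is heavily simplified: a real Transformer applies learned weight matrices to produce separate query, key and value vectors, and runs many attention heads on GPUs. Here the assumption is that queries, keys and values are all just the input vectors themselves, and the function names and sample values are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Simplifying assumption: no learned projection matrices.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score this token against every token, including itself.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted blend of all token vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, tokens)) for i in range(dim)]
        )
    return outputs, weights


# Three made-up two-dimensional "token" vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token draws on every other token. Every token is compared against every other token, which is part of why these models are so compute and data hungry.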
In order to do all this mathematical work, you need hardware systems that have massive computing power and fast access to huge amounts of data. Large language models, in particular, are very performance sensitive because enabling real-time conversations requires a huge number of calculations to be done on the fly. On the compute side, high-powered GPUs continue to evolve and improve to meet the growing demands from these AI models. In addition, we’ve seen the introduction of dedicated AI accelerators from a number of companies, including traditional chipmakers and cloud companies. Like GPUs, these new AI accelerators are designed to perform a large number of calculations simultaneously – a critical requirement for these demanding algorithms. DDN’s A³I is the only storage system that can keep GPUs running at 100% utilization – and keeping the GPUs fed means more output in a shorter period of time.
Just as important as the massive compute power is fast access to huge amounts of data. Unlike traditional workloads that are powered by server CPUs, however, these AI-focused workloads and the specialized chips they’re running on place very different demands on storage systems. Existing enterprise storage systems, for example, can handle working with very large files or sets of data, but they are limited to single-lane serial connections into memory. AI algorithms, on the other hand, typically need multi-lane connections that deliver the data in parallel to the different compute engines within the GPUs and AI accelerators, because of the types of calculations they perform.
Another challenge is that the network filesystems (such as NFS) that are used by traditional large storage systems are only designed to deliver data serially – in part because that’s how CPU-focused server systems operate. What’s needed is a parallel filesystem and hardware controllers within the storage system that can quickly respond to these simultaneous data requests. Specialized storage appliances from companies like DDN – which created the EXAScaler parallel filesystem (based on the Lustre open source filesystem) – tackle this problem by integrating all the critical elements into a simple, rack-mountable device. Not only does this enable a performance-optimized connection between a single compute system and storage, it also lays the groundwork for scaling these connections out. These types of connections allow for higher parallel throughput and better scaling for additional capacity as more compute nodes and storage appliances are connected together. Plus, an added bonus for those leveraging high-performance GPUs is that DDN’s storage system supports a technology called GPUDirect Storage (GDS). GDS enables a direct path from storage to the high-bandwidth memory located on the GPU, which allows the GPUs to be used to their maximum potential and increases performance in the process.
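The difference between serial and parallel data delivery can be illustrated with a small Python sketch. To be clear, this is not how EXAScaler, Lustre or GPUDirect Storage are programmed, and it will not show meaningful speedups on a laptop disk. It only contrasts the two access patterns: one reader walking through a file chunk by chunk versus several readers requesting different chunks at the same time. The function names, chunk size and worker count are arbitrary assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, length):
    """Read one byte range of the file through its own file handle."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


def read_serial(path, chunk_size):
    """One reader fetches every chunk, one after another."""
    size = os.path.getsize(path)
    return b"".join(
        read_chunk(path, offset, chunk_size) for offset in range(0, size, chunk_size)
    )


def read_parallel(path, chunk_size, workers=4):
    """Several readers request different chunks at the same time."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in offset order, so the chunks reassemble cleanly.
        chunks = pool.map(lambda offset: read_chunk(path, offset, chunk_size), offsets)
        return b"".join(chunks)


if __name__ == "__main__":
    # Write a small throwaway file, then read it back both ways.
    payload = bytes(range(256)) * 1024
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        assert read_serial(tmp.name, 4096) == payload
        assert read_parallel(tmp.name, 4096) == payload
    finally:
        os.remove(tmp.name)
```

With a conventional storage backend, the parallel readers largely end up queuing behind one another. A parallel filesystem is designed so that those simultaneous requests can be served by multiple storage targets at once.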
Generative AI tools like ChatGPT and others that are likely still to come are inspiring renewed interest in AI applications across many fields. As these exciting tools demonstrate, properly tuned AI algorithms and the systems running them can enable impressive capabilities that are capturing the imagination of the general public. Putting together these systems in ways that offer the performance and scale these latest algorithms demand isn’t always straightforward. However, with appropriate planning and the right kind of data-first strategies, including AI storage, the possibilities and potential are fascinating.
Bob O’Donnell is the president and chief analyst of TECHnalysis Research, LLC, a market research firm that provides strategic consulting and market research services to the technology industry and professional financial community. You can follow him on Twitter @bobodtech.