NLP and AI: The Role of Data Storage in Powering Generative AI Tools
Rarely has a tech phenomenon caught on quite as quickly as the new batch of generative AI tools such as Dall-E and ChatGPT. Seemingly overnight, these fascinating new applications are being used by a wide range of people for a diverse set of tasks—from original content creation to email and report writing, software development, content research and much more.
The text-based nature of ChatGPT, in particular, has proven applicable to many situations thanks to its impressive level of language understanding. Its ability to synthesize content, and seemingly even knowledge, from the simplest of inputs has given it an aura of magical capability. Of course, it's really just sophisticated math crunching on enormous data sets that enables the output ChatGPT generates, but the results are remarkable, nonetheless.
ChatGPT and the algorithms behind it are part of a larger area of AI research known as Natural Language Processing, or NLP. The goal of NLP is to create tools that can understand the context and meaning of not only individual words but phrases, sentences, and even entire paragraphs. With that capability, a properly trained AI model using NLP principles can respond to a common language-based request (in either text or audio form) with a cogent, intelligently organized and, hopefully, accurate reply. Most importantly, it can build its response from an enormous base of information, theoretically incorporating as much knowledge as is practically possible.
To achieve this, the machine learning algorithms that enable these capabilities first have to be trained by having large numbers of documents and other information sources (websites, books, reports, etc.) fed into them. From a compute and systems perspective, that's a big task: it requires huge amounts of data storage for AI, large numbers of compute engines, and speedy, efficient connections between them all.
At a basic operational level, the training process entails analyzing all these various documents, breaking them into component parts, discovering common patterns across these elements, and then developing mathematical models to follow those patterns. In addition, because these models are built in an iterative manner, they need to be able to learn and extend the model as more data becomes available, or when we want to develop a language model for a specialized vocabulary such as scientific research or finance.
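To make the "break documents into parts, discover patterns, build a model" idea concrete, here is a deliberately tiny sketch in Python. It uses whole-word tokens and bigram counts purely for illustration; real LLMs use subword tokenizers and neural networks, not frequency tables.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Break a document into lowercase word tokens -- a toy stand-in
    # for the subword tokenizers that real language models use.
    return text.lower().split()

def build_bigram_model(documents):
    # Count which token tends to follow which: a minimal example of
    # "discovering common patterns" across a corpus. More documents
    # can be fed in later, so the model extends iteratively.
    counts = defaultdict(Counter)
    for doc in documents:
        tokens = tokenize(doc)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def most_likely_next(model, word):
    # Predict the follower of `word` seen most often during training.
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None

docs = [
    "the model reads the data",
    "the model learns patterns from the data",
    "the model predicts the next token",
]
model = build_bigram_model(docs)
print(most_likely_next(model, "the"))  # prints "model"
```

The same shape scales up: swap word counts for billions of learned parameters and the corpus for petabytes of text, and the storage and compute demands described below follow directly.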
Large language models like the ones powering ChatGPT need to read and process their source datasets from storage into memory many times in order to build up their knowledge and understanding of nuances and meanings. The process may take weeks or months as the billions of parameters are tuned to refine the model. And as a critical part of the process, they will need to save checkpoints of the parameters along the way.
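The checkpointing idea can be sketched in a few lines. The example below uses Python's standard `pickle` module and a dictionary of toy parameters; production training frameworks use their own checkpoint formats, but the principle of periodically persisting the full parameter state plus the training step is the same.

```python
import os
import pickle
import tempfile

def save_checkpoint(params, step, path):
    # Persist the full parameter state plus the training step so a
    # multi-week run can resume after an interruption.
    with open(path, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy "model": in a real LLM this dict would hold billions of floats,
# so each save_checkpoint call becomes a massive storage write.
params = {"layer0.weight": [0.1, -0.2], "layer0.bias": [0.0]}
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt_step100.pkl")
save_checkpoint(params, 100, ckpt_path)

restored = load_checkpoint(ckpt_path)
print(restored["step"])  # prints 100
```

Because every checkpoint writes the entire parameter state, checkpoint frequency is a direct trade-off between storage write bandwidth consumed and training time lost if a run fails.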
In practical terms, these types of data processing workloads for AI training put very specific demands on the storage systems (and compute engines) upon which they run. First, of course, you need enormous amounts of storage capacity—typically measured in petabytes—to work with all this information. You need to be able to read the source datasets at very high speeds, and write out parameter checkpoints as quickly as possible. And with billions of parameters being loaded and stored at every stage, you cannot afford congestion on I/O links, and you want to ensure there are no bottlenecks along the data path in either direction.
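A quick back-of-the-envelope calculation shows why checkpoint writes stress storage. The parameter count below is GPT-3-scale for illustration, and the aggregate write bandwidth is an assumed figure, not a measured one:

```python
# Illustrative, assumed figures -- not measurements of any real system.
n_params = 175e9              # parameters (GPT-3-scale, for illustration)
bytes_per_param = 4           # 32-bit floating point
checkpoint_bytes = n_params * bytes_per_param   # total checkpoint size

write_bandwidth = 50e9        # assumed aggregate write rate, bytes/sec
seconds_per_checkpoint = checkpoint_bytes / write_bandwidth

print(f"Checkpoint size: {checkpoint_bytes / 1e9:.0f} GB")        # 700 GB
print(f"Write time:      {seconds_per_checkpoint:.0f} seconds")   # 14 seconds
```

Even at tens of gigabytes per second of sustained write bandwidth, each checkpoint takes on the order of seconds to minutes, and during that window the compute engines may be stalled, which is why checkpoint write speed matters so much.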
Translated into specifics, that means you need the fastest possible parallel storage system, matched to high-speed, multi-rail networking and immense compute arrays. Speedy Solid State Drives (SSDs) are essential in these types of applications. It is possible to combine them with traditional spinning hard drives, which currently offer the highest capacities and the cheapest price per bit, but new SSD technologies are continuing to become more competitive.
DDN has combined these various capabilities into storage systems that have been specifically designed with high-performance flash storage, optimized for these types of AI workloads. They also offer hybrid SSD/HDD systems including the concept of Hot Pools, which can avoid the inefficiencies of “tiered” data storage and move data between SSD and HDD transparently. The intelligent storage controllers and parallel file system technologies that DDN integrates into its products, in particular, are what make them uniquely suited for these types of efforts.
As important as model training is for NLP applications, however, it's only half the story. The other half involves inferencing, where the trained model reacts to an input or request that someone types into it and then generates an output. Not surprisingly, the requirements and system demands for this type of workload are different than for training. For inferencing, the emphasis is much more on the read performance of an AI storage system, as it needs to be able to apply the model with billions of stored model parameters to come up with the best response. In addition, because of the way these models work, multiple parallel compute engines each need simultaneous access to the same set of data. Once again, this means that for the best possible performance you need a storage system that offers parallel data delivery paths, a characteristic that DDN has specifically designed into its AI storage systems.
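The "many readers, one dataset" pattern can be illustrated with memory-mapped files, a common operating-system mechanism for sharing read-only data. The sketch below is a toy: it writes a tiny "parameter file" and has several worker threads each map and read from it concurrently. Real inference serving stacks are far more elaborate, but the shared, read-heavy access pattern is the same.

```python
import mmap
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Write a small "parameter file" (8 little-endian doubles) to disk.
# In a real deployment this would be a multi-hundred-gigabyte model file.
path = os.path.join(tempfile.mkdtemp(), "params.bin")
values = [float(i) for i in range(8)]
with open(path, "wb") as f:
    f.write(struct.pack("<8d", *values))

def read_param(index):
    # Each worker maps the same read-only file and pulls one value.
    # The OS page cache lets many readers share a single copy of the
    # data -- the essence of parallel, read-optimized delivery.
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (v,) = struct.unpack_from("<d", m, index * 8)
        return v

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(read_param, range(8)))
print(results)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

When the readers are separate machines rather than threads, the page cache no longer helps, and the storage system itself has to provide those parallel delivery paths.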
The capabilities of NLP-focused tools such as ChatGPT have shone a light on how far AI algorithms and software have come over the last few years. In fact, they are a great example of science fiction author Arthur C. Clarke’s famous quote that “any sufficiently advanced technology is indistinguishable from magic.” Behind their magic, however, is actually an enormous amount of computation, clever software and moving data. Making those pieces come together isn’t easy, but with the right tools appropriately designed for the task, it’s now clearly possible.
Bob O’Donnell is the president and chief analyst of TECHnalysis Research, LLC a market research firm that provides strategic consulting and market research services to the technology industry and professional financial community. You can follow him on Twitter @bobodtech.