Mapping Data Lifecycles for Generative AI Applications

By Bob O’Donnell

As fascination with generative AI continues to grow, there’s been an increasing focus on the tools needed to build, manage and run the transformer-based AI models that underlie it. While much of the initial attention has been directed toward the enormous arrays of GPU-equipped servers that power model creation, there’s a growing recognition that sophisticated data storage and management tools are also critical to success. An obvious but often overlooked factor is that one of the key enabling technologies for the rush of new AI applications is speedy access to far larger amounts of data than previous AI applications required.

Key shifts in operational workflows and data management for AI:

  1. The workflows needed to run generative AI are immensely demanding and put a massive strain on networks and systems.
  2. Generative AI is more organizationally pervasive, and the organizations building the models are generally not experts in supercomputing. The systems must be simpler and lower risk.

In order to build things like the large language models (LLMs) at the heart of OpenAI’s ChatGPT and Google’s Bard, or the image-generation models behind Midjourney and DALL-E, companies need to compile datasets measured in petabytes, run analysis tasks on that data to build out the parameters the model will include, re-analyze those results multiple times to reinforce the quality of the model output, and save the result as a trained generative AI model.

Steps in building large language models:

  1. Assemble reference datasets for training
  2. Select the best transformer model
  3. Run model training tasks on the dataset
  4. Iterate multiple times to fine-tune model parameters
  5. Save as a trained generative AI model

After that’s complete, live data must be run against the model to generate new output via inferencing. Along the way, there also needs to be a process to archive the input data, the output data the model produces, and any metadata associated with that inferencing work. The bottom line: it’s a long, sophisticated set of data-related tasks. Getting the best possible performance and the lowest possible latency requires thoughtful planning and powerful tools to manage and interconnect the various aspects of this data lifecycle.

STEP 1: Planning

Initially, companies may be tempted to use independent silos of different storage devices and procedures for these various steps, partly because few, if any, existing enterprise-focused storage systems are designed to handle this level of throughput. In the end, however, the most efficient way to handle these tasks is through careful planning that considers the specific demands of each step, combined with a fast, powerful storage solution optimized for these types of workloads. Proper planning can help companies avoid the tedious, time-consuming task of copying data from one silo to another in order to complete each step. Plus, it allows for better data consistency across the entire project.
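To get a rough feel for why silo-to-silo copying is so costly at this scale, a back-of-the-envelope calculation helps. The short Python sketch below divides dataset size by sustained throughput. The 1 PB dataset and 10 GB/s transfer rate are purely illustrative assumptions, not measurements of any particular system.

```python
def copy_time_hours(dataset_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to move a dataset at a given sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_bytes / throughput_bytes_per_s / 3600


PB = 10**15  # one petabyte (decimal)
GB = 10**9   # one gigabyte (decimal)

# Hypothetical figures: a 1 PB dataset copied between silos at 10 GB/s.
hours = copy_time_hours(1 * PB, 10 * GB)
print(f"{hours:.1f} hours per copy")  # 27.8 hours per copy
```

Under those assumptions, every hand-off between silos costs more than a full day, and the cost repeats each time the data moves to the next stage.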

STEP 2: Data Preparation, Model Training

At a high level, thought has to be given to what will be needed for the model training process as well as the inferencing process. Training is not simply a matter of collecting data into memory and then storing it on disk. Prepping the raw data for ingestion into the model framework could involve multiple steps including formatting, filtering, labeling, data validation and more, most of which require fast, interactive access to the dataset. Once the learning process begins, it’s best to make regular backups or checkpoints of the model as it starts to build out the parameters that will define its operation. As part of the reinforcement learning process, selected datasets will need to be re-run through the model so it can make further refinements, which places an additional demand on storage systems.
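The preparation and checkpointing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular training framework does it: the function names, the length-based filter and label rules, and the JSON checkpoint format are all assumptions chosen to keep the example self-contained.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator


def format_record(raw: dict) -> dict:
    """Formatting: collapse runs of whitespace in the text field."""
    return {**raw, "text": " ".join(str(raw.get("text", "")).split())}


def is_valid(record: dict, min_chars: int = 20) -> bool:
    """Filtering/validation: drop records too short to be useful (assumed rule)."""
    return len(record["text"]) >= min_chars


def label_record(record: dict) -> dict:
    """Labeling: attach a placeholder length-based label (assumed rule)."""
    return {**record, "label": "long" if len(record["text"]) > 200 else "short"}


def prepare(records: Iterable[dict]) -> Iterator[dict]:
    """Run each raw record through format -> validate -> label."""
    for raw in records:
        record = format_record(raw)
        if is_valid(record):
            yield label_record(record)


def save_checkpoint(state: dict, step: int, directory: str) -> str:
    """Write a checkpoint to a temp file, then rename it into place,
    so an interrupted write never leaves a partial checkpoint behind."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"checkpoint-{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)
    return final_path
```

Even in this toy form, the storage pattern is visible: preparation reads every record at least once, and each checkpoint is a full write of the model state that must reach disk before training can safely continue.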

STEP 3: Inference & Production

Once the model is created, the process shifts to inferencing, which is more operationally focused. Here, companies will need to be prepared to run a 24/7 data analysis workload in which new input data is run into the model and it, in turn, generates new output. To ensure that the model works as effectively as it can, incoming data should go through the same type of filtering and data preparation that was used to prep the raw data for its initial ingestion into the model. Both the input that enters the model and the output it generates should then be archived, for several reasons. First, ongoing analysis of the input and output data sets can provide continued feedback to and refinement of the model. In addition, leaders need to think ahead to potential auditing and compliance requirements for their models. For those types of tasks, an audit data trail will be absolutely essential.
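One way to picture the combination of consistent preprocessing and an audit trail is a thin wrapper around each inference call. The Python sketch below is only illustrative: `preprocess`, `audited_inference`, the log fields and the JSON Lines format are hypothetical choices, and the "model" is any callable that maps a string to a string.

```python
import hashlib
import json
import time
from typing import Callable, TextIO


def preprocess(text: str) -> str:
    """Stand-in for the same preparation applied to the training data."""
    return " ".join(text.split()).lower()


def audited_inference(model: Callable[[str], str], raw_input: str,
                      log: TextIO, model_version: str) -> str:
    """Prepare the input, run the model, and append one audit record."""
    prepared = preprocess(raw_input)
    output = model(prepared)
    entry = {
        "timestamp": time.time(),
        "model_version": model_version,
        "input": prepared,
        "output": output,
        # The hash lets an auditor check the archived input was not altered.
        "input_sha256": hashlib.sha256(prepared.encode("utf-8")).hexdigest(),
    }
    log.write(json.dumps(entry) + "\n")  # one JSON object per line
    return output
```

In use, `log` would typically be a file opened in append mode on whatever storage holds the archive. Because every request writes a record, the archive grows continuously for as long as the model is in production, which is part of why the storage side needs to be planned up front.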

STEP 4: Data Management Strategy for Generative AI

Collectively, these steps demand a well-architected data strategy and data flow pipeline to ensure smooth, low-latency operation for both AI training and inference. In addition, each of the steps (and sub-steps) has different performance requirements, so a one-size-fits-all approach typically won’t work. Instead, organizations need to think through these various demands and piece together a system that can ebb and flow through the various steps as needs arise. It’s also important to think about future demands that may be placed on these systems, such as logging decision points related to explainability, so the information is on hand if a future audit demands it.

Putting together an AI storage system that is prepared for all the various factors that generative AI – training AI models and running inference against them – demands isn’t easy. Toss in the need to plan for other potential future requirements and the task gets even tougher. Nevertheless, with thoughtful planning about the various steps involved, the performance specs that they require and, most importantly, how all these elements need to work together, it’s possible to create a solid solution.

Critically, companies need to get past the thought that several independent systems can be pieced together to help with these various steps. Though that might seem easier at first glance, the highly iterative, highly connected, and constantly evolving nature of these generative AI workloads really demands a single solution that can link these various stages of the AI data lifecycle together. Plus, the scale, capacity and throughput requirements of these different steps go well beyond the capabilities of most existing enterprise storage solutions, which weren’t designed with these kinds of requirements in mind. As with most things in life, it’s essential to find the right tools for the job in order to complete it as effectively and efficiently as possible.

Bob O’Donnell is the president and chief analyst of TECHnalysis Research, LLC, a market research firm that provides strategic consulting and market research services to the technology industry and professional financial community. You can follow him on Twitter @bobodtech.
