DDN’s burst buffer technology, known as IME® (Infinite Memory Engine), garnered a lot of attention at SC16 and in the press in the past few weeks, and I wanted to take a few minutes to discuss why. New customers, benchmarks, live demos, and the recently announced availability of IME Version 1.0 helped generate interest, but the timing is also right for burst buffers.
Burst buffers are not a new idea, but working implementations are – and they are evolving to address performance needs that weren’t even identified, let alone well understood, until very recently.
We push computing faster and data bigger to achieve specific goals – to model complex systems and predict outcomes with more accuracy – to deliver better answers, faster. As we push, we break things – we break our own tools, and they need to be changed, worked around, or discarded to solve the next level of problem. The rate of increase in computational power is the engine driving these changes, and the speed of that rate of change dictates how fast we need to innovate all the pieces around it to keep it fed.
Burst buffers were originally conceived to absorb peak data creation from large, single computers as a cheaper alternative to buying enough traditional storage to meet peak needs. Since then, they have evolved, adding capabilities to address new needs that have sprung up as real computational power has grown.
Infinite Memory Engine (IME) is DDN’s burst buffer. It is also a software-defined application-aware I/O accelerator that manages a tier of fast storage – currently NVMe – that resides between compute nodes and primary storage.
This layer of fast storage is cost effective in absorbing a huge amount of I/O, so sites running a single, large system or multiple, large systems don’t have to overspend on traditional storage to get enough performance to absorb the site’s I/O peaks.
Once the data is in the burst buffer, however, it still needs to get into persistent storage. This challenge occurs because these sites rely on parallel file systems whose performance degrades rapidly when they are asked to handle a huge number of small requests all at once.
Large sites turn to parallel file systems to get more performance from more nodes when compared to the use of traditional file systems. The problem is that the same POSIX rules that apply to traditional systems also govern the use of parallel file systems, and this generates latency.
For example, there must be a locking capability that uses a number of steps to make sure that one process does not overwrite the data for another process. However, whenever multiple processes are accessing the same data, a latency buildup results.
This contention becomes a major problem when working with file systems at the exascale level or beyond the current performance levels possible with standard and parallel file systems. We have demonstrated at a number of conferences that when you throw a lot of fast, small I/O at a parallel file system, you end up reducing performance to a very small percentage of its peak capability because the system is dealing with all the requested operations.
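To see why small requests hurt so much, consider a deliberately simple cost model: every request pays a fixed overhead (locking, metadata, network round trips) on top of the time needed to move its bytes at peak bandwidth. The sketch below is only an illustration. The function names, the 1 GB/s peak, and the 1 ms per-request overhead are assumptions chosen for the example, not measurements of any real file system.

```python
def effective_bandwidth(request_bytes: float,
                        peak_bytes_per_s: float,
                        per_request_overhead_s: float) -> float:
    """Bytes per second achieved when every request pays a fixed overhead.

    Toy model: time per request = fixed overhead + bytes / peak bandwidth.
    """
    transfer_time_s = request_bytes / peak_bytes_per_s
    return request_bytes / (per_request_overhead_s + transfer_time_s)


def fraction_of_peak(request_bytes: float,
                     peak_bytes_per_s: float,
                     per_request_overhead_s: float) -> float:
    """Effective bandwidth expressed as a fraction of peak (0.0 to 1.0)."""
    achieved = effective_bandwidth(request_bytes, peak_bytes_per_s,
                                   per_request_overhead_s)
    return achieved / peak_bytes_per_s


if __name__ == "__main__":
    PEAK = 1e9        # assumed peak bandwidth: 1 GB/s (illustrative)
    OVERHEAD = 1e-3   # assumed fixed cost per request: 1 ms (illustrative)
    for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
        share = fraction_of_peak(size, PEAK, OVERHEAD)
        print(f"{size:>10} byte requests -> {share:.1%} of peak")
```

Under these assumed numbers, 4 KiB requests reach well under one percent of peak while 64 MiB requests come close to it, which is the shape of the degradation described above.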
IME helps out in this situation. Data from the compute side is written anywhere within the very fast data tier. The burst buffer holds the data and aligns it so it can be written to the parallel file system in the most efficient way, ensuring that the file system receives large, well-formed I/O rather than a lot of little requests that thrash the file system and slow its performance. This results in much higher performance from the parallel file system when working behind IME instead of directly attached to compute.
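The general idea of turning many small writes into a few large, aligned ones can be sketched in a few lines. This is not IME’s actual algorithm, which the text above does not describe in detail. It is a minimal illustration that assumes writes arrive as (offset, data) pairs, that later writes win on overlap, and that the backing file system prefers extents that do not cross a fixed stripe boundary. The name `coalesce_writes` and the `stripe_size` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

Write = Tuple[int, bytes]  # (file offset, payload)


def coalesce_writes(writes: List[Write], stripe_size: int) -> List[Write]:
    """Merge small writes into contiguous extents split on stripe boundaries.

    Writes are replayed in arrival order, so a later write overwrites an
    earlier one where they overlap. The result is sorted by offset.
    """
    if stripe_size <= 0:
        raise ValueError("stripe_size must be positive")

    # Buffer every byte by absolute offset (toy stand-in for the fast tier).
    byte_map: Dict[int, int] = {}
    for offset, data in writes:
        for i, value in enumerate(data):
            byte_map[offset + i] = value

    extents: List[Write] = []
    start = None
    buf = bytearray()
    for pos in sorted(byte_map):
        contiguous = start is not None and pos == start + len(buf)
        at_stripe_start = pos % stripe_size == 0
        # Start a new extent on a gap or when a new stripe begins.
        if not contiguous or at_stripe_start:
            if buf:
                extents.append((start, bytes(buf)))
            start = pos
            buf = bytearray()
        buf.append(byte_map[pos])
    if buf:
        extents.append((start, bytes(buf)))
    return extents


if __name__ == "__main__":
    small_writes = [(8, b"cc"), (0, b"aaaa"), (4, b"bbbb"),
                    (10, b"dd"), (20, b"ee")]
    for offset, data in coalesce_writes(small_writes, stripe_size=8):
        print(offset, data)
```

In this toy run, five out-of-order writes become three extents, each contained within a single 8-byte stripe. A real burst buffer would also have to handle durability, concurrency, and much larger stripes.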
Using this technique, writes to a parallel file system such as Lustre* or IBM’s Spectrum Scale File System (formerly GPFS) can be anywhere from 10 to 1000 times faster than on a system without IME. The same approach can be applied to other interfaces like MPI-IO and FUSE today, and to other types of storage, such as object storage, in the future.
The performance gains using the IME technology allow the burst buffer to serve as an application optimizer tool. It can also be used as a “core extender” to achieve maximum performance from data sets that are too large to fit in the computer cluster’s main memory. In this scenario, data is accessed from the burst buffer rather than primary storage – the burst buffer is used to accelerate both reads and writes.
As systems continue to grow and the data sets they are working with get larger, IME’s core extender capabilities will become increasingly important. Supercomputers now exist that can process massive amounts of data for a single model. To get the kind of granularity we want for modeling fundamental physical systems, researchers will increasingly find they are using data sets too large to fit in system memory.
We will want to compute a lot of the data together so we can become more precise in the problem resolution. It will take too much time to batch through data sets in smaller pieces, but we also do not want to incur the latency of going out to spinning media for every I/O the problem requires. Instead, we can pull the data into the burst buffer and compute on it from this much faster media, which sits much closer to the processor. This use case, while not currently the largest use case for IME, will quickly gain popularity.
The use cases for burst buffers with extended capabilities like IME are growing beyond traditional HPC. Workloads such as reverse time migration in upstream oil and gas, financial services backtesting, close-to-real-time analytics, and large-scale operations workflows have I/O characteristics that can benefit from IME.
Competitively, IME has a pretty thin field to run against. Only a few burst buffer projects exist worldwide, and fewer than a handful are ready for consideration by production sites. IME’s broader feature set is certainly an advantage. In addition, IME is the only burst buffer that’s independent – not tied to any specific server or even storage manufacturer’s gear. Earlier versions of IME were tested by Oak Ridge National Laboratory, the Texas Advanced Computing Center (TACC), and ICHEC, to name a few, and results from many early tests have been presented at SC, ISC, Cray User Group, Rice Oil & Gas, and many specialized conferences. At SC16 two weeks ago, JCAHPC announced the fastest computer in Japan and published their IME results from that system. Weighing in at #6 in the Top500 and the Green500, Oakforest-PACS leverages ~2.5 racks of DDN IME14K equipment for 1.56 TB/s data access. As a point of comparison, other TB/s+ file systems require 10 to several hundred times the gear.
More major sites will declare their IME usage this year with some truly spectacular performance numbers. In general, the burst buffer will continue to gain its place as a standard layer within HPC during the next several years, and it has already started to cross into the commercial side of computing. No other solutions exist or are on the horizon that fill the needs that the burst buffer addresses, needs that will only increase over time.