
Lustre has emerged as a leading parallel file system supporting high-performance workflows and applications. Lustre was designed with one core assumption in mind: extracting the maximum performance from rotational devices through a parallel file system implementation. For years, this strategy pushed application I/O performance to its peak, making it possible to solve computational problems that were previously intractable.

With the adoption of new non-volatile memory express (NVMe) devices, the broader HPC storage community has started to investigate how to leverage SSDs and NVMe devices to get the most out of traditional parallel file system implementations. A wide range of implementations have recently been unveiled, including DDN's transparent cache layer, IME®: a burst buffer implementation that sits between the parallel file system namespace and the compute clients, providing extremely fast access and transparent cache management.

Although flash-based solutions such as burst buffers are the current trend and are probably seeing the fastest and broadest market adoption, some fundamental problems still depend on traditional ways of managing data, especially in the HPC arena. Several factors contribute to this, including the fact that complex workloads still rely on long-established I/O techniques that take time to be ported or adapted to run on a non-POSIX file system.

When SSD devices erupted onto the storage market some time ago, the question that came to mind was how to leverage them quickly. Many ideas have since been turned into projects at different levels: IME, SFX, and L2RC.

Design Principles

The design goal of L2RC is quite simple. We wanted to define a transparent cache layer that would be persistent, hardware agnostic, programmable to some degree, and managed by Lustre itself. This cache should allow customers to pin objects to a read cache pool and keep them there until they are manually purged. A write cache could potentially be implemented in the future.

To make it flexible, L2RC should not need to distinguish the type of media configured as the cache layer, making it possible to use SSDs, NVMe devices, or even standard disks (although with little additional benefit) as persistent cache. In the future, L2RC may even be implemented on NVDIMM technologies. Being deployable on any type of device naturally makes it hardware agnostic; however, the way each storage vendor exposes its media varies drastically, which in turn impacts performance and reliability. Ultimately, users should be able to define explicitly which objects, or which object characteristics, should be cached.
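
To give a flavor of how such explicit hints might be expressed, the sketch below reuses Lustre's real `lfs ladvise` interface (available since Lustre 2.9) to tell the servers that a file will be read soon. Note that the wiring of this advice into an L2RC-style read cache is an assumption for illustration, not a documented L2RC interface, and the helper name `prefetch_hint` is hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical helper that hints Lustre to prefetch a file's objects.

Assumes `lfs ladvise` (Lustre >= 2.9) is available on the client and
that an L2RC-style read cache would honor the `willread` advice; that
integration is an assumption, not a documented interface.
"""
import os
import subprocess
import sys


def prefetch_hint(path: str) -> None:
    """Advise the servers that the whole file will be read soon."""
    size = os.stat(path).st_size
    subprocess.run(
        ["lfs", "ladvise", "--advice", "willread",
         "--start", "0", "--end", str(size), path],
        check=True,
    )


if __name__ == "__main__":
    for f in sys.argv[1:]:
        prefetch_hint(f)
```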

We also considered the question, “Why not simply deploy an SSD-based OST? Wouldn't that achieve the same thing?” We determined that it would not. Most notably, directing data placement between SSD and HDD OSTs would be extremely complex on the data management side, requiring more than the capabilities provided by the de facto standard Lustre policy engine, Robinhood.

Implementation

The current implementation focuses on a read-only cache. A write cache may be supported in the future, once the read cache implementation is complete.

The cache is extensible, meaning it can be as large as the aggregate capacity of the SSDs behind it. Caching can be configured manually for a given set of objects, but it becomes even more interesting when combined with features that automate the workflow. For example, File Heat (to be discussed separately) is a framework that tracks in memory how frequently each object is accessed. Combined with a policy engine, the File Heat map could be used to prefetch objects into, or evict them from, the L2RC cache, as sketched below.
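
Here is a minimal sketch of such a heat-driven policy loop. The heat report format (one `<path> <heat>` pair per line), the thresholds, and the mapping of the real `lfs ladvise` `willread`/`dontneed` advice onto L2RC prefetch and eviction are all assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Minimal sketch of a heat-driven cache policy (illustrative only).

Assumes a hypothetical heat report with one "<path> <heat>" pair per
line; the thresholds below and the use of willread/dontneed as stand-ins
for L2RC prefetch/eviction are assumptions, not documented behavior.
"""
import os
import subprocess
import sys

HOT_THRESHOLD = 100.0   # heat above which we prefetch (assumed units)
COLD_THRESHOLD = 10.0   # heat below which we evict


def ladvise(path: str, advice: str) -> None:
    """Send one piece of advice covering the whole extent of a file."""
    size = os.stat(path).st_size
    subprocess.run(
        ["lfs", "ladvise", "--advice", advice,
         "--start", "0", "--end", str(size), path],
        check=True,
    )


def apply_policy(heat_report: str) -> None:
    """Walk the '<path> <heat>' report and issue prefetch/evict hints."""
    with open(heat_report) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            path, heat_str = line.rsplit(maxsplit=1)
            heat = float(heat_str)
            if heat >= HOT_THRESHOLD:
                ladvise(path, "willread")   # candidate for the read cache
            elif heat <= COLD_THRESHOLD:
                ladvise(path, "dontneed")   # candidate for eviction


if __name__ == "__main__":
    apply_policy(sys.argv[1])
```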

Use Cases

Many use cases justify this type of caching mechanism. With the first development phase focused on read-only caching, applications such as reverse time migration (RTM) in the oil and gas industry, simulations based on Monte Carlo methods, and write once, read many (WORM) workloads are just a few examples that could benefit from a global read-only cache.

Performance Preview

Preliminary benchmarks comparing HDD-based OSTs against OSTs with L2RC show impressive results. L2RC improves read performance by up to 10x across a range of I/O sizes once the workload is served from the cache layer. The plot below shows the effectiveness of L2RC as the block size varies for a given workload; we ran the same benchmark on purely rotational OSTs and on OSTs with L2RC cache.

[Figure: read performance by I/O size, HDD-only OSTs vs. OSTs with L2RC]

The initial benchmarks show significant improvements for I/O sizes between 2KB and 16KB. Beyond 16KB, performance flattens out but remains significantly higher than on OSTs without L2RC cache.

Another interesting comparison is against SSD OSTs. L2RC-based OSTs deliver almost the same performance as pure SSD OSTs, with the advantage of being completely transparent to applications and not requiring data movement from SSD OSTs to rotational-disk OSTs.

[Figure: 4KB random read, OSTs with L2RC vs. pure SSD OSTs]

L2RC will be available in future Lustre versions, most likely beginning with 2.10.x. DDN will push all patches upstream, probably in 2018, so that the Lustre community has access once the feature is ready for distribution.

Carlos Aoki Thomaz