Intersect360 recently updated its HPC Budget Allocation Map and HPC User Site Survey for Storage, and subsequently wrote a paper for DDN titled “Searching and Researching: DDN Solution for Life Sciences”. As I was reading these and other recent research, I was reminded that some things in HPC don’t change:

  • About $0.50 is spent on storage for every $1 on servers, and
  • DDN is still #1 in HPC storage, with at least twice as many end users identifying DDN as their primary storage as identify EMC, NetApp, Hitachi or Panasas

And then some things do: “…respondents … rate I/O performance metrics (bandwidth, latency, and IOPS throughput) as the top technical challenges for Big Data workflows. In addition, both groups rated I/O performance as the top “satisfaction gap” for storage – the area of the largest disconnect between importance and satisfaction with their current solutions.”

When DDN surveys our own customers – mostly the ‘storage guys’ at HPC sites – we are not terribly surprised to see storage issues at the top of the list, but when the ‘server guys’ are saying it too, well… this is new, and it is time to take a fresh look at how the contributing factors are changing.

A great example of why storage performance is now the top concern for HPC sites is what is happening in Genomics.  Genomics is probably the poster child for big data, with a rate of increase in data per site that vastly outstrips other major disciplines.

Even ‘smaller’ sequencing operations with only a few people procuring and managing all hardware – from scientific devices to servers to storage – have petabytes of data, and are often looking at data doubling or even tripling within a year. Some of the larger sites have similar growth rates, and have to work every day to solve data access, management and publication issues – often building the new infrastructures they need as they go.

(source: DDN / Sanger Case Study, 2014)
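To put the doubling and tripling rates mentioned above in perspective, here is a minimal sketch of how capacity compounds year over year. The 1 PB starting point and the three-year horizon are assumptions chosen purely for illustration, not figures from the survey or the case study, and `project_capacity` is a hypothetical helper name.

```python
def project_capacity(start_pb, growth_factor, years):
    """Return projected capacity in PB for year 0 through `years`.

    growth_factor is the yearly multiplier: 2.0 for doubling, 3.0 for tripling.
    """
    return [start_pb * growth_factor ** year for year in range(years + 1)]


# Assumed starting point: 1 PB today, projected over three years.
doubling = project_capacity(1.0, 2.0, 3)
tripling = project_capacity(1.0, 3.0, 3)

for year, (d, t) in enumerate(zip(doubling, tripling)):
    print(f"year {year}: doubling {d:.0f} PB, tripling {t:.0f} PB")
```

Even under these simple assumptions, a site that triples every year ends up with more than three times the data of a site that merely doubles after three years, which is why small teams can find themselves managing very large estates quickly.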

So, what’s the problem? Getting the data in and out of the computer. Compute performance is improving at a much greater rate than the performance of storage media. This means a growing performance gap that needs to be addressed by ever-cleverer approaches. Increasing memory sizes on the server side solves some of the problem, but where big data is concerned, it’s not enough. For problems too big to fit in memory, a massively parallel storage infrastructure and file system is actually faster than breaking a problem down and batching it through memory. Combine this approach with intelligent caching and you arrive at a platform that can feed today’s high performance clusters.
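As a rough, back-of-the-envelope illustration of why parallelism matters here, the sketch below estimates how long it takes to stream a dataset when the load is spread across several storage targets. It assumes ideal linear scaling, which real systems only approximate, and the dataset size, per-target bandwidth and target counts are hypothetical values, not measurements of any DDN product.

```python
def transfer_hours(dataset_tb, per_target_gb_s, targets):
    """Estimate hours needed to stream a dataset across parallel storage targets.

    Assumes ideal linear scaling: aggregate bandwidth = per-target bandwidth * targets.
    Uses decimal units (1 TB = 1000 GB).
    """
    aggregate_gb_s = per_target_gb_s * targets
    seconds = dataset_tb * 1000 / aggregate_gb_s
    return seconds / 3600


# Hypothetical example: a 100 TB dataset and 1 GB/s per storage target.
for targets in (1, 10, 20):
    hours = transfer_hours(100, 1.0, targets)
    print(f"{targets:>2} targets: {hours:.2f} hours")
```

The model ignores caching, metadata overhead and contention, so treat it as an upper bound on the benefit rather than a prediction.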

The Intersect360 paper is a straightforward representation of big data problems in Life Sciences. It incorporates a good deal of supporting data from extensive, vendor-neutral industry survey efforts and is also a darn good read, so check it out.

  • Laura Shepard