In HPC, it is rare to get real-time, broad end-user feedback on whether the solutions you are building address users’ top issues. With this in mind, I would like to share some of the more significant trends among our current users and potential customers in HPC, trends we keep in mind when thinking about how to solve the emerging problems of large-scale data. Hopefully you will add some to the list, I will come up with more, and we can continue the discussion in blogs to come.
- Even the compute guys agree – the number one issue in compute is storage I/O
- Site-wide file systems make more and more sense for multi-cluster sites
- Everyone has SSD caching layers now – they have to be smart
Trend 1 – The number one issue in compute is Storage I/O
OK, you are reading a storage blog, so you are probably a storage zealot already, but did you know the compute folks now agree with what you have been saying all along?
In the Intersect360 Research report “HPC Forecast: Vertical Markets and Economic Sectors”, which is updated every year, I was thrilled to read: “Another common thread spans all categories of Big Data applications: In a broad-based, end-user survey, both HPC-oriented and non-HPC respondents combined to rate I/O performance metrics… as the top technical challenges for big data workflows”. So, why am I so gleeful to have storage recognized as the problem child? Because necessity is the mother of invention. The more people who recognize this as a serious problem, the more eyes we get on the different potential approaches to sustainable and cost-effective solutions, and this is good for all of us.
Trend 2 – Site-wide file systems make more and more sense for the multi-cluster site
Whether you’re trying to improve the performance, reliability and availability of research data, like the Advanced Computing Center for Research and Education (ACCRE) at Vanderbilt, or to accommodate the peak bandwidth of all systems on site with one storage resource instead of five or six, like TACC or NERSC, a site-wide file system can make a lot of sense.
The basic idea here is that sites with multiple compute clusters are looking for alternatives to buying, installing and administering storage for each cluster. They’re moving to one central storage pool with enough performance to handle the peak load of the top application on their fastest system. With this level of capability, the other systems can also be served by the same storage resource, provided that it handles mixed I/O really well, which brings us to trend 3.
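As a rough illustration of the sizing argument above, the sketch below compares buying storage per cluster with buying one shared pool sized to the fastest system's peak. The cluster names and bandwidth figures are made up for the example, and the comparison assumes the clusters do not all hit their peaks at the same moment, which a real site would need to verify.

```python
# Hypothetical peak bandwidth demand per cluster, in GB/s (illustrative only).
cluster_peak_gbs = {"cluster_a": 120.0, "cluster_b": 40.0, "cluster_c": 15.0}


def per_cluster_provisioning(peaks):
    """Total bandwidth bought when each cluster gets its own storage,
    each sized to that cluster's own peak."""
    return sum(peaks.values())


def site_wide_provisioning(peaks):
    """Bandwidth of one shared pool sized to the peak of the fastest system.
    Assumes the clusters' peaks do not coincide."""
    return max(peaks.values())


separate = per_cluster_provisioning(cluster_peak_gbs)
shared = site_wide_provisioning(cluster_peak_gbs)

print(f"storage systems to administer: {len(cluster_peak_gbs)} vs 1")
print(f"bandwidth to buy: {separate} GB/s vs {shared} GB/s")
```

The gap between the two numbers is only real if the shared pool copes with the mixed I/O of all the clusters at once, which is the caveat the paragraph above ends on.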
Trend 3 – SSD caching layers have to be Smart
Now that basically every player in HPC has an SSD tier story in storage, let’s take a look at what it takes to have a successful one. Cache matters in HPC because, as much as everyone likes to think they have lots of large I/O, the truth is that for many large HPC sites over half their volume is actually pretty small I/O. Add to this their tendency to have lots of concurrent processes and users, and most of the workload starts to look and act like small I/O. So, you get the picture: small I/O is hard, caching is good, etc.
So, it is no longer enough to have some bolted-on write cache. People want their cache to work hard on real-world problems every day AND to be able to take advantage of SSDs as a way to expand caching. What if your SSD layer could talk to your storage cache and just be an extension of it?
What if your SSD caching layer knew what your applications were up to and could anticipate what data they needed? These things are happening today: check out ReACT and SFX from DDN. What’s not quite here yet (though there are technology previews) are the even cooler things that can be done with a caching layer if you look a little further into the future. For example, how about a caching layer that sits between compute and storage, offloading I/O compute cycles and aligning data before it is written to spinning media? An approach like this can give back a big percentage of the compute power currently wasted on I/O overhead, and get a lot more performance out of existing storage.
In the next blog on storage I/O trends: storage is too big to fail, but also too big to back up. Big data strategies need to take an innovative look at how to replace backup with something more big-data-y. We will also cover any other trends you want to discuss.
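To make the "aligning data before it is written" idea concrete, here is a minimal toy sketch of a staging buffer that absorbs small sequential writes and hands the backing store full, aligned blocks. It is not how ReACT, SFX or any DDN product works; the class name `CoalescingBuffer`, the `sink` callback and the 1024-byte block size are all assumptions made for illustration.

```python
class CoalescingBuffer:
    """Toy staging layer: collects small appended writes and forwards them
    to the backing store as full blocks at block-aligned offsets."""

    def __init__(self, block_size, sink):
        self.block_size = block_size
        self.sink = sink        # callable(offset, data); stands in for the disk tier
        self.buffer = bytearray()
        self.next_offset = 0    # backing-store offset of the first buffered byte

    def write(self, data):
        """Absorb a write of any size; emit a block each time one fills up."""
        self.buffer.extend(data)
        while len(self.buffer) >= self.block_size:
            block = bytes(self.buffer[:self.block_size])
            del self.buffer[:self.block_size]
            self.sink(self.next_offset, block)
            self.next_offset += self.block_size

    def flush(self):
        """Push out whatever partial block is left at the end of the stream."""
        if self.buffer:
            self.sink(self.next_offset, bytes(self.buffer))
            self.next_offset += len(self.buffer)
            self.buffer.clear()


# Record what the backing store would see: (offset, length) per write.
backend_calls = []
staging = CoalescingBuffer(
    block_size=1024,
    sink=lambda offset, data: backend_calls.append((offset, len(data))),
)

for _ in range(10):
    staging.write(b"x" * 300)   # ten small 300-byte application writes
staging.flush()

print(backend_calls)
```

Ten small application writes reach the backing store as three larger ones, two of them full aligned blocks. A real caching layer would also have to handle random offsets, reads of data that is still buffered, and failure recovery, none of which this sketch attempts.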