DDN BLOG

For those of you who do not already know me, my name is Per Brashers, and neither name is spelled phonetically; more on that some other day! I am the Chief Architect for storage solutions at DDN, and I recently came here from overseeing storage at Facebook. On my DDN blog, I look forward to drilling into specific approaches to the scalability challenge.

As you may have seen, today we announced the DDN hScaler appliance. I wanted to take a minute with my first blog to make some observations about this announcement and what it means for customers.

A traditional Hadoop architecture, built on commodity compute with Direct Attached Storage (DAS), comes with a set of limitations and disadvantages that are particularly important to enterprise customers. DDN, however, has made optimizations that will benefit all customers.

Hadoop’s original hypothesis is “move the job and not the data.” This hypothesis relied on the fact that networks and centralized storage were significantly slower than local storage. But that is no longer true in all cases, especially when you’re talking about DDN. hScaler is built on top of the SFA12K-40, a storage array capable of providing 40 GB (yes, big B!) per second of sustained large-block throughput. Compare that to a commodity node that provides in the neighborhood of 100 MB/s of throughput to its devices, and simple arithmetic shows that converged storage from DDN changes the whole paradigm.
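To put that arithmetic on the page, here is a quick back-of-envelope calculation in Python. The throughput figures are the ones quoted above; decimal units (1 GB = 1000 MB) are assumed for simplicity:

```python
# Back-of-envelope comparison using the numbers from this post:
# an SFA12K-40 sustaining 40 GB/s vs. a commodity node's ~100 MB/s
# of local-disk (DAS) throughput.

SFA12K_THROUGHPUT_GBPS = 40   # GB/s, sustained large-block throughput
COMMODITY_NODE_MBPS = 100     # MB/s, typical per-node DAS throughput

node_equivalents = (SFA12K_THROUGHPUT_GBPS * 1000) / COMMODITY_NODE_MBPS
print(f"One SFA12K-40 ~ {node_equivalents:.0f} commodity nodes of disk throughput")
# -> One SFA12K-40 ~ 400 commodity nodes of disk throughput
```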

When networks ran in the 10 MB/s range and disks operated at around 80 MB/s, an obvious mismatch occurred. Since that time, the physics of drives has not allowed them to increase much in speed, but improvements in silicon have increased processor, memory, and networking speeds at a phenomenal rate. As 40 Gb/s and 56 Gb/s networks become ubiquitous and are designed to sustain continuous throughput, it is clear that the trend away from a shared-nothing approach will reduce overhead and waste and ease management burdens. Historically, similar consolidations have happened in other parts of IT with great success.
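Since network links are quoted in bits and disks in bytes (the “big B” point above), a quick conversion makes the reversal concrete. This is a simple nominal-rate sketch; real effective rates vary with encoding and protocol overhead:

```python
# Network links are quoted in Gb/s (bits); disks in MB/s (bytes).
# 8 bits = 1 byte, so divide by 8 to compare like with like.

def gbps_to_mbps_bytes(link_gbps: float) -> float:
    """Convert a nominal link rate in gigabits/s to megabytes/s."""
    return link_gbps * 1000 / 8

for link in (10, 40, 56):  # common Ethernet and InfiniBand FDR rates, Gb/s
    print(f"{link} Gb/s link ~ {gbps_to_mbps_bytes(link):.0f} MB/s")
# 10 Gb/s ~ 1250 MB/s, 40 Gb/s ~ 5000 MB/s, 56 Gb/s ~ 7000 MB/s,
# versus a single spinning disk at roughly 100-150 MB/s sequential.
```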

Hadoop in particular has had a sordid life in the high-end enterprise. This is partly due to an unmanageable deployment method, which causes IT departments to take a hands-off approach, and partly due to the fact that the majority of the work has had to be done by a data scientist. The latter issue drives two behaviors: the highly skilled data scientist has to become an IT support person in his or her spare time, and a typical Hadoop deployment must undergo a 6-month “investigation” phase while the data scientist works on both deriving business value and learning the ins and outs of becoming an IT support specialist. hScaler does away with this dual role, providing management tools for easy support by IT and thus freeing the data scientist to do the job he or she was hired for: inventing new ways to find the data that will aid the business.

The hScaler appliance also has built-in ETL tools to make data scientists more efficient at their jobs. Between the bundled ETL tools and the IT-friendly design, enterprises do not have to endure the typical 6-month delay before deriving value.

Another feature that will interest both the data scientist and senior leadership is the radically new idea of separating compute-only nodes from DataNodes. In its simplest form: a DataNode that is also responsible for compute work spends roughly 30% of its time on tasks that do not drive storage throughput. By handing those tasks to a ComputeNode, the total data processed per hour increases nearly in lockstep with each ComputeNode added. This modest change to the system design also frees the ComputeNode to perform much more complex transforms of the data in a parallel, scalable fashion. The ComputeNode may also be used as an extraction/translation node, running tasks such as in-memory databases, laying out transforms on columnar data, and even providing reporting engines for final output rendering. The possibilities are endless.
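As a rough illustration of why the offload helps, here is a toy model in Python. The 30% figure comes from the paragraph above; the per-node streaming rate is a placeholder, not a measured number:

```python
# Toy model of the DataNode/ComputeNode split described above.
# Assumption (illustrative, not a benchmark): a combined-role DataNode
# loses ~30% of its time to compute work that does not drive storage
# throughput; offloading that work to ComputeNodes frees it to stream.

DATANODE_PEAK_MBPS = 100   # hypothetical per-node streaming rate, MB/s
COMPUTE_OVERHEAD = 0.30    # fraction of time lost to non-storage tasks

def aggregate_throughput(num_datanodes: int, offloaded: bool) -> float:
    """Combined MB/s across DataNodes, with or without compute offload."""
    per_node = DATANODE_PEAK_MBPS * (1.0 if offloaded else 1.0 - COMPUTE_OVERHEAD)
    return num_datanodes * per_node

print(aggregate_throughput(10, offloaded=False))  # 700.0 MB/s, combined role
print(aggregate_throughput(10, offloaded=True))   # 1000.0 MB/s, with ComputeNodes
```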

It is also important to note that the networks used in hScaler are purpose-designed for low-latency access to storage. On the back end is an InfiniBand FDR network for DataNode access to storage devices, which also provides low-latency intra-DataNode communication. On the front end is a mixed 10/40 GbE network designed to give ComputeNodes massive throughput to DataNode resources. Zero-copy memory policies and other advanced networking techniques combine to save precious CPU cycles, leaving those resources free to do ‘real’ work.
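For readers unfamiliar with the zero-copy idea, here is a minimal, generic sketch using Python’s os.sendfile(), which lets the kernel move bytes from disk to a socket without staging them through user-space buffers. This illustrates the general technique only; it is not DDN’s networking stack:

```python
# Generic zero-copy illustration: file-to-socket transfer via
# os.sendfile(), which avoids copying data through user space and
# thereby frees CPU cycles, the same principle described above.
import os
import socket

def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Stream a file over a connected TCP socket with kernel-side copies."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            # The kernel moves bytes disk -> socket; no user-space buffer.
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent
```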

It all comes wrapped with DDN’s support experience and all the best practices that come with DDN’s years of success in the HPC market. This allows you to harvest the lessons learned in the labs for the betterment of your company. All in all, an exciting — and considerably less complex — way of using Hadoop to get the most from your enterprise data.

What are your thoughts on Hadoop? What are you looking for in a Hadoop solution? Please don’t hesitate to share below.
