If you’ve ever seen published numbers for how much data is stored in enterprise data warehouses (EDWs), they are usually shockingly small. That is because high storage costs make it difficult for most enterprises to keep an EDW larger than the low tens of terabytes. The economics of old-fashioned enterprise storage are so skewed against the biggest of Big Data that many major enterprises feel they have to make sacrifices.

As a result, many enterprises don’t keep all the raw data originating from their operational systems. Instead, they select a carefully crafted subset or sample of the raw data via the ETL process. Unfortunately, this subset is not entirely useful. Let me explain.

ETL stands for Extract, Transform and Load. The enterprise realizes it can’t afford to keep all its data, so it looks for ways to shrink it. This could include grouping the data into broad categories that help with analysis, or simply taking a random sample of the data, much in the same way that some pollsters use a sample of 2,300 people to predict how 125 million Americans will vote. In addition, the extracted data is loaded into a predefined, rarely flexible schema, designed by a data analyst around the fields required to answer a specific question from the EDW.

This inflexible schema approach offers little to no freedom in which questions you can ask of your data to derive actionable insight. This leads to the common perception that an EDW query is primarily for verifying the answer to a known question, as opposed to true data discovery, which is about figuring out the right question to ask at a given moment in time. Schemas are also tied to structured data. Only in very rare cases can they deal with unstructured data, and even then it takes extra manipulation to get the desired results.
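To make the schema point concrete, here is a minimal Python sketch. Everything in it is hypothetical and only for illustration: the record fields, the `WAREHOUSE_FIELDS` schema, and the helper names are my assumptions, not part of any real EDW or DDN product. It contrasts loading into a fixed schema with parsing the raw records at query time.

```python
import json

# Hypothetical raw records as they might arrive from an operational system.
RAW_LOG = [
    '{"user": "a", "amount": 12.5, "device": "mobile"}',
    '{"user": "b", "amount": 40.0, "device": "desktop"}',
    '{"user": "c", "amount": 7.25, "device": "mobile"}',
]

# Assumed warehouse schema: only the fields needed for the original question.
WAREHOUSE_FIELDS = ("user", "amount")


def load_with_fixed_schema(raw_lines, fields=WAREHOUSE_FIELDS):
    """Keep only the predefined fields; everything else is discarded."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: record.get(field) for field in fields})
    return rows


def query_raw(raw_lines, field):
    """Parse the raw records at query time, so any field can be asked about."""
    return [json.loads(line).get(field) for line in raw_lines]


warehouse_rows = load_with_fixed_schema(RAW_LOG)
print(warehouse_rows[0])             # the "device" field is gone
print(query_raw(RAW_LOG, "device"))  # still answerable from the raw data
```

A new question such as "which devices do our customers use?" cannot be answered from `warehouse_rows`, because the field was dropped at load time. It can still be answered from the raw records.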

In addition, regardless of the algorithms you run, results derived from a subset of the data are generally not as accurate as results derived from the full raw data.
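A rough way to see this for rare signals is the small simulation below. It is only a sketch under assumptions I made up for illustration: 100,000 events, 20 rare ones, a 1% random sample, and hypothetical function names. Counting over the full data is exact, while an estimate scaled up from the sample can only move in steps of 100, so it cannot land on the true count of 20.

```python
import random


def make_events(n_events, n_rare, seed=0):
    """Build a list of booleans where exactly n_rare entries are True."""
    rng = random.Random(seed)
    rare_positions = set(rng.sample(range(n_events), n_rare))
    return [i in rare_positions for i in range(n_events)]


def count_rare(events):
    """Exact count computed over all of the raw data."""
    return sum(events)


def estimate_from_sample(events, sample_size, seed=1):
    """Estimate the rare-event count by scaling up a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(events, sample_size)
    return sum(sample) * len(events) / sample_size


events = make_events(n_events=100_000, n_rare=20)
print("exact count from raw data:", count_rare(events))
print("estimate from 1% sample:  ", estimate_from_sample(events, 1_000))
```

Each rare event that happens to fall in the 1% sample counts as 100 events in the estimate, so the sample either misses the signal entirely or overstates it.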

These aspects of the ETL process can yield results so inaccurate that I prefer to offer a better definition for this process: Extract, Torture and Lose.

By extract, torture and lose, I mean ETL is essentially a destructive process because what you’re left with is an incomplete representation of the true data. What you can compute will only be as good as the data you have. You extract the data, you torture it, and then you lose your signal. And as a very smart analytics expert once pointed out to me, the more you torture your data, the more likely it is to tell you what you want to hear.

Transform ETL into FENL: Fast, Extract and Never Lose

DDN believes in the “power of raw.” We believe that Big Data means Big Noise, but success in uncovering insights won’t come from asking the same old questions of the same old models, drawing only on the exclusive club of structured data. We stand fast against data torture.

More importantly, we believe that by using the “schema-less” approach of Hadoop HDFS to store and archive all raw data, both structured and unstructured, we can transform ETL into FENL: Fast, Extract and Never Lose. This will empower the enterprise to become an actionable, insight-driven enterprise.

This is why hScaler, our enterprise-class Hadoop appliance, is not just about DDN being yet another Hadoop player. It’s about bringing to our customers an analytics infrastructure appliance built on our DNA of performance and scale.

To be continued…

  • DDN Storage
  • Date: March 12, 2013