As businesses are eager to become more information-driven in a world where the amount and variety of data at hand are rising dauntingly while information technology budgets are at best flat, a new data storage concept has emerged: the “Datalake”.
One of the leading drivers behind the genesis of Datalakes is simply cost: because Datalakes are generally implemented using the Apache Hadoop file system running on commodity hardware (or sometimes purpose-built, low-cost “big data” arrays), they enable massive amounts of information to be stored at a far more economical price point than could be achieved with traditional IT storage hardware.
While some of the information written about Datalakes, and the ensuing heated response, makes for entertaining and sometimes informative moments, it is my belief that besides cost, the truly important benefit that Datalakes bring to the “information powered enterprise” is missed or at best overshadowed. I would characterize this benefit as “high quality actionable insights”.
To better understand why I strongly believe this, one needs to understand the role ETL has played in information management, business intelligence and analytics.
Those who are kind enough to follow me on social media or hear me speak at various information management industry events know that I often say that “ETL is evil and the number one contributor to poor analytical results,” and that ETL does not stand for Extract, Transform & Load but for Extract, Torture & Lose. That is the core of the issue.
- Extract: Pulling raw operational data out of databases (structured data) or file systems (unstructured data) in order to put it in a data warehouse for further analysis.
- Transform: “Massaging” the raw data in order for it to fit an a priori defined data model, generally a database schema (structured data) or a file system with a predefined directory hierarchy (unstructured data).
- Load: Storing the output of the Transform phase into the data warehouse or file system.
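To make the three phases concrete, here is a minimal, hypothetical sketch of an ETL pipeline in Python. The table, field, and file names (`sales`, `id`, `amount`) are purely illustrative, not taken from any real system; note how the transform step already hints at the problem discussed below, silently discarding any field the target model does not know about.

```python
# A minimal, hypothetical sketch of the three ETL phases.
# All table, field, and file names are illustrative.
import csv
import sqlite3

def extract(path):
    """Extract: pull raw operational records out of a CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: force each record into a predefined target model.
    Any field the model does not know about is silently discarded."""
    return [(r.get("id"), r.get("amount")) for r in rows]

def load(rows, db_path):
    """Load: store the transformed rows in the warehouse schema."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, amount TEXT)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    con.close()
```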
This ETL process looks pretty reasonable on the surface, but the transformation and loading phases are in reality phases of information destruction (bad) and loss (worse). It is data torture and loss!
Let me attempt to explain why. Of note, while I will focus mostly on structured data in my explanation, most of it is applicable to unstructured data too.
Historical data starvation: The storage economics of data warehouses have traditionally been so poor (generally close to an order of magnitude above the cost of commodity or purpose-built storage) that they have led IT to always minimize the amount of information imported and retained from the operational systems. The direct result is analytics that perform poorly because they are starved of historical data.
For example: although a line of business might want to keep a year or more of, say, retail transactions, IT will only allow a quarter to be loaded and retained. When the first month of the following quarter is loaded, the first month of the previously stored quarter is erased. This form of economics-driven censorship means that any insight extracted from the data at hand will have no long-term backward-looking trend or out-of-norm event detection capability.
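The rolling-window retention described above can be sketched in a few lines; the three-month limit and the month labels are hypothetical, chosen only to mirror the example:

```python
# Hypothetical sketch of the economics-driven retention window described
# above: a fixed three-month budget, where loading a new month silently
# evicts the oldest. Month labels are illustrative.
from collections import deque

RETENTION_MONTHS = 3  # all IT can afford, not all the business wants

# A bounded deque evicts from the left once it is full.
window = deque(maxlen=RETENTION_MONTHS)

for month in ["2014-01", "2014-02", "2014-03", "2014-04"]:
    window.append(month)  # loading April erases January

print(list(window))  # only the most recent quarter survives
```

Any long-term trend that depended on the evicted months is gone before the first query is ever run.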
Data destruction: A schema-based data store has many issues, such as rigidity, complexity and poor versatility. The mere fact that one has to know ahead of time which piece of incoming data should go into which field of the schema means that no variation or exception in the incoming data is handled well. It also assumes that the business has figured out the entire set of questions it wants to ask of its data; an assumption which, in the world of Big Data, is ludicrous.
If the source data is suddenly richer, (maybe because the application creating that data has been enhanced or updated), this enrichment is likely to be “dropped on the floor” if the schema has not been updated; an unfortunate but very common case due to the inherent slowness and complexity of change management.
Storage economics constraints also lead to data destruction; often more than half of the incoming data is not incorporated in the schema because its meaning is not yet understood at the time of schema definition. If it is not understood, it is omitted, because IT cannot afford to hold and keep data it does not know what to do with, as it would occupy precious storage real estate.
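Both failure modes above come down to the same mechanism: a schema fixed once, then applied to richer data. A hedged illustration, with invented field names:

```python
# Hypothetical illustration: the warehouse schema was defined once and is
# rarely updated, so when the source application is enhanced and starts
# emitting richer records, the new fields are dropped on the floor.
# All field names are illustrative.

SCHEMA = ("customer_id", "amount")  # decided a priori, at design time

def fit_to_schema(record):
    """Keep only the fields the schema knew about on day one."""
    return {k: record[k] for k in SCHEMA if k in record}

# The upgraded source system now also emits loyalty and referrer data.
enriched = {"customer_id": "42", "amount": "19.99",
            "loyalty_tier": "gold", "referrer": "email"}

print(fit_to_schema(enriched))  # the enrichment never reaches the warehouse
```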
Ignorance-driven data transformation: Unfortunately, almost no IT personnel in charge of database infrastructure and management have a data science background. This lack of expertise has dramatically negative effects during the transformation phase of any ETL process. The most common error concerns records with missing data in a given field: more often than not the record is dropped entirely or, worse, the field is set to NULL, with no understanding that the very fact a value is missing is meaningful to a data scientist. The same can be said for arbitrary decisions on field content; I know the problem well! My own first name, “Jean-Luc,” is persona non grata within most databases, as they do not accept that the dash is part of my first name. Most of the time it is dropped and the name concatenated as “Jeanluc” or, worse, the dash is assumed to be a middle-name separator between “Jean” & “Luc”.
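Both errors are easy to reproduce. The sketch below is a hypothetical "cleaning" routine of the kind described above, not any real product's code; the rules it applies are invented but representative:

```python
# Hypothetical sketch of two common "transformation" errors described
# above. The cleaning rules are invented but representative.

def naive_clean(record, required=("first_name", "amount")):
    """A well-meaning but destructive transform step."""
    # Error 1: drop any record with a missing required field, destroying
    # the very fact that the value was missing.
    if any(record.get(f) in (None, "") for f in required):
        return None  # record silently dropped
    # Error 2: strip characters the schema "does not accept",
    # mangling legitimate hyphenated names in the process.
    record["first_name"] = record["first_name"].replace("-", "")
    return record

print(naive_clean({"first_name": "Jean-Luc", "amount": "10"}))
# the hyphen is stripped: "Jean-Luc" comes back as "JeanLuc"
print(naive_clean({"first_name": "", "amount": "5"}))
# a record that merely had a gap vanishes entirely: None
```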
The list of torture instruments that are used against source data would make a Spanish inquisitor pale with envy!
The bottom line is that the traditional ETL process leads to incomplete or inaccurate data on which analytics will be run, which in turn derive inexact actionable insights. This issue is greatly amplified by the arrival of Big Data with its variety, volume and velocity.
No respectable data scientist would feel confident of obtaining the best analytics when working with reduced, sampled or transformed data whose transformation is not under his or her purview. Give them raw data!
HDFS to the rescue and the true value of Datalakes
As mentioned earlier, Datalakes are, for the majority of deployments, built on HDFS. While increasingly popular as a foundation store for Hadoop-based infrastructures, HDFS is in itself no universal panacea for all information storage problems, but it does have one interesting characteristic: it is a “schema-less data store”. Unlike traditional stores that force the data to fit a predefined schema, HDFS allows initial raw data extraction and persistence without any prior knowledge of how, or which piece of, the data the data scientist will use. Because HDFS can be deployed on commodity storage or Big Data-aware storage whose economics are far superior to those of data warehouse storage, it is an almost ideal raw data storage and archive infrastructure. It does not force ETL between the source data and the persistence repository, and it puts the right ETL process in the hands of the data scientists.
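The schema-less, "schema-on-read" idea can be sketched as follows. A local file stands in for an HDFS path here, and the record and field names are invented; the point is only that the raw extract is persisted untouched and each analysis chooses its own view at read time:

```python
# Hypothetical sketch of schema-on-read: land the raw extract as-is
# (JSON lines on a local path standing in for HDFS), then let each
# data scientist project out the fields they care about at read time.
# Paths and field names are illustrative.
import json

def persist_raw(records, path):
    """Persist raw records untouched -- no schema, no transformation."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

def read_with_view(path, fields):
    """Schema-on-read: each analysis picks its own projection."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return [{k: r.get(k) for k in fields} for r in rows]
```

Because nothing was discarded at load time, a richer record (an extra field, an unexpected variation) costs nothing to keep, and a new question next quarter simply means a new projection over the same raw data.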
As someone once said, “The more one tortures data, the more it will say what one wants to hear.” This paradigm must be broken as pre-supposed answers are of no value to any business.
Big Datalakes, in their current form, should probably be renamed “data swamps”. For the data scientist community charged with extracting more and more insights out of Datalakes, dredging that “swamp” over and over, with the freedom to use all or intelligently selected parts of the raw data, is key. This in turn helps data scientists discover which new and more powerful questions should be asked of the data, and find better answers. Let’s embrace Datalakes to put an end to the data torture!