The world and its mother has gone cock-a-hoop for Hadoop, and although interest is waning in the original Map/Reduce compute model, the ecosystem is nicely morphing and evolving from batch into a streaming and real-time architecture.
As for HDFS – the storage side of Hadoop – most recently we’ve seen EMC announce that they plan to layer it over their ATMOS product line, which, in my opinion, is certainly the right thing to do. And, as I mentioned a while ago (at the initial NGOSS meeting in November 2012), we are doing the same with HDFS on WOS.
HDFS is first and foremost an API, and only then an implementation. An object store is an ideal implementation model for HDFS, and it is optimal only when it sits on top of a real object store rather than a pseudo one.
Let me elaborate. There are numerous products that expose an ‘object’ semantic but are implemented on top of a traditional file system such as EXT4. This is nothing more than lipstick (the API) on a pig (a traditional file system), as none of the major benefits of object storage, such as storage efficiency and namespace scalability, will be achieved.
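To make the API-versus-implementation split concrete, here is a minimal Python sketch (all class and method names here are hypothetical, invented purely for illustration): the same flat put/get API can sit on a genuinely flat key space, or it can be faked over a traditional file system – and client code cannot tell the difference from the API alone, which is exactly why the API says nothing about whether the benefits underneath are real.

```python
import os
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """A minimal object-storage API: flat namespace, whole-object put/get."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class FlatKeyStore(ObjectStore):
    """Stand-in for a 'true' object store: one flat key space with no
    per-object inode, directory-entry, or block-allocation machinery."""
    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

class PosixBackedStore(ObjectStore):
    """The 'lipstick on a pig' pattern: the same API, but every object
    becomes a file on a traditional file system, inheriting that file
    system's per-file metadata overhead for each small object."""
    def __init__(self, root: str) -> None:
        self._root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        with open(os.path.join(self._root, key), "wb") as f:
            f.write(data)

    def get(self, key: str) -> bytes:
        with open(os.path.join(self._root, key), "rb") as f:
            return f.read()

def roundtrip(store: ObjectStore) -> bytes:
    # Client code is identical either way: the API alone does not
    # reveal which implementation sits underneath it.
    store.put("sensor-000001", b"22.4C")
    return store.get("sensor-000001")
```

Both backends satisfy the interface equally well; the efficiency and scalability differences only show up in how the backend stores the objects, not in the API the client programs against.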
Traditional file systems such as EXT4 have very poor efficiency: for every unit of storage you buy, up to 40% of that unit can be consumed by the file system’s own internal mechanics. In effect, we are paying a “file system metadata tax” of up to 400GB for every TB we buy; and, in a world of big data dominated by lots of small files, such as sensor or social data, that is an expensive tax.
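As a back-of-the-envelope check on those figures (a toy calculation only – real overhead varies with file size and file-system tuning, and the 40% rate is the worst case quoted above, not a measurement of any particular deployment):

```python
def usable_capacity_gb(purchased_tb: float, overhead: float = 0.40) -> float:
    """Capacity left for user data after file-system overhead is deducted.

    The 0.40 default mirrors the 'up to 40%' worst case in the text;
    it is an illustrative assumption, not a measured constant.
    """
    return purchased_tb * 1000 * (1 - overhead)

# 1 TB purchased at 40% overhead: 400 GB lost to the metadata tax,
# 600 GB left for actual data.
print(usable_capacity_gb(1))  # 600.0
```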
A true object store, in my opinion, is characterized by efficiency, scalability, performance and resiliency. It’s a bit like four strings on a violin. If your strings are not correctly tuned, you won’t get the full benefit.
A ‘pretend’ object store on a traditional file system might give you high availability, but you will still have low performance and low efficiency; trying to build an object store on a traditional file system gives you an unbalanced system. The HDFS semantics of the future, implemented on a true object store such as our WOS product, deliver the best of both worlds.
So, while I commend EMC for putting HDFS across ATMOS, I do have concerns about recent statements from the company regarding the future of Hadoop. A couple of weeks ago at the Strata Conference, EMC Greenplum made a whole hoopla about how they are now “all in” on Hadoop, with 300 engineers all “committed” to it. I pray that the community will benefit from this R&D surge and that, as some have strongly stated, there will be no code “forking”.