Apache HBase

Thursday August 08, 2013

Taking the Bait

Lars Hofhansl, Andrew Purtell, and Michael Stack

HBase Committers

Information Week recently published an article titled “Will HBase Dominate NoSQL?”. Michael Hausenblas of MapR argues the ‘For’ HBase case and Jonathan Ellis of Apache Cassandra and vendor DataStax argues ‘Against’.

It is easy to dismiss this 'debate' as vendor sales talk and just get back to using and improving Apache HBase, but this article is a particularly troubling example. Both the ‘For’ and ‘Against’ arguments slight the work of the HBase community. Here are some notes by way of redress:

First, Michael argues that Hadoop is growing fast and that, because HBase came out of Hadoop and is tightly associated with it, ergo, HBase is on the up and up. It is easy to make this assumption if you are not an active participant in the HBase community (we have also come across the inverse, where HBase is driving Hadoop adoption). Michael then switches to a tired HBase versus Cassandra bake-off, rehashing the old consistency versus eventual-consistency wars, ultimately concluding that HBase is ‘better’ simply because Facebook, the Cassandra wellspring, dropped it to use HBase instead as the datastore for some of their large apps. We would not make that kind of argument. Whether one should use Apache HBase or Apache Cassandra depends on a number of factors and is, as with most matters of scale, too involved a discussion for sound bites.

Then Michael does a bait-and-switch where he says “..we’ve created a ‘next version’ of enterprise HBase.... We brought it into GA under the label M7 in May 2013”. The ‘We’ in the quote does not refer to the Apache HBase community but to Michael’s employer, MapR Technologies, and the ‘enterprise HBase’ he is talking of is not Apache HBase. M7 is a proprietary product that, to the best of our knowledge, is fundamentally different architecturally from Apache HBase. We cannot say more because of the closed-source nature of the product. This strikes us as an attempt to attach the credit and good will that the Apache HBase community has built up over the years, through hard work and contributions, to a commercial closed-source product that is NOT Apache HBase.

Now let us address Jonathan’s “Against” argument. Some of Jonathan’s claims are no longer true, such as “RegionServer failover takes 10 to 15 minutes” (see HDFS-3703 and HDFS-3912), and others are highly subjective, such as “Developing against HBase is painful.” (In our opinion, our client API is simpler and easier to use than the commonly used Cassandra client libraries.) Otherwise, we find nothing here that has not been hashed and rehashed over the years in forums ranging from mailing lists to Quora. Jonathan is mostly listing out provenance: what comes of our being tightly coupled to Apache Hadoop’s HDFS filesystem and our tracing the outline of the Google BigTable architecture. HBase derives many benefits from HDFS and inspiration from BigTable. As a result, we excel at some use cases that would be problematic for Cassandra. The reverse is also true. Neither we nor HDFS are standing still as the hardware used by our users evolves, or as new users bring experience from new use cases.

He highlights a difference in approach.

We see our adoption of Apache HDFS, our integration with Apache Hadoop MapReduce, and our use of Apache ZooKeeper for coordination as a sound separation of concerns. HBase is built on battle-tested components. This is a feature, not a bug.

Where Jonathan might see a built-in query language and secondary indexing facility as necessary complications to core Cassandra, we encourage and support projects like Salesforce’s Phoenix as part of a larger ecosystem centered around Apache HBase. The Phoenix guys are able to bring domain expertise to that part of the problem while we (HBase) can focus on providing a stable and performant storage engine core. Part of what has made Apache Hadoop so successful is its ecosystem of supporting and enriching projects - an ecosystem that includes HBase. An ecosystem like that is developing around HBase.

When Jonathan veers off to talk of the HBase community being “fragmented” with divided “[l]eadership”, we think perhaps what is being referred to is the fact that the Apache HBase project is not an “owned” project, a project led by a single vendor. Rather, it is a project where multiple teams from many organizations - Cloudera, Hortonworks, Salesforce, Facebook, Yahoo, Intel, Twitter, and Taobao, to name a few - are all pushing the project forward in collaboration. Most of the Apache HBase community participates in a shared effort on two branches: one for what is about to be our new 0.96 release, and another for our current stable release, 0.94. Facebook also maintains their own branch of Apache HBase in our shared source control repository. This branch has been a source of inspiration and occasional direct contribution to other branches. We make no apologies for finding a convenient and effective collaborative model according to the needs and wishes of our community. To other projects driven by a single vendor this may seem suboptimal or even chaotic (“fragmented”).

We’ll leave it at that.

As always, we welcome your participation in and contributions to our community and the Apache HBase project. We have a great product and a no-ego, low-friction decision-making process not beholden to any particular commercial concern. If you have an itch that needs scratching and wonder if Apache HBase is a solution, or if you are using Apache HBase but feel it could work better for you, come talk to us.


