Apache HBase

Friday July 14, 2017

HBASE APPLICATION ARCHETYPES REDUX (Part 1 of 2)

by Robert Yokota, HBase Contributor

(This post originally appeared on Robert's personal blog. It is reposted here as a two-parter. The second part can be found here.)

At Yammer, we’ve transitioned away from polyglot persistence to persistence consolidation. In a microservice architecture, the principle that each microservice should be responsible for its own data had led to a proliferation of different types of data stores at Yammer. This in turn led to multiple efforts to make sure that each data store could be easily used, monitored, operationalized, and maintained. In the end, we decided it would be more efficient, both architecturally and organizationally, to reduce the number of data store types in use at Yammer to as few as possible.

Today HBase is the primary data store for non-relational data at Yammer (we use PostgreSQL for relational data).  Microservices are still responsible for their own data, but the data is segregated by cluster boundaries or mechanisms within the data store itself (such as HBase namespaces or PostgreSQL schemas).

HBase was chosen for a number of reasons, including its performance, scalability, reliability, its support for strong consistency, and its ability to support a wide variety of data models.  At Yammer we have a number of services that rely on HBase for persistence in production:

  • Feedie, a feeds service
  • RoyalMail, an inbox service
  • Ocular, for tracking messages that a user has viewed
  • Streamie, for storing activity streams
  • Prankie, a ranking service with time-based decay
  • Authlog, for authorization audit trails
  • Spammie, for spam monitoring and blocking
  • Graphene, a generic graph modeling service

HBase is able to satisfy the persistence needs of several very different domains. Of course, there are some use cases for which HBase is not recommended, for example, when using raw HDFS would be more efficient, or when ad-hoc querying via SQL is preferred (although projects like Apache Phoenix can provide SQL on top of HBase).

Previously, Lars George and Jonathan Hsieh from Cloudera attempted to survey the most commonly occurring use cases for HBase, which they referred to as application archetypes.  In their presentation, they categorized archetypes as either “good”, “bad”, or “maybe” when used with HBase. Below I present an augmented listing of their “good” archetypes, along with pointers to projects that implement them.

ENTITY

The Entity archetype is the most natural of the archetypes.  HBase, being a wide column store, can represent the entity properties with individual columns.  Projects like Apache Gora and HEntityDB support this archetype.

              Column Family: default
Row Key       Column: <property 1 name>    Column: <property 2 name>
<entity ID>   <property 1 value>           <property 2 value>

Entities can also be stored in the same manner as with a key-value store.  In this case the entity would be serialized as a binary or JSON value in a single column.

              Column Family: default
Row Key       Column: body
<entity ID>   <entity blob>
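To make the two Entity layouts concrete, here is a minimal Python sketch. It models a table as a plain dictionary rather than using the HBase client API, and the names `put`, `store_as_columns`, `store_as_blob`, and the sample entity are all hypothetical, chosen only for illustration.

```python
import json

# Toy model of a table: row key -> {(family, qualifier): value}.
# This is a plain dict standing in for HBase, not the HBase client API.
table = {}

def put(row_key, family, qualifier, value):
    table.setdefault(row_key, {})[(family, qualifier)] = value

def store_as_columns(entity_id, entity):
    # One column per property; the column qualifier is the property name.
    for name, value in entity.items():
        put(entity_id, b"default", name.encode(), str(value).encode())

def store_as_blob(entity_id, entity):
    # The whole entity is serialized as JSON into a single "body" column.
    body = json.dumps(entity, sort_keys=True).encode()
    put(entity_id, b"default", b"body", body)

user = {"name": "Ada", "team": "compilers"}
store_as_columns(b"user:1", user)   # wide-column layout
store_as_blob(b"user:2", user)      # key-value layout
```

The column layout allows a single property to be read or updated on its own, while the blob layout keeps the row to one cell at the cost of rewriting the whole entity on each change.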

SORTED COLLECTION

The Sorted Collection archetype is a generalization of the original Messaging archetype that was presented.  In this archetype the entities are stored as binary or JSON values, with the column qualifier being the value of the sort key to use.  For example, in a messaging feed, the column qualifier would be a timestamp or a monotonically increasing counter of some sort.  The column qualifier can also be “inverted” (such as by subtracting a numeric ID from the maximum possible value) so that entities are stored in descending order.

                  Column Family: default
Row Key           Column: <sort key 1 value>    Column: <sort key 2 value>
<collection ID>   <entity 1 blob>               <entity 2 blob>
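The "inverted" qualifier is easiest to see with actual bytes. The sketch below assumes the sort key is a non-negative integer (such as a millisecond timestamp) encoded as 8 big-endian bytes, so that byte order matches numeric order. The helper names are hypothetical, and `sorted()` stands in for the byte-wise ordering HBase applies to column qualifiers within a row.

```python
import struct

MAX_LONG = 2**63 - 1  # largest signed 64-bit value

def ascending_qualifier(sort_key: int) -> bytes:
    # Big-endian encoding: for non-negative values, byte order == numeric order.
    return struct.pack(">q", sort_key)

def descending_qualifier(sort_key: int) -> bytes:
    # Subtracting from the maximum flips the order, so larger keys sort first.
    return struct.pack(">q", MAX_LONG - sort_key)

def decode_descending(qualifier: bytes) -> int:
    return MAX_LONG - struct.unpack(">q", qualifier)[0]

timestamps = [1500000000000, 1500000005000, 1500000001000]
row = {descending_qualifier(ts): f"message at {ts}".encode() for ts in timestamps}

# Reading the columns in qualifier order now yields the newest entry first.
newest_first = [decode_descending(q) for q in sorted(row)]
```

A fixed-width encoding matters here: variable-length or decimal-string keys would not sort numerically when compared byte by byte.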

Alternatively, each entity can be stored as a set of properties.  This is similar to how Cassandra implements CQL.  HEntityDB supports storing entity collections in this manner.

                  Column Family: default
Row Key           Column: <sort key 1 value + property 1 name>  Column: <sort key 1 value + property 2 name>  Column: <sort key 2 value + property 1 name>  Column: <sort key 2 value + property 2 name>
<collection ID>   <property 1 of entity 1>                      <property 2 of entity 1>                      <property 1 of entity 2>                      <property 2 of entity 2>
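One way to picture the composite qualifier is as a fixed-width sort key followed by the property name. The following sketch assumes an 8-byte big-endian sort key prefix; this encoding and the helper names are illustrative assumptions, not how HEntityDB or Cassandra actually lay out their bytes.

```python
import struct

def qualifier(sort_key: int, prop: str) -> bytes:
    # Fixed-width sort key prefix, then the property name.
    return struct.pack(">q", sort_key) + prop.encode()

def split_qualifier(q: bytes):
    return struct.unpack(">q", q[:8])[0], q[8:].decode()

row = {}  # toy model of one row: qualifier -> value

def add_entity(sort_key: int, props: dict):
    for name, value in props.items():
        row[qualifier(sort_key, name)] = value.encode()

def read_entities(row: dict):
    # Walk the columns in qualifier order and regroup them by sort key.
    entities = {}
    for q in sorted(row):
        key, name = split_qualifier(q)
        entities.setdefault(key, {})[name] = row[q].decode()
    return list(entities.items())

add_entity(2, {"author": "bob", "text": "second"})
add_entity(1, {"author": "alice", "text": "first"})
collection = read_entities(row)
```

Because the sort key is the qualifier prefix, all properties of one entity are adjacent in the row, and entities still come back in sort-key order regardless of insertion order.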

In order to access entities by some value other than the sort key, additional column families representing indices can be used.

                  Column Family: sorted                                   Column Family: index
Row Key           Column: <sort key 1 value>  Column: <sort key 2 value>  Column: <index 1 value>  Column: <index 2 value>
<collection ID>   <entity 1 blob>             <entity 2 blob>             <entity 1 blob>          <entity 2 blob>
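As a rough sketch of the index column family, the toy model below writes each entity blob twice within the same row: once under its sort key and once under an index value. The family names follow the table above; the `add` and `get_by_index` helpers and the message IDs are hypothetical.

```python
import struct

# Toy model of one row: family name -> {qualifier bytes: value}.
row = {"sorted": {}, "index": {}}

def add(sort_key: int, index_value: str, blob: bytes):
    # Both families are written together so they stay in step.
    row["sorted"][struct.pack(">q", sort_key)] = blob
    row["index"][index_value.encode()] = blob

def get_by_index(index_value: str):
    # Direct lookup by index value, with no scan over the sorted family.
    return row["index"].get(index_value.encode())

add(1, "msg-abc", b'{"id": "msg-abc", "text": "first"}')
add(2, "msg-xyz", b'{"id": "msg-xyz", "text": "second"}')

found = get_by_index("msg-xyz")
in_order = [row["sorted"][q] for q in sorted(row["sorted"])]
```

Since both families live in the same row, the two writes can go into a single row mutation, which HBase applies atomically.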

To prevent the collection from growing unbounded, a coprocessor can be used to trim the sorted collection during compactions.  If index column families are used, the coprocessor would also remove corresponding entries from the index column families when trimming the sorted collection.  At Yammer, both the Feedie and RoyalMail services use this technique.  Both services also use server-side filters for efficient pagination of the sorted collection during queries.
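The trimming and pagination logic can be sketched outside HBase as well. The code below is plain Python over the same toy row model, not a coprocessor or filter, and it is not the Feedie or RoyalMail implementation. It assumes each blob is JSON carrying its own index value under an `id` field, which is purely an illustrative convention. In a real deployment this logic would live server-side in Java; HBase's built-in ColumnPaginationFilter offers comparable limit/offset paging over columns.

```python
import json
import struct

def trim(row: dict, max_size: int):
    """Keep the first max_size sorted entries; drop the rest and their index entries."""
    ordered = sorted(row["sorted"])
    for q in ordered[max_size:]:
        blob = row["sorted"].pop(q)
        index_value = json.loads(blob)["id"]  # assumption: blob carries its index value
        row["index"].pop(index_value.encode(), None)

def page(row: dict, limit: int, after: bytes = None):
    """Return up to `limit` (qualifier, blob) pairs sorting strictly after `after`."""
    ordered = sorted(row["sorted"])
    if after is not None:
        ordered = [q for q in ordered if q > after]
    return [(q, row["sorted"][q]) for q in ordered[:limit]]

row = {"sorted": {}, "index": {}}
for i in range(5):
    blob = json.dumps({"id": f"msg-{i}"}).encode()
    row["sorted"][struct.pack(">q", i)] = blob
    row["index"][f"msg-{i}".encode()] = blob

trim(row, max_size=3)                              # entries 3 and 4 are dropped
first = page(row, limit=2)                         # first page
second = page(row, limit=2, after=first[-1][0])    # resume after the last qualifier seen
```

With inverted qualifiers, the first entries in sort order are the newest, so trimming the tail discards the oldest items. Paging by "last qualifier seen" rather than by numeric offset keeps pages stable when new entries arrive at the head of the collection.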

Continued here...

