Apache HBase

Wednesday May 01, 2013

Migration to the New Metrics Hotness – Metrics2

by Elliott Clarke

HBase Committer and Cloudera Engineer 

NOTE: This blog post describes the server code of HBase. It assumes a general knowledge of the system. You can read more about HBase in the blog posts here.

Introduction

HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help  spot problems before they arise, debug on-going issues, evaluate new usage patterns, and provide insight into capacity planning.

Since October 2008, version 0.19.0 , (HBASE-625) HBase has been using Hadoop’s metrics system to export metrics to JMX, Ganglia, and other metrics sinks. As the code base grew, more and more metrics were added by different developers. New features got metrics. When users needed more data on issues, they added more metrics. These new metrics were not always consistently named, and some were not well documented.

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started.  I also wanted to clean up the implementation cruft that had accumulated.

Definitions

The implementation details are pretty dense on terminology so lets make sure everything is defined:

  • Metric: A measurement of a property in the system.

  • Snapshot: A set of metrics at a given point in time.

  • Metrics1: The old Apache Hadoop metrics system.

  • Metrics2: The new overhauled Apache Hadoop Metrics system.

  • Source: A class that exposes metrics to the Hadoop metrics system.

  • Sink: A class that receives metrics snapshots from the Hadoop metrics system.

  • JMX: Java Management Extension. A system built into java that facilitates the management of java processes over a network; it includes the ability to expose metrics.

  • Dynamic Metrics: Metrics that come and go. These metrics are not all known at compile time; instead they are discovered at runtime.

Implementation

The Hadoop Metrics2 system implementations in branch-1 and branch-2 have diverged pretty drastically. This means that a single implementation of the code to move metrics from HBase to metrics2 sinks would not be performant or easy. As a result I created different hadoop compatibility shims and a system to load a version at runtime. This led to using ServiceLoader to create an instance of any class that touched parts of Hadoop that had changed between branch-1 and branch-2.

Here is an example of how a region server could request a Hadoop 2 version of the shim for exposing metrics about the HRegionServer. (Hadoop 1’s compatibility jar is shown in dotted lines to indicate that it could be swapped in if Hadoop 1 was being used)



This system allows HBase to support both Hadoop 1.x and Hadoop 2.x implementations without using reflection or other tricks to get around differences in API, usage, and naming.

Now that HBase can use either the Hadoop 1 or Hadoop 2 versions of the metrics 2 systems, I set about cleaning up what metrics HBase exposes, how those metrics are exposed, naming, and performance of gathering the data.

Metrics2 uses either annotations or sources to expose metrics. Since HBase can’t require any part of the metrics2 system in the core classes I exposed all metrics from HBase by creating sources. For metrics that are known ahead of time I created wrappers around classes in the core of HBase that the metrics2 shims could interrogate for values. Here is an example on how HRegionServer’s metrics(the non-dynamic metrics) are exposed :



The above pattern can be repeated to expose a great deal of the metrics that HBase has. However metrics about specific regions are still very interesting but can’t be exposed following the above pattern. So a new solution that would allow metrics about regions to be exposed by whichever HRegionServer is hosting that region was needed. To complicate things further Hadoop’s metrics2 system needs one MetricsSource to be responsible for all metrics that are going to be exposed through a JMX mbean. In order for metrics about regions to be well laid out, HBase needs a way to aggregate metrics from multiple regions into one source. This source will then be responsible for knowing what regions are assigned to the regionserver. These requirements led me to have one aggregation source that contains sudo-sources for each region. These sudo-sources each contain a wrapper around the region. This leads to something that looks like this:

Benefits

That’s a lot of work to re-do a previously working metrics system, so what was gained by all this work? The entire system is much easier to test in unit and systems tests. The whole system has been made more regular; that is everything follows the same patterns and naming conventions. Finally everything has been rewritten to be faster.

Since the previous metrics have all been added on as needed they were not all named well. Some metrics were named following the pattern: “metricNameCount” others were named following “numMetricName” while still others were named like “metricName_Count”. This made parsing hard and gave a generally chaotic feel. After the overhaul metrics that are a counter start with the camel cased metric name followed by the suffix “Count.” The mbeans were poorly laid out. Some metrics we spread out between two mbeans. Metrics about a region were under an mbean named Dynamic, not the most descriptive name. Now mbeans are much better organized and have better descriptions.

Tests have found that single threaded scans run as much as 9% faster after HBase’s old metrics system has been replaced. The previous system used lots of ConcurrentHashMap’s to store dynamic metrics. All accesses to mutate these metrics required a lookup into these large hash maps. The new system minimizes the use of maps. Instead every region or server exports metrics to one pseudo source. The only changes to hashmaps in the metrics system occurs on region close or open.

Conclusions

Overall the whole system is just better. The process was long and laborious, but worth it to make sure that HBase’s metrics system is in a good state. HBase 0.95, and later 0.96, will have the new metrics system.  There’s still more work to be completed but great strides have been made.

Thursday Apr 25, 2013

HBase - Who needs a Master?

By Matteo Bertozzi (mbertozzi at apache dot org), HBase Committer and Engineer on the Cloudera HBase Team.

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.


HBase provides low-latency random reads and writes on top of HDFS and it’s able to handle petabytes of data. One of the interesting capabilities in HBase is Auto-Sharding, which simply means that tables are dynamically distributed by the system when they become too large.

Regions and Region Servers

The basic unit of scalability, that provides the horizontal scalability, in HBase is called Region. Regions are a subset of the table’s data, and they are essentially a contiguous, sorted range of rows that are stored together.

Initially, there is only one region for a table.  When regions become too large after adding more rows, the region is split into two at the middle key, creating two roughly equal halves.

Looking back at the HBase architecture the slaves are called Region Servers. Each Region Server is responsible to serve a set of regions, and one Region (i.e. range of rows) can be served only by one Region Server.


The HBase Architecture has two main services: HMaster that is responsible for coordinating Regions in the cluster and execute administrative operations; HRegionServer responsible to handle a subset of the table’s data.

HMaster, Region Assignment and Balancing

The HBase Master coordinates the HBase Cluster and is responsible for administrative operations.


A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup and the master can decide to move a Region from one Region Server to another as the result of a load balance operation. The Master also handles Region Server failures by assigning the region to another Region Server.


The mapping of Regions to Region Servers is kept in a system table called META. By reading META, you can identify which region is responsible for your key. This means that for read and write operations, the master is not involved at all, and clients can go directly to the Region Server responsible to serve the requested data.

Locating a Row-Key: Which Region Server is responsible?

To put or get a row clients don’t have to contact the master, clients can contact directly  the Region Server that handles the specified row. In the case of a scan clients can contact directly the set of Region Servers responsible for handling the set of keys.

To identify the Region Server, the client does a query on the META table.META is a system table, used to keep track of regions. It contains the server name and a region identifier comprised of a table name and the start row-key. By looking at the start-key and the next region start-key clients are able to identify the range of rows contained in a a particular region.

The client keeps a cache for the region locations. This avoids the need for clients to hit the META table every time an operation on the same region is issued. In case of a region split or move to another Region Server (due to balancing, or assignment policies) the client will receive an exception as response and the cache will be refreshed by fetching the updated information from the META table.


Since META is a table like the others, the client has to identify on which server META is located. The META locations are stored in a ZooKeeper node on assignment by the Master, and the client reads directly the node to get the address of the Region Server that contains META.


The original design was based on BigTable with another table called -ROOT- containing the META locations and ZooKeeper pointing to it. HBase 0.96 removed that in favor of ZooKeeper only since META cannot be split and therefore consists of a single region.

Client API - Master and Regions responsibilities

The HBase java client API is composed of two main interfaces.

  • HBaseAdmin: allows interaction with the “table schema" by creating/deleting/modifying tables, and it allows interaction with the cluster by assigning/unassigning regions, merging regions together, calling for a flush, and so on.  This interface communicates with the Master.

  • HTable: allows the client to manipulate the data of a specified table, by using get, put, delete and all the other data operations.  This interface communicates directly with the Region Servers responsible for handling the requested set of keys.

Those two interfaces have separate responsibilities: HBaseAdmin is only used to execute admin operations and communicate with the Master while the HTable is used to manipulate data and communicate with the Regions.

Conclusion

As we’ve seen in this article, having a Master/Slave architecture does not mean that each operation goes through the master. To read and write data the HBase client, in fact, goes directly to the specific Region Server responsible to handle the row keys for all the data operations (HTable). The Master is used by the client only for table creation, modification and deletion operations (HBaseAdmin).

Although there exists a concept of a Master, the HBase client does not depend on it for data operations and the cluster can keep serving data even if the master goes down. 

Friday Jan 11, 2013

Apache HBase Internals: Locking and Multiversion Concurrency Control

by Gregory Chanan

HBase Committer

Cloudera, Inc. 

NOTE: This blog post describes how Apache HBase does concurrency control.  This assumes knowledge of the HBase write path, which you can read more about in this other blog post.

Introduction

Apache HBase provides a consistent and understandable data model to the user while still offering high performance.  In this blog, we’ll first discuss the guarantees of the HBase data model and how they differ from those of a traditional database.  Next, we’ll motivate the need for concurrency control by studying concurrent writes and then introduce a simple concurrency control solution.  Finally, we’ll study read/write concurrency control and discuss an efficient solution called Multiversion Concurrency Control.

Why Concurrency Control?

In order to understand HBase concurrency control, we first need to understand why HBase needs concurrency control at all; in other words, what properties does HBase guarantee about the data that requires concurrency control?

The answer is that HBase guarantees ACID semantics per-row.  ACID is an acronym for:
• Atomicity: All parts of transaction complete or none complete
• Consistency: Only valid data written to database
• Isolation: Parallel transactions do not impact each other’s execution
• Durability: Once transaction committed, it remains

If you have experience with traditional relational databases, these terms may be familiar to you.  Traditional relational databases typically provide ACID semantics across all the data in the database; for performance reasons, HBase only provides ACID semantics on a per-row basis.  If you are not familiar with these terms, don’t worry.  Instead of dwelling on the precise definitions, let’s look at a couple of examples.

Writes and Write-Write Synchronization

Consider two concurrent writes to HBase that represent {company, role} combinations I’ve held:



Image 1.  Two writes to the same row

From the previously cited Write Path Blog Post, we know that HBase will perform the following steps for each write:

(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each data cell [the (row, column) pair] to the memstore
List 1. Simple list of write steps

That is, we write to the WAL for disaster recovery purposes and then update an in-memory copy (MemStore) of the data.

Now, assume we have no concurrency control over the writes and consider the following order of events:



Image 2.  One possible order of events for two writes

At the end, we are left with the following state:


Image 3.  Inconsistent result in absence of write-write synchronization

which is a role I’ve never held.  In ACID terms, we have not provided Isolation for the writes, as the two writes became intermixed.

We clearly need some concurrency control.  The simplest solution is to provide exclusive locks per row in order to provide isolation for writes that update the same row.  So, our new list of steps for writes is as follows (new steps are in blue).

(0) Obtain Row Lock
(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each cell to the memstore
(3) Release Row Lock
List 2: List of write-steps with write-write synchronization

Read-Write Synchronization

So far, we’ve added row locks to writes in order to guarantee ACID semantics.  Do we need to add any concurrency control for reads?  Let’s consider another order of events for our example above (Note that this order follows the rules in List 2):
Image 4.  One possible order of operations for two writes and a read

Assume no concurrency control for reads and that we request a read concurrently with the two writes.  Assume the read is executed directly before “Waiter” is written to the MemStore; this read action is represented by a red line above.  In that case, we will again read the inconsistent row:


Image 5.  Inconsistent result in absence of read-write synchronization

Therefore, we need some concurrency control to deal with read-write synchronization.  The simplest solution would be to have the reads obtain and release the row locks in the same manner as the writes.  This would resolve the ACID violation, but the downside is that our reads and writes would both contend for the row locks, slowing each other down.

Instead, HBase uses a form of Multiversion Concurrency Control (MVCC) to avoid requiring the reads to obtain row locks.  Multiversion Concurrency Control works in HBase as follows:

For writes:
(w1) After acquiring the RowLock, each write operation is immediately assigned a write number
(w2) Each data cell in the write stores its write number.
(w3) A write operation completes by declaring it is finished with the write number.

For reads:
(r1)  Each read operation is first assigned a read timestamp, called a read point.
(r2)  The read point is assigned to be the highest integer such that all writes with write number <= x have been completed.
(r3)  A read r for a certain (row, column) combination returns the data cell with the matching (row, column) whose write number is the largest value that is less than or equal to the read point of r.
List 3. Multiversion Concurrency Control steps

Let’s look at the operations in Image 4 again, this time using MultiVersion Concurrency Control:
Image 6.  Write steps with Multiversion Concurrency Control

Notice the new steps introduced for Multiversion Concurrency Control.  Each write is assigned a write number (step w1), each data cell is written to the memstore with its write number (step w2, e.g. “Cloudera [wn=1]”) and each write completes by finishing its write number (step w3).

Now, let’s consider the read in Image 4, i.e. a read that begins after step “Restaurant [wn=2]” but before the step “Waiter [wn=2]”.  From rule r1 and r2, its read point will be assigned to 1.  From r3, it will read the values with write number of 1, leaving us with:

Image 7.  Consistent answer with Multiversion Concurrency Control

A consistent response without requiring locking the row for the reads!

Let’s put this all together by listing the steps for a write with Multiversion Concurrency Control: (new steps required for read-write synchronization are in red):
(0) Obtain Row Lock
(0a) Acquire New Write Number
(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each cell to the memstore
(2a) Finish Write Number
(3) Release Row Lock

Conclusion

In this blog we first defined HBase’s row-level ACID guarantees.  We then demonstrated the need for concurrency control by studying concurrent writes and introduced a row-level locking solution.  Finally, we investigated read-write concurrency control and presented an efficient mechanism called Multiversion Concurrency Control (MVCC).

This blog post is accurate as of HBase 0.92.  HBase 0.94 has various optimizations, e.g. HBASE-5541 that will be described in a future blog post.

 

Thursday Mar 29, 2012

HBase Project Management Committee Meeting Minutes, March 27th, 2012

The bulk of the HBase Project Management Committed (PMC) met at the StumbleUpon offices in San Francisco, March 27th, 2012, ahead of the HBase Meetup that happened later in the evening. Below we post the agenda and minutes to let you all know what was discussed but also to solicit input and comment.

Agenda

A suggested agenda had been put on the PMC mailing list the day before the meeting soliciting that folks add their own items ahead of the meeting. The hope was that we'd put the agenda out on the dev mailing list before the meeting started to garner more suggestions but that didn’t happen. The agenda items below sort of followed on from the contributor pow-wow we had at salesforce a while back -- pow-wow agenda, pow-wow minutes (or see summarized version) -- but it was thought that we could go into more depth on a few of the topics raised then, in particular, project existential matters.

Here were the items put up for discussion:

  • Where do we want HBase to be in two years? What will success look like? Discuss. Make a short list.
  • How do we achieve the just made short list? What resources do we have to hand? What can we deploy in the short term to help achieve our just stated objectives? What do we need to prioritize in near future? Make a short list.
  • How do we exchange best practices/design decisions when developing new features? Sometimes there may be more things that can be shared if everyone follows the same best practices, and less features need to be implemented.

Minutes 

Attendees

  • Todd Lipcon
  • Ted Yu
  • Jean-Daniel Cryans
  • Jon Hsieh
  • Karthik Ranganathan
  • Kannan Muthukkaruppan
  • Andrew Purtell
  • Jon Gray
  • Nicolas Spiegelberg
  • Gary Helmling
  • Mikhail Bautin
  • Lars Hofhansl
  • Michael Stack


Todd, the secretary, took notes. St.Ack summarized his notes into the below minutes. The meeting started at about 4:20 (after the requisite 20 minutes dicking w/ A/V/ and dial-in setup).

“Where do we want HBase to be in two years? What will success look like? Discuss. Make a short (actionable) list.”

Secondary indexes and transactions were suggested as was operations on a parity w/ MySQL and rock solid stable so it could be used as primary copy of data. It was also suggested that we expend effort making HBase more usable out of the box (auto-tuning/defaults). Then followed discussion of who is HBase for? Big companies? Or new users, or startups? Is our goal stimulating demand and creating demand? Or is it to be reactive to what problems people are actually hitting? A dose of reality had it that while it would be nice to make all possible users happy, and even to talk about doing this, in actuality, we are mostly going to focus on what our employers need rather than prospective ‘customers’.

After this detour, the list making became more blunt and basic. It was suggested that we build a rock solid db which other people might be able to build on top of for higher-level stuff. The core engine needs to work reliably -- lets do this first -- and then talk of new features and add-ons. Meantime, we can point users to coprocessers for building extensions and features w/o their needing to touch core hbase (It was noted that we are open to extending CPs if we have to to extend the ‘control surface’ exposed but that coverage has been pretty much sufficient up to this). Usability is important but operability is more important. Don’t need to target n00bs. First nail ops being able to understand whats going on.

After further banter, we arrived at list: reliability, operability (insight into the running application, dynamic config. changes, usability improvements that make it easier on a clueful ops), and performance (in this order). It was offered that we are not too bad on performance -- especially in 0.94 -- and that use cases will drive the performance improvements so focus should be on the first two items in the list.

“How do we achieve the just made short list?”

To improve reliability, testing has to be better. This has been said repeatedly in the past. It was noted that progress has been made at least on our unit test story (Tests have been //ized, more of hbase is now testable because of refactorings). Progress on integration tests and or contributions to Apache Bigtop have not progressed. As is, BigTop is a little "cryptic"; its a different harness with shim layers to insulate against version changes. We should help make it easier. We should add being able to do fault injection. Should hbase integration tests be done out in the open continuously running on something like an EC2 cluster that all can observe? This is the BigTop goal but its not yet there. Of note, EC2 can’t be used validating performance. Too erratic. Bottom line, improving testing requires bodies. Resources such as hardware, etc., are relatively easy to come by. Hard is getting an owner and bodies to write the tests and test infrastructure. Each of us has our own hack testing setup. Doing a general test infrastructure whether on BigTop or not is a bit of chicken and egg problem. Lets just make sure that who ever hits the need to do this general test infrastructure tooling first, that they do it out in the open and that we all pile on and help. Meantime we'll keep up w/ our custom test tools.

Regards the current state of reliability, its thought that as is, we can’t run a cluster continuously w/ a chaos monkey operating. There are probably still “hundreds of issues” to be fixed before we’re at a state where we can run for days under “chaos monkey” conditions. For example, a recent experiment killing a rack in a large cluster and left it down for an hour. This turned up lots of bugs (On the other hand, we learned that an experiment done by another company recently where the downtime was less ‘recovered’ without problems). Areas to explore improving reliability include testing network failure scenarios, better MTTR, and better toleration of network partitions.

Also on reliability, what core fixes are outstanding? There are still races in the master and issues where bulk cluster operations -- enable/disable -- fail in the middle. We need zookeeper-intent log (or something like Accumulo FATE) and table read/write locks. Meantime, kudos to the recently checked in hbck work because this can do fixup until missing fixes are put in place.

Regards operability, we need more metrics -- e.g. 95th/99th percentile metrics (some of these just got checked in) -- and better stories around backups, point in time recovery, and cross-column family snapshots. We need to encourage/facilitate/deploy move to HA NN. Talk was about more metrics client-side rather than on the server. On metrics, what else do we need to expose? Log messages should be ‘actionable’; include help on what you need to do to solve an issue. Dynamic config. is going to help operability (here soon); we need to make sure that the important configs are indeed changeable on the fly.

 Its thought that performance is taking care of itself (witness the suite of changes in 0.94.0).

“How do we exchange best practices, etc, when developing new features?”

Many are solving similar problems. We don’t always need to build new features. Maybe ops tricks are enough in many cases (two clusters if need two applications isolated rather than build a bunch of multi-tenacy code). People often go deep into design/code, then post the design + code only after they already spent a long time. Suggested that first should be discussion with the community early before writing code and designs. Regards any features proposal, its thought that the ‘why’ is the most important thing that needs justifying, not necessarily the technical design. Also, testability and disruption of core needs to be considered proposing new feature. Designs or any document needs to be short. Two pages at max. Hard for folks to spend the time if it goes on longer (Hadoop’ HEP process was mentioned at a proposal that failed).

General Discussion 

A general discussion went around on what to do about features not wanted whether because of the justification, the design, or the code and of how the latter is hardest to deal with especially if a feature large (Code review takes time). Was noted that we don’t have to commit everything, that we can revert stuff, and that its ok to throw something away even if a bunch of work has been done (A recent fb example around reuse of block cache blocks was cited where a bunch of code and review resulted in conclusion that the benefits were inconclusive so the project was abandoned). It was noted that the onus is on us to help contributors better understand what would be good things to work on moving the project forward. Was suggested that rather than a ‘sea of jiras’, instead we’d surface a short-list of what we think an important list of things to work on. General assent that roadmaps don’t work but should be easy to put up list of whats important in near and long term future for prospective contributors to pick up on. Was noted though that we also need to remain open to new features, etc., that don’t fit near-or-far-term project themes. This is open source after all. Was mentioned that we should work on high-level guidelines for how best to contribute. These need to be made public. Make a start, any kinda of start, on defining the project focus / design rationale. It doesn’t have to be perfect - just put a few words on the page, maybe even up on the home page.

Meeting was adjourned around 5:45pm.

Wednesday Feb 01, 2012

Coprocessor Introduction

Coprocessors is a new feature added to HBase 0.92. It provides great potential to extend HBase by adding user defined functionality to HBase.[Read More]

Calendar

Search

Hot Blogs (today's hits)

Tag Cloud

Categories

Feeds

Links

Navigation