Entries tagged [hbase]

Thursday Dec 17, 2015

Offheaping the Read Path in Apache HBase: Part 1 of 2

Details on the work involved in making the Apache HBase read path work against off-heap memory (without copying).[Read More]

Saturday Jun 06, 2015

Saving CPU! Using Native Hadoop Libraries for CRC computation in HBase

by Apekshit Sharma, HBase contributor and Cloudera Engineer

TL;DR: Use the Hadoop native library for CRC calculation and save CPU!

Checksums in HBase

Checksums are used to check data integrity. HDFS computes and stores checksums for all files on write. One checksum is written per chunk of data (the size can be configured using bytes.per.checksum) in a separate, companion checksum file. When data is read back, the file with the corresponding checksums is read as well and is used to ensure data integrity. However, having two files results in two disk seeks when reading any chunk of data. For HBase, the extra seek while reading an HFileBlock results in extra latency. To work around the extra seek, HBase inlines checksums: it calculates checksums for the data in an HFileBlock and appends them to the end of the block itself on write to HDFS (HDFS then checksums the HBase data plus inline checksums). On read, HDFS checksum verification is turned off by default, and HBase itself verifies data integrity.
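The inline-checksum layout described above can be sketched as follows. This is an illustrative simplification using java.util.zip.CRC32, not HBase's actual HFileBlock format; the chunk size constant and the class/method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch of inline checksumming: one CRC32 per chunk of data is appended
// after the data itself, mirroring the HFileBlock layout described above.
public class InlineChecksum {
    static final int BYTES_PER_CHECKSUM = 512; // analogous to bytes.per.checksum

    // Returns the data followed by one 4-byte CRC32 per chunk.
    public static byte[] appendChecksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        ByteBuffer out = ByteBuffer.allocate(data.length + 4 * chunks);
        out.put(data);
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            out.putInt((int) crc.getValue());
        }
        return out.array();
    }

    // Verifies each chunk against its stored checksum; false on any mismatch.
    public static boolean verify(byte[] block, int dataLen) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        int chunks = (dataLen + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, dataLen - off);
            crc.reset();
            crc.update(block, off, len);
            if (buf.getInt(dataLen + 4 * i) != (int) crc.getValue()) return false;
        }
        return true;
    }
}
```

On read, a single pass over the block plus its trailing checksums is enough to validate every chunk, which is the property that saves the second HDFS seek.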

Can we then get rid of HDFS checksum altogether? Unfortunately no. While HBase can detect corruptions, it can’t fix them, whereas HDFS uses replication and a background process to detect and *fix* data corruptions if and when they happen. Since HDFS checksums generated at write-time are also available, we fall back to them when HBase verification fails for any reason. If the HDFS check fails too, the data is reported as corrupt.

The related hbase configurations are hbase.hstore.checksum.algorithm, hbase.hstore.bytes.per.checksum and hbase.regionserver.checksum.verify. HBase inline checksums are enabled by default.

Calculating checksums is computationally expensive. When HDFS switched over to JNI + C for computing checksums, it saw big savings in CPU usage.

This post is about replicating those gains in HBase by using the Native Hadoop Libraries (NHL). See HBASE-11927.


We switched to the Hadoop DataChecksum library, which under the hood uses NHL if available and otherwise falls back to the Java CRC implementation. Another alternative considered was the ‘Circe’ library. The following comparison highlights the differences between the two and makes the reasoning behind our choice clear.

Hadoop Native Library:

  • Native code supports both crc32 and crc32c

  • Adds a dependency on hadoop-common, which is reliable and actively developed

  • The interface takes a stream of data, a stream of checksums, and a chunk size as parameters, and can compute/verify checksums chunk by chunk

Circe:

  • Native code supports only crc32c

  • Adds a dependency on an external project

  • Only supports calculating a single checksum over all the input data

Both libraries support use of the special x86 instruction for hardware calculation of CRC32C if available (defined in the SSE4.2 instruction set). In the case of NHL, hadoop-2.6.0 or newer is required for HBase to get the native checksum benefit.

However, based on the data layout of an HFileBlock, which has the ‘real data’ followed by checksums at the end, only NHL supported the interface we wanted. Implementing the same in Circe would have required significant effort, so we chose to go with NHL.


Since the metric to be evaluated was CPU usage, a simple configuration of two nodes was used. Node1 was configured to be the NameNode, Zookeeper and HBase master. Node2 was configured to be DataNode and RegionServer. All real computational work was done on Node2 while Node1 remained idle most of the time. This isolation of work on a single node made it easier to measure impact on CPU usage.


Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-24-generic x86_64)

CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz                                          

Socket(s) : 1

Core(s) per socket : 1

Thread(s) per core : 4

Logical CPU(s) : 4

Number of disks : 1

Memory : 8 GB

HBase Version/Distro *: 1.0.0 / CDH 5.4.0

*Since trunk resolves to hadoop-2.5.1 which does not have HDFS-6865, it was easier to use a CDH distro which already has HDFS-6865 backported.


We chose to study the impact on major compactions, mainly because of the presence of CompactionTool, which (a) can be used offline, and (b) allowed us to profile only the relevant workload. PerformanceEvaluation (bin/hbase pe) was used to build a test table which was then copied to local disk for reuse.

% ./bin/hbase pe --nomapred --rows=150000 --table="t1" --valueSize=10240 --presplit=10 sequentialWrite 10

Table size: 14.4G

Number of rows: 1.5M

Number of regions: 10

Row size: 10K

Total store files across regions: 67

For profiling, Lightweight-java-profiler was used, and FlameGraph was used to generate the graphs.

For benchmarking, the Linux ‘time’ command was used. Profiling was disabled during these runs. A script repeatedly executed the following steps in order:

  1. delete hdfs:///hbase

  2. copy t1 from local disk to hdfs:///hbase/data/default

  3. run compaction tool on t1 and time it



CPU profiling of HBase not using NHL (figure 1) shows that about 22% of CPU is used for generating and validating checksums, whereas with NHL (figure 2) it takes only about 3%.


Figure 1: CPU profile - HBase not using NHL (svg)


Figure 2: CPU profile - HBase using NHL (svg)


Benchmarking was done for three different cases: (a) neither HBase nor HDFS use NHL, (b) HDFS uses NHL but not HBase, and (c) both HDFS and HBase use NHL. For each case, we did 5 runs. Observations from table 1:

  1. Within a case, while real time fluctuates across runs, user and sys times remain the same. This is expected as compactions are IO bound.

  2. Using NHL only for HDFS reduces CPU usage by about 10% (A vs. B).

  3. Further, using NHL for HBase checksums reduces CPU usage by about 23% (B vs C).

All times are in seconds. This stackoverflow answer provides a good explanation of real, user and sys times.

run #   no native for HDFS and HBase (A)   no native for HBase (B)   native (C)
        real     user     sys              real    user    sys       real    user    sys

1       469.4    110.8    30.5             422.9   95.4    30.5      414.6   67.5    30.6

2       384.3    111.4    30.4             400.5   96.7    30.5      393.8   67.6    30.6

3       400.7    111.5    30.6             398.6   95.8    30.6      392.0   66.9    30.5

4       396.8    111.1    30.3             379.5   96.0    30.4      390.8   67.2    30.5

5       389.1    111.6    30.3             377.4   96.5    30.4      381.3   67.6    30.5

Table 1



The Native Hadoop Library leverages the special processor instruction (if available) that does pipelining and other low-level optimizations when performing CRC calculations. Using NHL in HBase for heavy checksum computation allows HBase to make use of this facility, saving significant amounts of CPU time spent checksumming.

Tuesday May 12, 2015

The HBase Request Throttling Feature

Govind Kamat, HBase contributor and Cloudera Performance Engineer

(Edited on 5/14/2015 -- changed ops/sec/RS to ops/sec.)

Running multiple workloads on HBase has always been challenging, especially when trying to execute real-time workloads while concurrently running analytical jobs. One possible way to address this issue is to throttle analytical MR jobs so that real-time workloads are less affected.

A new QoS (quality of service) feature that Apache HBase 1.1 introduces is request throttling, which controls the rate at which requests get handled by an HBase cluster. HBase typically treats all requests identically; however, the new throttling feature can be used to specify a maximum rate or bandwidth to override this behavior. The limit may be applied to requests originating from a particular user, or alternatively, to requests directed to a given table or a specified namespace.

The objective of this post is to evaluate the effectiveness of this feature and the overhead it might impose on a running HBase workload.  The performance runs carried out showed that throttling works very well, by redirecting resources from a user whose workload is throttled to the workloads of other users, without incurring a significant overhead in the process.

Enabling Request Throttling

It is straightforward to enable the request-throttling feature -- all that is necessary is to set the HBase configuration parameter hbase.quota.enabled to true. The related parameter hbase.quota.refresh.period specifies the time interval in milliseconds at which the regionserver re-checks for any new restrictions that have been added.

The throttle can then be set from the HBase shell, like so:

hbase> set_quota TYPE => THROTTLE, USER => 'uname', LIMIT => '100req/sec'

hbase> set_quota TYPE => THROTTLE, TABLE => 'tbl', LIMIT => '10M/sec'

hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns', LIMIT => 'NONE'
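For illustration only, the idea behind a 'req/sec' throttle can be sketched as a simple per-second request counter. This is not HBase's actual quota implementation (that lives in the org.apache.hadoop.hbase.quotas package); the class and method names below are invented.

```java
// Sketch of a fixed-window request limiter: requests beyond the per-second
// limit are rejected until the next one-second window begins. A caller in
// HBase would see this as a throttling error on the rejected request.
public class RequestRateLimiter {
    private final long limitPerSec;
    private long windowStartMs;
    private long consumed;

    public RequestRateLimiter(long limitPerSec, long nowMs) {
        this.limitPerSec = limitPerSec;
        this.windowStartMs = nowMs;
    }

    // Returns true if a request is admitted at time nowMs, false if throttled.
    public synchronized boolean tryAcquire(long nowMs) {
        if (nowMs - windowStartMs >= 1000) { // a new one-second window begins
            windowStartMs = nowMs;
            consumed = 0;
        }
        if (consumed < limitPerSec) {
            consumed++;
            return true;
        }
        return false;
    }
}
```

The real feature also supports bandwidth limits (e.g. '10M/sec') and per-table/per-namespace scoping, which a sketch like this does not capture.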

Test Setup

To evaluate how effectively HBase throttling worked, a YCSB workload was imposed on a 10 node cluster.  There were 6 regionservers and 2 master nodes.  YCSB clients were run on the 4 nodes that were not running regionserver processes.  The client processes were initiated by two separate users and the workload issued by one of them was throttled.

More details on the test setup follow.

HBase version: HBase 0.98.6-cdh5.3.0-SNAPSHOT (HBASE-11598 was backported to this version)


CentOS release 6.4 (Final)                                                

CPU sockets: 2                                                            

Physical cores per socket: 6

Total number of logical cores: 24

Number of disks: 12

Memory: 64 GB

Number of RS: 6

Master nodes: 2  (for the Namenode, Zookeeper and HBase master)

Number of client nodes: 4

Number of rows: 1080M

Number of regions: 180

Row size: 1K

Threads per client: 40

Workload: read-only and scan

Key distribution: Zipfian

Run duration: 1 hour


An initial data set was first generated by running YCSB in its data generation mode. An HBase table was created with the table specifications above and pre-split. After all the data was inserted, the table was flushed, compacted and saved as a snapshot. This data set was used to prime the table for each run. Read-only and scan workloads were used to evaluate performance; this eliminates effects such as memstore flushes and compactions. One run with a long duration was carried out first to ensure the caches were warmed and that the runs yielded repeatable results.

For the purpose of these tests, the throttle was applied to the workload emanating from one user in a two-user scenario. There were four client machines used to impose identical read-only workloads.  The client processes on two machines were run by the user “jenkins”, while those on the other two were run as a different user.   The throttle was applied to the workload issued by this second user.  There were two sets of runs, one with both users running read workloads and the second where the throttled user ran a scan workload.  Typically, scans are long running and it can be desirable on occasion to de-prioritize them in favor of more real-time read or update workloads.  In this case, the scan was for sets of 100 rows per YCSB operation.

For each run, the following steps were carried out:

  • Any existing YCSB-related table was dropped.

  • The initial data set was cloned from the snapshot.

  • The desired throttle setting was applied.

  • The desired workloads were imposed from the client machines.

  • Throughput and latency data was collected and is presented in the table below.

The throttle was applied at the start of the job (the command used was the first in the list shown in the “Enabling Request Throttling” section above).  The hbase.quota.refresh.period property was set to under a minute so that the throttle took effect by the time test setup was finished.

The throttle option specifically tested here was the one to limit the number of requests (rather than the one to limit bandwidth).

Observations and Results

The throttling feature appears to work quite well.  When applied interactively in the middle of a running workload, it goes into effect immediately after the quota refresh period and can be observed clearly in the throughput numbers put out by YCSB while the test is progressing.  The table below has performance data from test runs indicating the impact of the throttle.  For each row, the throughput and latency numbers are also shown in separate columns, one set for the “throttled” user (indicated by “T” for throttled) and the other for the “non-throttled” user (represented by “U” for un-throttled).

Read + Read Workload

Throttle (req/sec)   Avg Total Thruput   Thruput_U   Thruput_T   Latency_U   Latency_T
                     (ops/sec)           (ops/sec)   (ops/sec)   (ms)        (ms)

2500 rps             -                   -           -           -           -

2000 rps             -                   -           -           -           -

1500 rps             -                   -           -           -           -

1000 rps             -                   -           -           -           -

500 rps              -                   -           -           -           -

(The measured values in this table were not preserved in the archived post.)

As can be seen, when the throttle pressure is increased (by reducing the permitted throughput for user “T” from 2500 req/sec to 500 req/sec, as shown in column 1), the total throughput (column 2) stays around the same.  In other words, the cluster resources get redirected to benefit the non-throttled user, with the feature consuming no significant overhead.  One possible outlier is the case where the throttle parameter is at its most restrictive (500 req/sec), where the total throughput is about 10% less than the maximum cluster throughput.

Correspondingly, the latency for the non-throttled user improves while that for the throttled user degrades.  This is shown in the last two columns in the table.

The charts above show that the change in throughput is linear with the amount of throttling, for both the throttled and non-throttled user.  With regard to latency, the change is generally linear, until the throttle becomes very restrictive; in this case, latency for the throttled user degrades substantially.

One point that should be noted is that, while the throttle parameter in req/sec is indeed correlated to the actual restriction in throughput as reported by YCSB (ops/sec) as seen by the trend in column 4, the actual figures differ.  As user “T”’s throughput is restricted down from 2500 to 500 req/sec, the observed throughput goes down from 2500 ops/sec to 1136 ops/sec.  Therefore, users should calibrate the throttle to their workload to determine the appropriate figure to use (either req/sec or MB/sec) in their case.

Read + Scan Workload

Throttle (req/sec)   Thruput_U (ops/sec)   Thruput_T (ops/sec)   Latency_U (ms)   Latency_T (ms)

3000 Krps            -                     -                     -                -

1000 Krps            -                     -                     -                -

500 Krps             -                     -                     -                -

250 Krps             -                     -                     -                -

50 Krps              -                     -                     -                -

(The measured values in this table were not preserved in the archived post.)

With the read/scan workload, similar results are observed as in the read/read workload.  As the extent of throttling is increased for the long-running scan workload, the observed throughput decreases and latency increases.  Conversely, the read workload benefits, displaying better throughput and improved latency.  Again, the specific numeric value used to specify the throttle needs to be calibrated to the workload at hand.  Since scans break down into a large number of read requests, the throttle parameter needs to be much higher than in the case with the read workload.  Shown above is a log-linear chart of the impact on throughput of the two workloads when the extent of throttling is adjusted.


HBase request throttling is an effective and useful technique to handle multiple workloads, or even multi-tenant workloads on an HBase cluster.  A cluster administrator can choose to throttle long-running or lower-priority workloads, knowing that regionserver resources will get re-directed to the other workloads, without this feature imposing a significant overhead.  By calibrating the throttle to the cluster and the workload, the desired performance can be achieved on clusters running multiple concurrent workloads.

Friday May 01, 2015

Scan Improvements in HBase 1.1.0

Jonathan Lawlor, Apache HBase Contributor

Over the past few months there have been a variety of nice changes made to scanners in HBase. This post focuses on two such changes, namely RPC chunking (HBASE-11544) and scanner heartbeat messages (HBASE-13090). Both of these changes address long standing issues in the client-server scan protocol. Specifically, RPC chunking deals with how a server handles the scanning of very large rows and scanner heartbeat messages allow scan operations to progress even when aggressive server-side filtering makes infrequent result returns.


In order to discuss these issues, let's first gain a general understanding of how scans currently work in HBase.

From an application's point of view, a ResultScanner is the client side source of all of the Results that an application asked for. When a client opens a scan, it’s a ResultScanner that is returned and it is against this object that the client invokes next to fetch more data. ResultScanners handle all communication with the RegionServers involved in a scan and the ResultScanner decides which Results to make visible to the application layer. While there are various implementations of the ResultScanner interface, all implementations use basically the same protocol to communicate with the server.

In order to retrieve Results from the server, the ResultScanner will issue ScanRequests via RPCs to the different RegionServers involved in a scan. A client configures a ScanRequest by passing an appropriately configured Scan instance when opening the scan, setting start/stop rows, caching limits, the maximum result size limit, and the filters to apply.

On the server side, there are three main components involved in constructing the ScanResponse that will be sent back to the client in answer to a ScanRequest:


The RSRpcServices is a service that lives on RegionServers and responds to incoming RPC requests, such as ScanRequests. During a scan, the RSRpcServices is the server-side component responsible for constructing the ScanResponse that will be sent back to the client. The RSRpcServices continues to scan in order to accumulate Results until the region is exhausted, the table is exhausted, or a scan limit is reached (such as the caching limit or max result size limit). In order to retrieve these Results, the RSRpcServices must talk to a RegionScanner.


The RegionScanner is the server-side component responsible for scanning the different rows in the region. In order to scan through the rows in the region, the RegionScanner will talk to one or more StoreScanner instances (one per column family in the row). If the row passes all filtering criteria, the RegionScanner will return the Cells for that row to the RSRpcServices so that they can be used to form a Result to include in the ScanResponse.


The StoreScanner is the server side component responsible for scanning through the Cells in each column family.

When the client (i.e. ResultScanner) receives the ScanResponse back from the server, it can then decide whether or not scanning should continue. Since the client and the server communicate back and forth in chunks of Results, the client-side ResultScanner will cache all the Results it receives from the server. This allows the application to perform scans faster than the case where an RPC is required for each Result that the application sees.
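The client-side caching described above can be sketched as a minimal model: next() serves Results from a local cache and only issues an "RPC" (here a stub fetcher) when the cache runs dry. All names are invented for the example; the real ResultScanner is considerably richer.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch of a caching scanner: rows arrive from the "server" in chunks and
// are buffered locally, so most next() calls do not trigger a new fetch.
public class CachingScanner {
    interface ChunkFetcher { List<String> fetchNextChunk(); } // stands in for a ScanRequest RPC

    private final Queue<String> cache = new ArrayDeque<>();
    private final ChunkFetcher fetcher;
    private boolean exhausted;

    public CachingScanner(ChunkFetcher fetcher) { this.fetcher = fetcher; }

    // Returns the next row, fetching a new chunk only when the cache is empty;
    // returns null once the scan is exhausted (an empty chunk was returned).
    public String next() {
        if (cache.isEmpty() && !exhausted) {
            List<String> chunk = fetcher.fetchNextChunk();
            if (chunk.isEmpty()) exhausted = true; else cache.addAll(chunk);
        }
        return cache.poll();
    }
}
```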

RPC Chunking (HBASE-11544)

Why is it necessary?

Currently, the server sends back whole rows to the client (each Result contains all of the cells for that row). The max result size limit is only applied at row boundaries. After each full row is scanned, the size limit will be checked.

The problem with this approach is that it does not provide a granular enough restriction. Consider, for example, the scenario where each row being scanned is 100 MB. This means that 100 MB worth of data will be read between checks for the size limit. So, even in the case that the client has specified that the size limit is 1 MB, 100 MB worth of data will be read and then returned to the client.

This approach for respecting the size limit is problematic. First of all, it means that it is possible for the server to run out of memory and crash in the case that it must scan a large row. At the application level, you simply want to perform a scan without having to worry about crashing the region server.

This scenario is also problematic because it means that we do not have fine grained control over the network resources. The most efficient use of the network is to have the client and server talk back and forth in fixed sized chunks.

Finally, this approach for respecting the size limit is a problem because it can lead to large, erratic allocations server side playing havoc with GC.

Goal of the RPC Chunking solution

The goal of the RPC Chunking solution was to:

- Create a workflow that is more ‘regular’, less likely to cause Region Server distress

- Use the network more efficiently

- Avoid large garbage collections caused by allocating large blocks

Furthermore, we wanted the RPC chunking mechanism to be invisible to the application. The communication between the client and the server should not be a concern of the application. All the application wants is a Result for their scan. How that Result is retrieved is completely up to the protocol used between the client and the server.

Implementation Details

The first step in implementing this proposed RPC chunking method was to move the max result size limit to a place where it could be respected at a more granular level than on row boundaries. Thus, the max result size limit was moved down to the cell level within StoreScanner. This meant that now the max result size limit could be checked at the cell boundaries, and if in excess, the scan can return early.
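The cell-boundary size check can be sketched as follows. This is a simplified model of the idea, not actual StoreScanner code; the class and method names are invented.

```java
// Sketch of enforcing the max result size at cell boundaries: accumulation
// stops as soon as the limit is reached, possibly in the middle of a row,
// rather than only after each complete row.
public class ChunkedScan {
    // cellSizes holds each cell's size in bytes; returns how many cells are
    // accumulated before the size limit is first reached.
    public static int cellsUntilLimit(int[] cellSizes, int maxResultSize) {
        int accumulated = 0;
        int taken = 0;
        for (int size : cellSizes) {
            accumulated += size;
            taken++;
            if (accumulated >= maxResultSize) break; // return early, mid-row if needed
        }
        return taken;
    }
}
```

With the old row-boundary check, all cells of a 100 MB row would be read before a 1 MB limit was noticed; with the cell-boundary check, the scan stops within one cell of the limit.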

The next step was to consider what would occur if the size limit was reached in the middle of a row. In this scenario, the Result sent back to the client will be marked as a "partial" Result. By marking the Result with this flag, the server communicates to the client that the Result is not a complete view of the row. Further Scan RPC's must be made in order to retrieve the outstanding parts of the row before it can be presented to the application. This allows the server to send back partial Results to the client and then the client can handle combining these partials into "whole" row Results. This relieves the server of the burden of having to read the entire row at a time.
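Client-side combining of partial Results can be sketched like so; a minimal model of the behavior described above, not the real Result class, with invented names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the client reassembling a row from partial Results: partials
// are buffered until a Result arrives that is not marked partial, at which
// point the whole row becomes visible to the application.
public class PartialResultCombiner {
    static class Result {
        final String row; final List<String> cells; final boolean partial;
        Result(String row, List<String> cells, boolean partial) {
            this.row = row; this.cells = cells; this.partial = partial;
        }
    }

    private final List<String> buffered = new ArrayList<>();

    // Returns the complete row's cells once the final (non-partial) piece
    // arrives, or null while more partials are still expected.
    public List<String> add(Result r) {
        buffered.addAll(r.cells);
        if (r.partial) return null;
        List<String> whole = new ArrayList<>(buffered);
        buffered.clear();
        return whole;
    }
}
```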

Finally, these changes were performed in a backwards compatible manner. The client must indicate in its ScanRequest that partial Results are supported. If that flag is not seen server side, the server will not send back partial results. Note that the configuration hbase.client.scanner.max.result.size can be used to change the default chunk size. By default this value is 2 MB in HBase 1.1+.

An option (Scan#setAllowPartialResults) was also added so that an application can ask to see partial results as they are returned rather than wait on the aggregation of complete rows.

A consistent view of the row is maintained even though the row is assembled from multiple partial RPCs, because the server-side scanning context keeps track of the outstanding MVCC read point and will not include in the results any Cells written later.

Note that this feature will be available starting in HBase 1.1. All 1.1+ clients will chunk their responses.

Heartbeat messages for scans (HBASE-13090)

What is a heartbeat message?

A heartbeat message is a message used to keep the client-server connection alive. In the context of scans, a heartbeat message allows the server to communicate back to the client that scan progress has been made. On receipt, the client resets the connection timeout. Since the client and the server communicate back and forth in chunks of Results, it is possible that a single progress update will contain ‘enough’ Results to satisfy the requesting application’s purposes. So, beyond simply keeping the client-server connection alive and preventing scan timeouts, heartbeat messages also give the calling application an opportunity to decide whether or not more RPC's are even needed.

Why are heartbeat messages necessary in Scans?

Currently, scans execute server side until the table is exhausted, the region is exhausted, or a limit is reached. As such there is no way of influencing the actual execution time of a Scan RPC. The server continues to fetch Results with no regard for how long that process is taking. This is particularly problematic since each Scan RPC has a defined timeout. Thus, in the case that a particular scan is causing timeouts, the only solution is to increase the timeout so it spans the long-running requests (hbase.client.scanner.timeout.period). You may have encountered such problems if you have seen exceptions such as OutOfOrderScannerNextException.

Goal of heartbeat messages

The goals of the heartbeat message solution were to:

- Incorporate a time limit concept into the client-server Scan protocol

- The time limit must be enforceable at a granular level (between cells)

- Time limit enforcement should be tunable so that checks do not occur too often

Beyond simply meeting these goals, it is also important that, similar to RPC chunking, this mechanism is invisible to the application layer. The application simply wants to perform its scan and get a Result back on a call to ResultScanner.next(). It does not want to worry about whether or not the scan is going to take too long and if it needs to adjust timeouts or scan sizings.

Implementation details

The first step in implementing heartbeat messages was to incorporate a time limit concept into the Scan RPC workflow. This time limit was based on the configured value of the client scanner timeout.

Once a time limit was defined, we had to decide where this time limit should be respected. It was decided that this time limit should be enforced within the StoreScanner, because that allowed the time limit to be enforced at the cell boundaries. It is important that the time limit be checked at a fine-grained location because, in the case of restrictive filtering or time ranges, it is possible that large portions of time will be spent filtering out and skipping cells. If we wait to check the time limit at the row boundaries, it is possible when the row is wide that we may time out before a single check occurs.

A new configuration was introduced (hbase.cells.scanned.per.heartbeat.check) to control how often these time limit checks occur. The default value of this configuration is 10,000 meaning that we check the time limit every 10,000 cells.
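The periodic time-limit check can be sketched as follows; an illustrative simplification, not actual server code, with invented names. The cellsPerCheck parameter plays the role of hbase.cells.scanned.per.heartbeat.check.

```java
// Sketch of the heartbeat time-limit check: the scan loop only consults the
// clock every cellsPerCheck cells, and stops early when the time limit is
// reached so a heartbeat ScanResponse can be sent back to the client.
public class HeartbeatScan {
    // Returns the number of cells scanned before the time-limit check fired,
    // or totalCells if the scan completed. clock supplies timestamps in ms.
    public static int scan(int totalCells, int cellsPerCheck, long timeLimitMs,
                           java.util.function.LongSupplier clock) {
        long start = clock.getAsLong();
        for (int scanned = 1; scanned <= totalCells; scanned++) {
            // ... process one cell (filtering, skipping, accumulating) ...
            if (scanned % cellsPerCheck == 0
                && clock.getAsLong() - start >= timeLimitMs) {
                return scanned; // stop early; respond with a heartbeat message
            }
        }
        return totalCells;
    }
}
```

Checking only every N cells keeps the cost of the clock reads negligible while still bounding how far past the time limit a scan can run.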

Finally, if this time limit is reached, the ScanResponse sent back to the client is marked as a heartbeat message. It is important that the ScanResponse be marked in this way because it communicates to the client the reason the message was sent back from the server. Since a time limit may be reached before any Results are retrieved, it is possible that the heartbeat message's list of Results will be empty. We do not want the client to interpret this as meaning that the region was exhausted.

Note that similar to RPC chunking, this feature was implemented in a fully backwards compatible manner. In other words, heartbeat messages will only be sent back to the client in the case that the client has specified that it can recognize when a ScanResponse is a heartbeat message. Heartbeat messages will be available starting in HBase 1.1. All HBase 1.1+ clients will heartbeat.

Cleaning up the Scan API (HBASE-13441)

Following these recent improvements to the client-server Scan protocol, there is now an effort to clean up the Scan API. Some APIs are getting stale and don’t make much sense anymore, especially in light of the above changes. There are also some APIs that could use some love with regards to documentation.

For example the caching and max result size API’s deal with how data is transferred between the client and the server. Both of these API’s now seem misplaced in the Scan API. These are details that should likely be controlled by the RegionServer rather than the application. It seems much more appropriate to give the RegionServer control over these parameters so that it can tune them based on the current state of the RPC pipeline and server loadings.

HBASE-13441 Scan API Improvements is the open umbrella issue covering ideas for Scan API improvements. If you have some time, check it out. Let us know if you have some ideas for improvements that you would like to see.

Other recent Scanner improvements

It’s important to note that these are not the only improvements that have been made to scanners recently. Many other impressive improvements have come through, each deserving of its own blog post. For those that are interested, I have listed two:

  • HBASE-13109: Make better SEEK vs SKIP decisions during scanning

  • HBASE-13262: Fixed a data loss issue in Scans


A spate of Scan improvements should make for a smoother Scan experience in HBase 1.1.

Thursday Apr 16, 2015

Come to HBaseCon2015!

Come learn about (hbasecon.com):

  • A highly-trafficked HBase cluster with an uptime of sixteen months
  • An HBase deploy that spans three datacenters doing master-master replication between thousands of HBase nodes in each
  • Some nuggets on how Bigtable does it (and HBase could too)
  • How HBase is being used to record patient telemetry 24/7 on behalf of the Michael J. Fox Foundation to enable breakthroughs in Parkinson's disease research
  • How Pinterest and Microsoft are doing it in the cloud, how FINRA and Bloomberg are doing it out east, and how Rocketfuel, Yahoo! and Flipboard, etc., are representing the west

HBaseCon 2015 is the fourth annual edition of the community event for Apache HBase contributors, developers, admins, and users of all skill levels. The event is hosted and organized by Cloudera, with the Program Committee including leaders from across the HBase community. The conference will be one full day comprising general sessions in the morning, breakout sessions in the morning and afternoon, and networking opportunities throughout the day.

The agenda has been posted here: hbasecon2015 agenda. You can register here.


The HBaseCon Program Committee 

Thursday Mar 05, 2015

HBase ZK-less Region Assignment

ZK-less region assignment allows us to achieve greater scale as well as faster startups and assignment. It is simpler and has less code, and it improves the speed at which assignments run, so we can do faster rolling restarts. This feature will be on by default in HBase 2.0.0.[Read More]

Tuesday Feb 24, 2015

Start of a new era: Apache HBase 1.0

Past, present and future state of the community

Author: Enis Söztutar, Apache HBase PMC member and HBase-1.0.0 release manager

The Apache HBase community has released Apache HBase 1.0.0. Seven years in the making, it marks a major milestone in the Apache HBase project’s development, offers some exciting features and new API’s without sacrificing stability, and is both on-wire and on-disk compatible with HBase 0.98.x.

In this blog, we look at the past, present and future of Apache HBase project. 

Versions, versions, versions 

Before enumerating feature details of this release let’s take a journey into the past and how release numbers emerged. HBase started its life as a contrib project in a subdirectory of Apache Hadoop, circa 2007, and released with Hadoop. Three years later, HBase became a standalone top-level Apache project. Because HBase depends on HDFS, the community ensured that HBase major versions were identical and compatible with Hadoop’s major version numbers. For example, HBase 0.19.x worked with Hadoop 0.19.x, and so on.

However, the HBase community wanted to ensure that an HBase version could work with multiple Hadoop versions, not only the one with its matching major release number. Thus, a new naming scheme was invented where the releases would start at the close-to-1.0 major version of 0.90, as shown above in the timeline. We also took on an even-odd release number convention where releases with odd version numbers were “developer previews” and even-numbered releases were “stable” and ready for production. The stable release series included 0.90, 0.92, 0.94, 0.96 and 0.98 (see HBase Versioning for an overview).

After 0.98, we named the trunk version 0.99-SNAPSHOT, but we officially ran out of numbers! Levity aside, last year the HBase community agreed that the project had matured and stabilized enough that a 1.0.0 release was due. After three releases in the 0.99.x series of “developer previews” and six Apache HBase 1.0.0 release candidates, HBase 1.0.0 has now shipped! See the above diagram, courtesy of Lars George, for a timeline of releases. It shows each release line together with its support lifecycle and any preceding developer preview releases (0.99->1.0.0, for example).

HBase-1.0.0, start of a new era

The 1.0.0 release has three goals:

1) to lay a stable foundation for future 1.x releases;

2) to stabilize running HBase clusters and their clients; and

3) to make versioning and compatibility dimensions explicit.

Including previous 0.99.x releases, 1.0.0 contains over 1500 jiras resolved. Some of the major changes are: 

API reorganization and changes

HBase’s client-level API has evolved over the years. To simplify the semantics and to make the API extensible and easier to use in the future, we revisited it before 1.0. To that end, 1.0.0 introduces new APIs and deprecates some of the commonly-used client-side APIs (HTableInterface, HTable and HBaseAdmin).

We advise you to update your application to use the new style of APIs, since deprecated APIs will be removed in the future 2.x series of releases. For further guidance, please visit these two decks: http://www.slideshare.net/xefyr/apache-hbase-10-release and http://s.apache.org/hbase-1.0-api.

All client-side APIs are marked with the InterfaceAudience.Public annotation, indicating whether a class/method is an official "client API" for HBase (see “11.1.1. HBase API Surface” in the HBase Refguide for more details on the audience annotations). Going forward, all 1.x releases are planned to be API compatible for classes annotated as client public.

Read availability using timeline consistent region replicas

As part of phase 1, this release contains an experimental "Read availability using timeline consistent region replicas" feature. That is, a region can be hosted in multiple region servers in read-only mode. One of the replicas for the region will be primary, accepting writes, and other replicas will share the same data files. Read requests can be done against any replica for the region with backup RPCs for high availability with timeline consistency guarantees. See JIRA HBASE-10070 for more details.
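The read pattern this enables (send the request to the primary, fire a backup request to a replica if the primary is slow, and flag replica results as possibly stale) can be sketched with plain Java futures. This is a toy model of the idea only, not the HBase client implementation, which has its own RPC layer and client API:

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

public class TimelineRead {
    static class Result {
        final String value;
        final boolean stale;
        Result(String value, boolean stale) { this.value = value; this.stale = stale; }
    }

    // Ask the primary first; if it does not answer within backupDelayMs,
    // also ask a replica and take whichever responds first.
    static Result read(ExecutorService pool,
                       Supplier<String> primary,
                       Supplier<String> replica,
                       long backupDelayMs) throws Exception {
        CompletableFuture<Result> p =
            CompletableFuture.supplyAsync(() -> new Result(primary.get(), false), pool);
        try {
            return p.get(backupDelayMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Backup RPC: replica data may lag the primary, so mark it stale.
            CompletableFuture<Result> r =
                CompletableFuture.supplyAsync(() -> new Result(replica.get(), true), pool);
            return (Result) CompletableFuture.anyOf(p, r).get();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        // Simulate a slow primary: the replica answers first and is marked stale.
        Result res = read(pool,
            () -> { sleep(500); return "fresh"; },
            () -> "possibly-old",
            50);
        System.out.println(res.value + " stale=" + res.stale);
        pool.shutdownNow();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}
```

The timeline-consistency guarantee comes from the stale flag: callers that need strict consistency simply never issue the backup request.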

Online config change and other forward ports from 0.89-fb branch

The 0.89-fb branch in Apache HBase was where Facebook used to post their changes. JIRA HBASE-12147 forward-ported the patches which enable reloading a subset of the server configuration without having to restart the region servers.

Apart from the above, there are hundreds of improvements, performance (improved WAL pipeline, using disruptor, multi-WAL, more off-heap, etc) and bug fixes and other goodies that are too long to list here. Check out the official release notes for a detailed overview. The release notes and the book also cover binary, source and wire compatibility requirements, supported Hadoop and Java versions, upgrading from 0.94, 0.96 and 0.98 versions and other important details.

HBase-1.0.0 is also the start of using “semantic versioning” for HBase releases. In short, future HBase releases will have MAJOR.MINOR.PATCH version with the explicit semantics for compatibility. The HBase book contains all the dimensions for compatibility and what can be expected between different versions.
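As an illustration of the convention (not an official HBase tool), a MAJOR.MINOR.PATCH compatibility check under the usual semantic-versioning rule of thumb, where patch and minor upgrades are safe and major upgrades may break, might look like:

```java
public class SemVer {
    final int major, minor, patch;

    SemVer(String v) {
        String[] parts = v.split("\\.");
        major = Integer.parseInt(parts[0]);
        minor = Integer.parseInt(parts[1]);
        patch = Integer.parseInt(parts[2]);
    }

    // Clients built against 'client' are expected to work with a server
    // at 'server' when the major version matches and the server is not older.
    static boolean clientCompatible(String client, String server) {
        SemVer c = new SemVer(client), s = new SemVer(server);
        if (c.major != s.major) return false;
        if (s.minor != c.minor) return s.minor > c.minor;
        return s.patch >= c.patch;
    }

    public static void main(String[] args) {
        System.out.println(clientCompatible("1.0.0", "1.2.3")); // true: same major
        System.out.println(clientCompatible("1.0.0", "2.0.0")); // false: major bump
    }
}
```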

What’s next

We have marked HBase-1.0.0 as the next stable version of HBase, meaning that all new users should start using this version. However, as a database, we understand that switching to a newer version might take some time. We will continue to maintain and make 0.98.x releases until the user community is ready for their end of life. 1.0.x releases, as well as the 1.1.0, 1.2.0, etc. lines of releases, are expected to be released from their corresponding branches, while 2.0.0 and other major releases will follow when their time arrives.

Read replicas phase 2, per column family flush, procedure v2, SSD for WAL or column family data, etc are some of the upcoming features in the pipeline. 


Finally, the HBase 1.0.0 release has come a long way, with contributions from a very large group of awesome people and hard work from committers and contributors. We would like to extend our thanks to our users and all who have contributed to HBase over the years.

Keep HBase’ing!

Friday Aug 08, 2014

Comparing BlockCache Deploys

Comparing BlockCache Deploys

St.Ack on August 7th, 2014

A report comparing BlockCache deploys in Apache HBase for issue HBASE-11323 BucketCache all the time! Attempts to roughly equate five different deploys and compare how well they do under four different loading types that vary from no cache-misses through to missing cache most of the time. Makes recommendation on when to use which deploy.


In Nick Dimiduk's BlockCache 101 blog post, he details the different options available in HBase. We test what remains after purging SlabCache and the DoubleBlockCache caching policy implementation, which were deprecated in HBase 0.98 and removed in trunk because of Nick's findings and those of others.

Nick varies the JVM Heap+BlockCache sizes AND dataset size.  This report keeps JVM Heap+BlockCache size constant and varies the dataset size only. Nick looks at the 99th percentile only.  This article looks at that, as well as GC, throughput and loadings. Cell sizes in the following tests also vary between 1 byte and 256k in size.


If the dataset fits completely in cache, the default configuration, which uses the onheap LruBlockCache, performs best.  GC is half that of the next most performant deploy type, CombinedBlockCache:Offheap with at least 20% more throughput.

Otherwise, if your cache is experiencing churn running a steady stream of evictions, move your block cache offheap using CombinedBlockCache in the offheap mode. See the BlockCache section in the HBase Reference Guide for how to enable this deploy. Offheap mode requires only one third to one half of the GC of LruBlockCache when evictions are happening. There is also less variance as the cache misses climb when offheap is deployed. CombinedBlockCache:file mode has better GC profile but less throughput than CombinedBlockCache:Offheap. Latency is slightly higher on CBC:Offheap -- heap objects to cache have to be serialized in and out of the offheap memory -- than with the default LruBlockCache but the 95/99th percentiles are slightly better.  CBC:Offheap uses a bit more CPU than LruBlockCache. CombinedBlockCache:Offheap is limited only by the amount of RAM available.  It is not subject to GC.


The graphs to follow show results from five different deploys each run through four different loadings.

Five Deploy Variants

  1. LruBlockCache The default BlockCache in place when you start up an unconfigured HBase install. With LruBlockCache, all blocks are loaded into the java heap. See BlockCache 101 and LruBlockCache for detail on the caching policy of LruBlockCache.
  2. CombinedBlockCache:Offheap CombinedBlockCache deploys two tiers; an L1 which is an LruBlockCache instance to hold META blocks only (i.e. INDEX and BLOOM blocks), and an L2 tier which is an instance of BucketCache. In this offheap mode deploy, the BucketCache uses DirectByteBuffers to host a BlockCache outside of the JVM heap to host cached DATA blocks.
  3. CombinedBlockCache:Onheap In this onheap ('heap' mode), the L2 cache is hosted inside the JVM heap, and appears to the JVM to be a single large allocation. Internally it is managed by an instance of BucketCache. The L1 cache is an instance of LruBlockCache.
  4. CombinedBlockCache:file In this mode, an L2 BucketCache instance puts DATA blocks into a file (hence 'file' mode) on a mounted tmpfs in this case.
  5. CombinedBlockCache:METAonly No caching of DATA blocks (no L2 instance). DATA blocks are fetched every time. The INDEX blocks are loaded into an L1 LruBlockCache instance.

Memory is fixed for each deploy. The java heap is a small 8G to bring on distress earlier. For deploy type 1., the LruBlockCache is given 4G of the 8G JVM heap. For 2.-5. deploy types., the L1 LruHeapCache is 0.1 * 8G (~800MB), which is more than enough to host the dataset META blocks. This was confirmed by inspecting the Block Cache vitals displayed in the RegionServer UI. For 2., 3., and 4. deploy types, the L2 bucket cache is 4G. Deploy type 5.'s L2 is 0G (Used HBASE-11581 Add option so CombinedBlockCache L2 can be null (fscache)).
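Deploy types 2 and 4 keep DATA blocks outside the JVM heap, so blocks must be copied (serialized) in and out, which is the source of the slightly higher latency noted earlier. A minimal illustration of that copy cost using a direct ByteBuffer, as a sketch rather than BucketCache's actual bucket allocator:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OffheapBlock {
    // "Cache" a block outside the JVM heap: bytes are copied into
    // native memory that the garbage collector never scans.
    static ByteBuffer cache(byte[] block) {
        ByteBuffer buf = ByteBuffer.allocateDirect(block.length);
        buf.put(block);   // serialize in: heap -> offheap copy
        buf.flip();
        return buf;
    }

    // Reading it back requires the reverse copy: offheap -> heap.
    static byte[] read(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out);  // duplicate() keeps the cached position intact
        return out;
    }

    public static void main(String[] args) {
        byte[] block = "DATA block bytes".getBytes(StandardCharsets.UTF_8);
        ByteBuffer cached = cache(block);
        System.out.println(new String(read(cached), StandardCharsets.UTF_8));
    }
}
```

The two copies cost CPU on every hit, but because the cached bytes live outside the heap they add nothing to GC pressure, which is the tradeoff the results above measure.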

Four Loading Types

  1. All cache hits all the time.
  2. A small percentage of cache misses, with all misses inside the fscache soon after the test starts.
  3. Lots of cache misses, all still inside the fscache soon after the test starts.
  4. Mostly cache misses where many misses by-pass the fscache.

For each loading, we ran 25 clients reading randomly over 10M cells of zipfian varied cell sizes from 1 byte to 256k bytes over 21 regions hosted by a single RegionServer for 20 minutes. Clients would stay inside the cache size for loading 1., miss the cache at just under ~50% for loading 2., and so on.

The dataset was hosted on a single RegionServer on a small HDFS cluster of 5 nodes.  See below for detail on the setup.


For each attribute -- GC, Throughput, i/o -- we have five charts across; one for each deploy.  Each graph can be divided into four parts; the first quarter has the all-in-cache loading running for 20minutes, followed by the loading that has some cache misses (for twenty minutes), through to loading with mostly cache misses in the final quarter.

To find out what each color represents, read the legend.  The legend is small in the below. To see a larger version, browse to this non-apache-roller version of this document and click on the images over there.

Concentrate on the first four graph types.  The BlockCache Stats and I/O graphs add little other than confirming that the loadings ran the same across the different BlockCache deploys with fscache cutting in to soak up the seeks soon after start for all but the last mostly-cache-miss loading cases.


Be sure to check the y-axis in the graphs below (you can click on chart to get a larger version). All profiles look basically the same but a check of the y-axis will show that for all but the no-cache-misses case, CombinedBlockCache:Offheap, the second deploy type, has the best GC profile (less GC is better!).

The GC for the CombinedBlockCache:Offheap deploy mode looks to be climbing as the test runs.  See the Notes section at the end for further comment.



CombinedBlockCache:Offheap is better unless there are few to no cache misses.  In that case, LruBlockCache shines (I missed why there is the step in the LruBlockCache all-in-cache section of the graph).




Latency when in-cache

Same as above except that we do not show the last loading -- we show the first three only -- since the last loading of mostly cache misses skews the diagrams such that it is hard to compare latency when we are inside cache (BlockCache and fscache).


Try aggregating the system and user CPUs when comparing.

BlockCache Stats

Curves are mostly the same except for the cache-no-DATA-blocks case. It has no L2 deployed.



Read profiles are about the same across all deploys and tests with a spike at first until the fscache comes around and covers. The exception is the final mostly-cache-misses case.


Master branch. 2.0.0-SNAPSHOT, r69039f8620f51444d9c62cfca9922baffe093610.  Hadoop 2.4.1-SNAPSHOT.

5 nodes all the same with 48G and six disks.  One master and one regionserver, each on distinct nodes, with 21 regions of 10M rows of zipf varied size -- 0 to 256k -- created as follows:

$ export HADOOP_CLASSPATH=/home/stack/conf_hbase:`./hbase-2.0.0-SNAPSHOT/bin/hbase classpath`
$ nohup  ./hadoop-2.4.1-SNAPSHOT/bin/hadoop --config /home/stack/conf_hadoop org.apache.hadoop.hbase.PerformanceEvaluation --valueSize=262144 --valueZipf --rows=100000 sequentialWrite 100 &

Here is my loading script.  It performs 4 loadings: All in cache, just out of cache, a good bit out of cache and then mostly out of cache:

[stack@c2020 ~]$ more bin/bc_in_mem.sh
date=`date -u +"%Y-%m-%dT%H:%M:%SZ"`
echo "testtype=$testtype $date" >> nohup.out
#for i in 38 76 304 1000; do
for i in 32 72 144 1000; do
  echo "`date` run size=${i}, clients=$clients ; $testtype time=$runtime size=$i" >> nohup.out
  timeout $runtime nohup ${HBASE_HOME}/bin/hbase --config /home/stack/conf_hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --valueSize=110000 --size=$i --cycles=$cycles randomRead $clients
done
Java version:
[stack@c2020 ~]$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

I enabled caching options by uncommenting these configuration properties in hbase-site.xml. The hfile.block.cache.size was set to 0.5 to keep the math simple.
<!--LRU Cache-->

<!--Bucket cache-->


The CombinedBlockCache has some significant overhead (to be better quantified -- 10%?). There will be slop left over because buckets will not fill to the brim, especially when block sizes vary. This was observed while running the loadings.

I tried to bring on a FullGC by running for more than a day with LruBlockCache fielding mostly BlockCache misses and I failed. Thinking on it, the BlockCache only occupied 4G of an 8G heap. There was always elbow room available (free block count and maximum allocatable block size settled down to a nice number and held constant). TODO: Retry but with less elbow room available.

Longer running CBC offheap test

Some of the tests above show disturbing GC tendencies -- ever-rising GC -- so I ran tests for a longer period. It turns out that BucketCache throws an OOME in long-running tests. You need HBASE-11678 BucketCache ramCache fills heap after running a few hours (fixed in hbase-0.98.6+). After the fix, the below test ran until the cluster was pulled out from under the loading:
GC over three days; throughput over three days
Here are some long running tests with bucket cache onheap (ioengine=heap). The throughput is less and GC is higher.
GC over three days (onheap); throughput over three days (onheap)


Does LruBlockCache get more erratic as heap size grows?  Nick's post implies that it does.
Auto-sizing of L1, onheap cache.

HBASE-11331 [blockcache] lazy block decompression has a report attached that evaluates the current state of the attached patch to keep blocks compressed while in the BlockCache.  It finds that with the patch enabled, "More GC. More CPU. Slower. Less throughput. But less i/o.", but there is an issue with compression block pooling that likely impinges on performance.

Review serialization in and out of L2 cache. No serialization?  Because we can take blocks from HDFS without copying onheap?

Friday Apr 11, 2014

The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size

By Doug Meil, HBase Committer and Thomas Murphy


One of the most common questions in the HBase user community is estimating disk footprint of tables, which translates into HFile size – the internal file format in HBase.

We designed an experiment at Explorys where we ran combinations of design time options (rowkey length, column name length, row storage approach) and runtime options (HBase ColumnFamily compression, HBase data block encoding options) to determine these factors’ effects on the resultant HFile size in HDFS.

HBase Environment

CDH4.3.0 (HBase

Design Time Choices

  1. Rowkey

    1. Thin

      1. 16-byte MD5 hash of an integer.

    2. Fat

      1. 64-byte SHA-256 hash of an integer.

    3. Note: neither of these is a realistic rowkey for real applications, but they were chosen because they are easy to generate and one is a lot bigger than the other.

  2. Column Names

    1. Thin

      1. 2-3 character column names (c1, c2).

    2. Fat

      1. 10 characters, randomly chosen but consistent for all rows.

    3. Note: it is advisable to have small column names, but most people don’t start that way so we have this as an option.

  3. Row Storage Approach

    1. Key Value Per Column

      1. This is the traditional way of storing data in HBase.

    2. One Key Value per row.

      1. Actually, two.

      2. One KV has an Avro serialized byte array containing all the data from the row.

      3. Another KV holds an MD5 hash of the version of the Avro schema.

Run Time

  1. Column Family Compression

    1. None

    2. GZ

    3. LZ4

    4. LZO

    5. Snappy

    6. Note: it is generally advisable to use compression, but what if you didn’t? So we tested that too.

  2. HBase Block Encoding

    1. None

    2. Prefix

    3. Diff

    4. Fast Diff

    5. Note: most people aren’t familiar with HBase Data Block Encoding. Primarily intended for squeezing more data into the block cache, it has effects on HFile size too. See HBASE-4218 for more detail.

1000 rows were generated for each combination of table parameters. Not a ton of data, but we don’t necessarily need a ton of data to see the varying size of the table. There were 30 columns per row comprised of 10 strings (each filled with 20 bytes of random characters), 10 integers (random numbers) and 10 longs (also random numbers).

HBase blocksize was 128k.
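Why fat rowkeys and fat column names bloat an uncompressed HFile follows directly from the KeyValue layout: the row key, family, and qualifier are repeated in every cell. A rough per-row estimate under this experiment's parameters can be sketched as follows. The 1-byte family name and the fixed per-KeyValue overhead figures are assumptions for illustration; the exact on-disk overhead differs slightly:

```java
public class KvSizeEstimate {
    // Approximate serialized size of one KeyValue: repeated key parts
    // plus assumed fixed overhead (timestamp 8B, type 1B, length fields ~12B).
    static long kvSize(int rowkeyLen, int familyLen, int qualifierLen, int valueLen) {
        return rowkeyLen + familyLen + qualifierLen + valueLen + 8 + 1 + 12;
    }

    // 30 columns per row: 10 x 20-byte strings, 10 x 4-byte ints, 10 x 8-byte longs.
    static long rowSize(int rowkeyLen, int qualifierLen) {
        int family = 1;  // assumed 1-character column family name
        return 10 * kvSize(rowkeyLen, family, qualifierLen, 20)
             + 10 * kvSize(rowkeyLen, family, qualifierLen, 4)
             + 10 * kvSize(rowkeyLen, family, qualifierLen, 8);
    }

    public static void main(String[] args) {
        long thin = rowSize(16, 2);  // 16-byte MD5 rowkey, 2-char qualifiers
        long fat  = rowSize(64, 10); // 64-byte SHA-256 rowkey, 10-char qualifiers
        System.out.printf("thin=%dB fat=%dB ratio=%.2f%n", thin, fat, (double) fat / thin);
    }
}
```

Because the key parts are repeated thirty times per row, the rowkey and qualifier lengths dominate the uncompressed footprint even though the actual values total only a few hundred bytes.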


The easiest way to navigate the results is to compare specific cases, progressing from an initial implementation of a table to options for production.

Case #1: Fat Rowkey and Fat Column Names, Now What?

This is where most people start with HBase. Rowkeys are not as optimal as they should be (i.e., the Fat rowkey case) and column names are also inflated (Fat column-names).

Without CF compression or data block encoding, the baseline HFile size is 6.2 Mb.


What if we just changed CF compression?

This drastically changes the HFile footprint. Snappy compression reduces the HFile size from 6.2 Mb to 1.8 Mb, for example.


However, we shouldn’t be too quick to celebrate. Remember that this is just the disk footprint. Over the wire the data is uncompressed, so 6.2 Mb is still being transferred from RegionServer to Client when doing a Scan over the entire table.

What if we just changed data block encoding?

Compression isn’t the only option though. Even without compression, we can change the data block encoding and also achieve HFile reduction. All options are an improvement over the 6.2 Mb baseline.


The following table shows the results of all remaining CF compression / data block encoding combinations.


Case #2: What if we re-designed the column names (and left the rowkey alone)?

Let’s assume that we re-designed our column names but left the rowkey alone. Using the “thin” column names without CF compression or data block encoding results in an HFile 5.8 Mb in size. This is an improvement from the original 6.2 Mb baseline. It doesn’t seem like much, but it’s still a 6.5% reduction in the eventual wire-transfer footprint.


Applying Snappy compression can reduce the HFile size further:


Case #3: What if we re-designed the rowkey (and left the column names alone)?

In this example, what if we only redesigned the rowkey? Using the “thin” rowkey results in an HFile of 4.9 Mb, down from the 6.2 Mb baseline: a 21% reduction. Not a small savings!


Applying Snappy compression can reduce the HFile size further:


However, note that the resulting HFile size with Snappy and no data block encoding (1.7 Mb) is very similar in size to the baseline approach (i.e., fat rowkeys, fat column-names) with Snappy and no data block encoding (1.8 Mb). Why? The CF compression can compensate on disk for a lot of bloat in rowkeys and column names.

Case #4: What if we re-designed both the rowkey and the column names?

By this time we’ve learned enough HBase to know that we need to have efficient rowkeys and column-names. This produces an HFile that is 4.4 Mb, a 29% savings over the baseline of 6.2 Mb.

Applying Snappy compression can reduce the HFile size further:


Again, the on-disk footprint with compression isn’t radically different from the others, as compression can compensate to a large degree for rowkey and column name bloat.

Case #5: KeyValue Storage Approach (e.g., 1 KV vs. KV-per-Column)

What if we did something radical and changed how we stored the data in HBase? With this approach, we are using a single KeyValue per row holding all of the columns of data for the row instead of a KeyValue per column (the traditional way).

The resulting HFile, even uncompressed and without Data Block Encoding, is radically smaller at 1.4 Mb compared to 6.2 Mb.


Adding Snappy compression and Data Block Encoding makes the resulting HFile size even smaller.


Compare the 1.1 Mb Snappy without encoding to the 1.7 Mb Snappy-encoded Thin rowkey/Thin column-name case.


Although Compression and Data Block Encoding can wallpaper over bad rowkey and column-name decisions in terms of HFile size, you will pay the price for this in terms of data transfer from RegionServer to Client. Also, concealing the size penalty brings with it a performance penalty each time the data is accessed or manipulated. So, the old advice about correctly designing rowkeys and column names still holds.

In terms of KeyValue approach, having a single KeyValue per row presents significant savings both in terms of data transfer (RegionServer to Client) and HFile size. However, there is a consequence with this approach: each row must be updated in its entirety, and old versions of rows are also stored in their entirety (i.e., as opposed to column-by-column changes). Furthermore, it is impossible to scan on select columns; the whole row must be retrieved and deserialized to access any information stored in the row. The importance of understanding this tradeoff cannot be over-stated, and it must be evaluated on an application-by-application basis.

Software engineering is an art of managing tradeoffs, so there isn’t necessarily one “best” answer. Importantly, this experiment only measures the file size and not the time or processor load penalties imposed by the use of compression, encoding, or Avro. The results generated in this test are still based on certain assumptions and your mileage may vary.

Here is the data if interested: http://people.apache.org/~dmeil/HBase_HFile_Size_2014_04.csv

Monday Mar 24, 2014

HBaseCon2014, the HBase Community Event, May 5th in San Francisco


By the HBaseCon Program Committee and Justin Kestelyn

HBaseCon 2014 (May 5, 2014 in San Francisco) is coming up fast.  It is shaping up to be another grand day out for the HBase community with a dense agenda spread over four tracks of case studies, dev, ops, and ecosystem.  Check out our posted agenda grid. As in previous years, the program committee favored presentations on open source, shipping technologies and descriptions of in-production systems. The schedule is chock-a-block with a wide diversity of deploy types (industries, applications, scALE!) and interesting dev and operations stories.  Come to the conference and learn from your peer HBase users, operators and developers.

The general session will lead off with a welcome by Mike Olson of Cloudera (Cloudera is the host for this HBase community conference).

Next up will be a talk by the leads of the BigTable team at Google, Avtandil Garakanidze and Carter Page.  They will present on BigTable, the technology that HBase is based on.  Bigtable is “...the world’s largest multi-purpose database, supporting 90 percent of Google’s applications around the world.”  They will give a brief overview of “...Bigtable evolution since it was originally described in an OSDI ’06 paper, its current use cases at Google, and future directions.”

Then comes our own Liyin Tang of Facebook and Lars Hofhansl of Salesforce.com.  Liyin will speak on “HydraBase: Facebook’s Highly Available and Strongly Consistent Storage Service Based on Replicated HBase Instances”. HBase powers multiple mission-critical online applications at Facebook. This talk will cover the design of HydraBase (including the replication protocol), an analysis of a failure scenario, and a plan for contributions to the HBase trunk. Lars will explain how Salesforce.com’s scalability requirements led it to HBase and the multiple use cases for HBase there today. You’ll also learn how Salesforce.com works with the HBase community, and get a detailed look into its operational environment.

We then break up into sessions where you will be able to learn about operations best practices, new features in depth, and what is being built on top of HBase. You can also hear how Pinterest builds large scale HBase apps up in AWS, HBase at Bloomberg, OPower, RocketFuel, Xiaomi, Yahoo! and others, as well as a nice little vignette on how HBase is an enabling technology at OCLC replacing all instances of Oracle powering critical services for the libraries of the world.

There will be an optional pre-conference prep for attendees who are new to HBase by Jesse Anderson (Cloudera University).  Jesse’s talk will offer a brief Cliff’s Notes-level HBase overview before the General Session that covers architecture, API, and schema design.

Register now for HBaseCon 2014 at: https://hbasecon2014.secure.mfactormeetings.com/registration/

A big shout out goes out to our sponsors – Continuuity, Hortonworks, Intel, LSI, MapR, Salesforce.com, Splice Machine, WibiData (Gold); Facebook, Pepperdata (Silver); ASF (Community); O’Reilly Media, The Hive, NoSQL Weekly (Media) — without which HBaseCon would be impossible!

Friday Jan 10, 2014

HBase Cell Security

By Andrew Purtell, HBase Committer and member of the Intel HBase Team


Apache HBase is “the Apache Hadoop database”, a horizontally scalable nonrelational datastore built on top of components offered by the Apache Hadoop ecosystem, notably Apache ZooKeeper and Apache Hadoop HDFS. Although HBase therefore offers first class Hadoop integration, and is often chosen for that reason, it has come into its own as a good choice for high scale data storage of record. HBase is often the kernel of big data infrastructure.

There are many advantages offered by HBase: it is free open source software, it offers linear and modular scalability, it has strictly consistent reads and writes, automatic and configurable sharding and automatic failover are core concerns, and it has a deliberate tolerance for operation on “commodity” hardware, which is a very significant benefit at high scale. As more organizations are faced with Big Data challenges, HBase is increasingly found in roles formerly occupied by traditional data management solutions, gaining new users with new perspectives and the requirements and challenges of new use cases. Some of those users have a strong interest in security. They may be a healthcare provider operating within a strict regulatory regime regarding the access and sharing of patient information. They might be a consumer web property in a jurisdiction with strong data privacy laws. They could be a military or government agency that must manage a strict separation between multiple levels of information classification.

Access Control Lists and Security Labels

For some time Apache HBase has offered a strong security model based on Kerberos authentication and access control lists (ACLs). When Yahoo first made a version of Hadoop capable of strong authentication available in 2009, my team and I at a former employer, a commercial computer security vendor, were very interested. We needed a scale out platform for distributed computation, but we also needed one that could provide assurance against access of information by unauthorized users or applications. Secure Hadoop was a good fit. We also found in HBase a datastore that could scale along with computation, and offered excellent Hadoop integration, except first we needed to add feature parity and integration with Secure Hadoop. Our work was contributed back to Apache as HBASE-2742 and ZOOKEEPER-938. We also contributed an ACL-based authorization engine as HBASE-3025, and the coprocessor framework upon which it was built as HBASE-2000 and others. As a result, in the Apache family of Big Data storage options, HBase was the first to offer strong authentication and access control. These features have been improved and evolved by the HBase community many times over since then.

An access control list, or ACL, is a list of permissions associated with an object. The ACL specifies which subjects (users or system processes) shall be granted access to those objects, as well as which operations are allowed. Each entry in an ACL describes a subject and the operation(s) the subject is entitled to perform. When a subject requests an operation, HBase first uses Hadoop’s strong authentication support, based on Kerberos, to verify the subject’s identity. HBase then finds the relevant ACLs, and checks the entries of those ACLs to decide whether or not the request is authorized. This is an access control model that provides a lot of assurance, and is a flexible way for describing security policy, but not one that addresses all possible needs. We can do more.
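In miniature, the check described above (authenticate the subject, find the relevant ACLs, scan their entries for the requested operation) looks like the following. This is a toy model for illustration, not the AccessController's actual code:

```java
import java.util.*;

public class AclCheck {
    enum Perm { READ, WRITE, CREATE, EXEC, ADMIN }

    // Each table's ACL maps a subject to the set of operations it may perform.
    static final Map<String, Map<String, EnumSet<Perm>>> TABLE_ACLS = new HashMap<>();

    static void grant(String table, String subject, EnumSet<Perm> perms) {
        TABLE_ACLS.computeIfAbsent(table, t -> new HashMap<>()).put(subject, perms);
    }

    // Authorization: look up the (already authenticated) subject's entry for
    // this table and check whether the requested operation is in it.
    static boolean authorized(String table, String subject, Perm op) {
        return TABLE_ACLS.getOrDefault(table, Map.of())
                         .getOrDefault(subject, EnumSet.noneOf(Perm.class))
                         .contains(op);
    }

    public static void main(String[] args) {
        grant("patients", "dr_alice", EnumSet.of(Perm.READ, Perm.WRITE));
        System.out.println(authorized("patients", "dr_alice", Perm.READ));  // true
        System.out.println(authorized("patients", "dr_alice", Perm.ADMIN)); // false
    }
}
```

Note what the model cannot express: the decision is a direct lookup of subject against object, with no way to describe the sensitivity of individual values, which is the gap the label approach below addresses.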

All values written to HBase are stored in what is known as a cell. (“Cell” is used interchangeably with “key-value” or “KeyValue”, mainly for legacy reasons.) Cells are identified by a multidimensional key: { row, column, qualifier, timestamp }. (“Column” is used interchangeably with “family” or “column family”.) The table is implicit in every key, even if not actually present, because all cells are written into columns, and every column belongs to a table. This forms a hierarchical relationship: 

table -> column family -> cell

Users can today grant individual access permissions to subjects on tables and column families. Table permissions apply to all column families and all cells within those families. Column family permissions apply to all cells within the given family. The permissions will be familiar to any DBA: R (read), W (write), C (create), X (execute), A (admin). However for various reasons, until today, cell level permissions were not supported.

Other high scale data storage options in the Apache family, notably Apache Accumulo, take a different approach. Accumulo has a data model almost identical to that of HBase, but implements a security mechanism called cell-level security. Every cell in an Accumulo store can have a label, stored effectively as part of the key, which is used to determine whether a value is visible to a given subject or not. The label is not an ACL, it is a different way of expressing security policy. An ACL says explicitly what subjects are authorized to do what. A label instead turns this on its head and describes the sensitivity of the information to a decision engine that then figures out if the subject is authorized to view data of that sensitivity based on (potentially, many) factors. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while enforcing strict separation between multiple levels of information classification. HBase users might approximate this model using ACLs, but it would be labor intensive and error prone.

New HBase Cell Security Features

Happily our team here at Intel has been busy extending HBase with cell level security features. First, contributed as HBASE-8496, HBase can now store arbitrary metadata for a cell, called tags, along with the cell. Then, as of HBASE-7662, HBase can store into and apply ACLs from cell tags, extending the current HBase ACL model down to the cell. Then, as of HBASE-7663, HBase can store visibility expressions into tags, providing cell-level security capabilities similar to Apache Accumulo, with API and shell support that will be familiar to Accumulo users. Finally, we have also contributed transparent server side encryption, as HBASE-7544, for additional assurance against accidental leakage of data at rest. We are working with the HBase community to make these features available in the next major release of HBase, 0.98.

Let’s talk a bit more now about how the HBase visibility labels and per-cell ACLs work.

HFile version 3

In order to take advantage of any cell level access control features, it will be necessary to store data in the new HFile version, 3. HFile version 3 is very similar to HFile version 2 except it also has support for persisting and retrieving cell tags, optional dictionary based compression of tag contents, and the HBASE-7544 encryption feature. Enabling HFile version 3 is as easy as adding a single configuration setting to all HBase site XML files, followed by a rolling restart. All existing HFile version 2 files will be read normally. New files will be written in the version 3 format. Although HFile version 3 will be marked as experimental throughout the HBase 0.98 release cycle, we have found it to be very stable under high stress conditions on our test clusters.
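For example, assuming the setting name is hfile.format.version (the key used in the HBase 0.98 documentation; confirm against your release's Reference Guide), the addition to hbase-site.xml might look like:

```xml
<!-- hbase-site.xml: enable HFile version 3 (key name per the 0.98 docs) -->
<property>
  <name>hfile.format.version</name>
  <value>3</value>
</property>
```

A rolling restart of the RegionServers picks up the change; existing version 2 files continue to be readable.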

HBase Visibility Labels

We have introduced a new coprocessor, the VisibilityController, which can be used on its own or in conjunction with HBase’s AccessController (responsible for ACL handling). The VisibilityController determines, based on label metadata stored in the cell tag and associated with a given subject, if the user is authorized to view the cell. The maximal set of labels granted to a user is managed by new shell commands getauths, setauths, and clearauths, and stored in a new HBase system table. Accumulo users will find the new HBase shell commands familiar.

When storing or mutating a cell, the HBase user can now add visibility expressions, using a backwards compatible extension to the HBase API. (By backwards compatible, we mean older servers will simply ignore the new cell metadata, as opposed to throwing an exception or failing.)

Mutation#setCellVisibility(new CellVisibility(String labelExpression));

The visibility expression can contain labels joined with the logical operators ‘&’, ‘|’ and ‘!’. Parentheses ‘(‘ and ‘)’ can be used to specify precedence. For example, consider the label set { confidential, secret, topsecret, probationary }, where the first three are sensitivity classifications and the last describes if an employee is probationary or not. If a cell is stored with this visibility expression:

( secret | topsecret ) & !probationary

Then any user associated with the secret or topsecret label will be able to view the cell, as long as the user is not also associated with the probationary label. Furthermore, any user only associated with the confidential label, whether probationary or not, will not see the cell or even know of its existence. Accumulo users will also find HBase visibility expressions familiar, though they provide a superset of the boolean operators.
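The evaluation semantics above can be sketched in plain Java. This is a hypothetical illustration with the example expression hard-coded as a predicate over the requesting user's label set; it is not the HBase implementation:

```java
import java.util.Set;

// Hypothetical sketch: evaluate "( secret | topsecret ) & !probationary"
// against a user's effective label set.
public class VisibilitySketch {
  static boolean visible(Set<String> userLabels) {
    boolean cleared = userLabels.contains("secret") || userLabels.contains("topsecret");
    return cleared && !userLabels.contains("probationary");
  }

  public static void main(String[] args) {
    System.out.println(visible(Set.of("secret")));                  // true
    System.out.println(visible(Set.of("secret", "probationary")));  // false
    System.out.println(visible(Set.of("confidential")));            // false
  }
}
```

Because the check reduces to membership tests over a set built once per request, evaluation stays cheap even for cells carrying complex expressions.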

We build the user’s label set in the RPC context when a request is first received by the HBase RegionServer. How users are associated with labels is pluggable. The default plugin passes through labels specified in Authorizations added to the Get or Scan. This will also be familiar to Accumulo users. 

Get#setAuthorizations(new Authorizations(String,…));

Scan#setAuthorizations(new Authorizations(String,…));

Authorizations not in the maximal set of labels granted to the user are dropped. From this point, visibility expression processing is very fast, using set operations.

In the future we envision additional plugins which may interrogate an external source when building the effective label set for a user, for example LDAP or Active Directory. Consider our earlier example. Perhaps the sensitivity classifications are attached when cells are stored into HBase, but the probationary label, determined by the user’s employment status, is provided by an external authorization service.

HBase Cell ACLs

We have extended the existing HBase ACL model to the cell level.

When storing or mutating a cell, the HBase user can now add ACLs, using a backwards compatible extension to the HBase API.

Mutation#setACL(String user, Permission perms);

Like at the table or column family level, a subject is granted permissions to the cell. Any number of permissions for any number of users (or groups using @group notation) can be added.

From then on, access checks for operations on the cell include the permissions recorded in the cell’s ACL tag, using union-of-permission semantics.
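Union-of-permission semantics can be sketched as a set union. The types and method below are hypothetical stand-ins, not the HBase internals:

```java
import java.util.EnumSet;
import java.util.List;

// Hypothetical sketch of union-of-permission semantics: the effective grant
// is the union of the permissions carried by every ACL entry that applies
// to the user, at any level.
public class AclSketch {
  enum Perm { READ, WRITE, CREATE, EXEC, ADMIN }

  static boolean authorized(Perm needed, List<EnumSet<Perm>> applicableGrants) {
    EnumSet<Perm> effective = EnumSet.noneOf(Perm.class);
    for (EnumSet<Perm> grant : applicableGrants) {
      effective.addAll(grant);  // union across table, family, and cell ACLs
    }
    return effective.contains(needed);
  }

  public static void main(String[] args) {
    EnumSet<Perm> familyAcl = EnumSet.of(Perm.READ);
    EnumSet<Perm> cellAcl = EnumSet.of(Perm.WRITE);
    System.out.println(authorized(Perm.WRITE, List.of(familyAcl, cellAcl))); // true
    System.out.println(authorized(Perm.ADMIN, List.of(familyAcl, cellAcl))); // false
  }
}
```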

First we check table or column family level permissions*. If they grant access, we can early out before going to blockcache or disk to check the cell for ACLs. Table designers and security architects can therefore optimize for the common case by granting users permissions at the table or column family level. However, if some cells do require more fine grained control, and neither table nor column family checks succeed, we will enumerate the cells covered by the operation. By “covered” we mean we ensure that every location which would be visibly modified by the operation has a cell ACL in place that grants access. We can stop at the first cell ACL that does not grant access.

For a Put, Increment, or Append we check the permissions for the most recent visible version. “Visible” means not covered by a delete tombstone. We treat the ACLs in each Put as timestamped like any other HBase value. A new ACL in a new Put applies to that Put. It doesn't change the ACL of any previous Put. This allows simple evolution of security policy over time without requiring expensive updates. To change the ACL at a specific { row, column, qualifier, timestamp } coordinate, a new value with a new ACL must be stored to that location exactly.

For Increments and Appends, we do the same thing as with Puts, except we will propagate any ACLs on the previous value unless the operation carries a new ACL explicitly.

Finally, for a Delete, we require write permissions on all cells covered by the delete. Unlike in the case of other mutations we need to check all visible prior versions, because a major compaction could remove them. If the user doesn't have permission to overwrite any of the visible versions ("visible", again, is defined as not covered by a tombstone already) then we have to disallow the operation.

* - For the sake of coherent explanation, this overlooks an additional feature. Optionally, on a per-operation basis, how cell ACLs are factored into the authorization decision can be flipped around. Instead of first checking table or column family level permissions, we enumerate the set of ACLs covered by the operation first, and only if there are no grants there do we check for table or column family permissions. This is useful for use cases where a user is not granted table or column family permissions on a table and instead the cell level ACLs provide exceptional access. The default is useful for use cases where the user is granted table or column family permissions and cell level ACLs might withhold authorization. The default is likely to perform better. Again, which strategy is used can be specified on a per-operation basis.


This blog post was republished from https://communities.intel.com/community/datastack/blog/2013/10/29/hbase-cell-security

Friday Oct 25, 2013

Phoenix Guest Blog


James Taylor 

I'm thrilled to be guest blogging today about Phoenix, a BSD-licensed open source SQL skin for HBase that powers the HBase use cases at Salesforce.com. As opposed to relying on map-reduce to execute SQL queries as other similar projects do, Phoenix compiles your SQL query into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. On top of that, you can add secondary indexes to your tables to transform what would normally be a full table scan into a series of point gets (and we all know how well HBase performs with those).

Getting Started

To get started using Phoenix, follow these directions:

  • Download and expand the latest phoenix-2.1.0-install.tar from the download page
  • Add the phoenix-2.1.0.jar to the classpath of every HBase region server. An easy way to do this is to copy it into the HBase lib directory.
  • Restart all region servers
  • Start our terminal interface that is bundled with our distribution against your cluster:
          $ bin/sqlline.sh localhost
  • Create a Phoenix table:
          create table stock_price(ticker varchar(6), price decimal, date date);
  • Load some data either through one of our CSV loading tools, or by mapping an existing HBase table to Phoenix table.
  • Run some queries:
          select * from stock_price;
          select * from stock_price where ticker='CRM';
          select ticker, avg(price) from stock_price 
              where date between current_date()-30 and current_date() 
              group by ticker;


Take a look at our recent announcement for our release which includes support for secondary indexing, peruse our FAQs and join our mailing list.

Tuesday Oct 22, 2013

HBase 0.96.0 Released

by Stack, the 0.96.0 Release Manager

Here are some notes on our recent hbase-0.96.0 release (For the complete list of over 2k issues addressed in 0.96.0, see Apache JIRA Release Notes).

hbase-0.96.0 was more than a year in the making.   It was heralded by three developer releases -- 0.95.0, 0.95.1 and 0.95.2 -- and it went through six release candidates before we arrived at our final assembly released on Friday, October 18th, 2013.

The big themes that drove this release, gleaned from a rough survey of users and from our experience with HBase deploys, were:

  • Improved stability: A new suite of integration cluster tests (HBASE-6241, HBASE-6201), configurable by node count, data sizing, duration, and “chaos” quotient, turned up loads of bugs around assignment and data views when scanning and fetching.  These we fixed in hbase-0.96.0.  Table locks added for cross-cluster alterations, and cross-row transaction support enabled on our system tables by default, now have us giving a wide berth to whole classes of problematic states.

  • Scaling: HBase is being deployed on larger clusters.  The way we kept schema in the filesystem, or archived WAL files one at a time once done replicating, worked fine on clusters of hundreds of nodes but made for significant friction when we moved up to the next scaling level.

  • Mean Time To Recovery (MTTR): A sustained effort in HBase and in our substrate, HDFS, narrowed the amount of time data is offline after node outage.

  • Operability: Many new tools were added to help operators of HBase clusters: from a radical redo of the metrics emissions, through a new UI and exposed hooks for health scripts.  It is now possible to trace lagging calls down through the HBase stack (HBASE-9121 Update HTrace to 2.00 and add new example usage) to figure where time is spent and soon, through HDFS itself, with support for pretty visualizations in Twitter Zipkin (See HBase Tracing from a recent meetup).

  • Freedom to Evolve: We redid how we persisted everywhere, whether in the filesystem or up into zookeeper, but also how we carry queries and data back and forth over RPC.  Where serialization was hand-crafted when we were on Hadoop Writables, we now use generated Google protobufs.  Standardizing serialization on protobufs, with well-defined schemas, will make it easier to evolve versions of the client and servers independently of each other in a compatible manner, without having to take a cluster restart, going forward.

  • Support for hadoop1 and hadoop2: hbase-0.96.0 will run on either.  We do not ship a universal binary.  Rather, you must pick your poison: hbase-0.96.0-hadoop1 or hbase-0.96.0-hadoop2 (differences in APIs between the two versions of Hadoop forced this delivery format).  hadoop2 is far superior to hadoop1 so we encourage you to move to it.  hadoop2 has improvements that make HBase operation run smoother and facilitate better performance -- e.g. secure short-circuit reads -- as well as fixes that help our MTTR story.

  • Minimal disturbance to the API: Downstream projects should just work.  The API has been cleaned up and divided into user vs developer APIs and all has been annotated using Hadoop’s system for denoting APIs stable, evolving, or private.  That said, a load of work was invested making it so APIs were retained. Radical changes in API that were present in the last developer release were undone in late release candidates because of downstreamer feedback.

Below we dig in on a few of the themes and features shipped in 0.96.0.

Mean Time To Recovery

HBase guarantees a consistent view by having a single server at a time solely responsible for data. If this server crashes, data is ‘offline’ until another server assumes responsibility.  When we talk of improving Mean Time To Recovery in HBase, we mean narrowing the time during which data is offline after a node crash.  This offline period is made up of phases: a detection phase, a repair phase, reassignment, and finally, clients noticing the data available in its new location.  A fleet of fixes to shrink all of these distinct phases have gone into hbase-0.96.0.

In the detection phase, the default zookeeper session period has been shrunk and a sample watcher script will intercede on server outage and delete the regionserver’s ephemeral node so the master notices the crashed server missing sooner (HBASE-5844 Delete the region servers znode after a regions server crash). The same goes for the master (HBASE-5926 Delete the master znode after a master crash). At repair time, a running tally makes it so we replay fewer edits, cutting replay time (HBASE-6659 Port HBASE-6508 Filter out edits at log split time).  A new replay mechanism has also been added (HBASE-7006 Distributed log replay -- disabled by default) that speeds recovery by skipping having to persist intermediate files in HDFS.  The HBase system table now has its own dedicated WAL, so this critical table can come back before all others (See HBASE-7213 / HBASE-8631).  Assignment all around has been sped up by bulking up operations, removing synchronizations, and multi-threading so operations can run in parallel.

In HDFS, a new notion of ‘staleness’ was introduced (HDFS-3703, HDFS-3712).  On recovery, the namenode will avoid including stale datanodes, saving us from having to first time out against dead nodes before we can make progress (Related, HBase avoids writing a local replica when writing the WAL, instead writing all replicas to remote nodes on the cluster; the replica that was on the dead datanode is of no use come recovery time. See HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes).  Other fixes, such as HDFS-4721 Speed up lease/block recovery when DN fails and a block goes into recovery, shorten the time involved assuming ownership of the last, unclosed WAL on server crash.

And there is more to come inside the 0.96.x timeline: e.g. bringing regions online immediately for writes, retained locality when regions come up on the new server because we write replicas using the ‘Favored Nodes’ feature, etc.

Be sure to visit the Reference Guide for configurations that enable and tighten MTTR; for instance, ‘staleness’ detection in HDFS needs to be enabled on the HDFS-side.  See the Reference Guide for how.

HBASE-5305 Improve cross-version compatibility & upgradeability

Freedom to Evolve

Rather than continue to hand-write serializations, as is required when using Hadoop Writables -- our serialization mechanism up through hbase-0.94.x -- in hbase-0.96.0 we moved the whole shebang over to protobufs.  Everywhere HBase persists, we now use protobuf serializations: when writing zookeeper znodes, when writing files in HDFS, and whenever we send data over the wire when RPC’ing.

Protobufs support evolving types, if we are careful, making it so we can amend interfaces in a compatible way going forward -- a freedom we were sorely missing, or to be more precise, one that was painful to exercise, when all serialization was hand-written Hadoop Writables.  This change breaks compatibility with previous versions.

Our RPC is also now described using protobuf Service definitions.  Generated stubs are hooked up to a derivative, stripped down version of the Hadoop RPC transport.  Our RPC now has a specification.  See the Appendix in the Reference Guide.

HBASE-8015 Support for Namespaces

Our brothers and sisters over at Yahoo! contributed table namespaces, a means of grouping tables similar to mysql’s notion of database, so they can better manage their multi-tenant deploys.  To follow in short order will be quota, resource allocation, and security all by namespace.

HBASE-4050 Rationalize metrics, metric2 framework implementation

New metrics have been added and the whole plethora given a radical edit, better categorization, naming and typing; patterns were enforced so the myriad metrics are navigable and look pretty up in JMX.  Metrics have been moved up on to the Hadoop 2 Metrics 2 Interfaces.  See Migration to the New Metrics Hotness – Metrics2 for detail.

New Region Balancer

A new balancer, using an algorithm similar to simulated annealing or greedy hill-climbing, factors in not only region count (the only attribute considered by the old balancer) but also region read/write load and locality, among other attributes, when coming up with a balance decision.


Cell Interface

In hbase-0.96.0, we began work on a long-time effort to move off of our base KeyValue type and move instead to use a Cell Interface throughout the system.  The intent is to open up the way to try different implementations of the base type: different encodings, compressions, and layouts of content to better align with how the machine works. The move, though not yet complete, has already yielded performance gains.  The Cell Interface shows through in our hbase-0.96.0 API with the KeyValue references deprecated in 0.96.0.  All further work should be internal-only and transparent to the user.

Incompatible changes


You will need to restart your cluster to come up on hbase-0.96.0.  After deploying the binaries, run a checker script that will look for the existence of old format HFiles no longer supported in hbase-0.96.0.  The script will warn you of their presence and will ask you to compact them away.  This can be done without disturbing current serving.  Once all have been purged, stop your cluster, and run a small migration script. The HBase migration script will upgrade the content of zookeeper and rearrange the content of the filesystem to support the new table namespaces feature.  The migration should take a few minutes at most.  Restart.  See Upgrading from 0.94.x to 0.96.x for details.
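The sequence might look like the following. The `hbase upgrade` sub-commands shown here are from the 0.96 upgrade documentation; verify the exact invocations against the Upgrading guide for your release:

```shell
$ bin/hbase upgrade -check    # scan for old-format HFiles (cluster can stay up)
# ... major compact away any flagged files, then stop the cluster ...
$ bin/hbase upgrade -execute  # migrate zookeeper content and filesystem layout
# ... restart the cluster on the 0.96.0 binaries ...
```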

From here on out, 0.96.x point releases with bug fixes only will start showing up on a roughly monthly basis after the model established in our hbase-0.94 line.  hbase-0.98.0, our next major version, is scheduled to follow in short order (months).  You will be able to do a rolling restart off 0.96.x and up onto 0.98.0.  Guaranteed.

A big thanks goes out to all who helped make hbase-0.96.0 possible.

This release is dedicated to Shaneal Manek, HBase contributor.

Download your hbase-0.96.0 here.

Thursday Aug 08, 2013

Taking the Bait

Lars Hofhansl, Andrew Purtell, and Michael Stack

HBase Committers

Information Week recently published an article titled “Will HBase Dominate NoSQL?”. Michael Hausenblas of MapR argues the ‘For’ HBase case and Jonathan Ellis of Apache Cassandra and vendor DataStax argues ‘Against’.

It is easy to dismiss this 'debate' as vendor sales talk and just get back to using and improving Apache HBase, but this article is a particularly troubling example. Here in both the ‘For’ and ‘Against’ arguments slight is being cast on the work of the HBase community. Here are some notes by way of redress:

First, Michael argues Hadoop is growing fast and because HBase came out of Hadoop and is tightly associated, ergo, HBase is on the up and up.  It is easy to make this assumption if you are not an active participant in the HBase community (We have also come across the inverse where HBase is driving Hadoop adoption). Michael then switches to a tired HBase versus Cassandra bake off, rehashing the old consistency versus eventual-consistency wars, ultimately concluding that HBase is ‘better’ simply because Facebook, the Cassandra wellspring, dropped it to use HBase instead as the datastore for some of their large apps. We would not make that kind of argument. Whether or not one should utilize Apache HBase or Apache Cassandra depends on a number of factors and is, like with most matters of scale, too involved a discussion for sound bites.

Then Michael does a bait-and-switch where he says “..we’ve created a ‘next version’ of enterprise HBase.... We brought it into GA under the label M7 in May 2013”.  The ‘We’ in the quote does not refer to the Apache HBase community but to Michael’s employer, MapR Technologies and the ‘enterprise HBase’ he is talking of is not Apache HBase.  M7 is a proprietary product that, to the best of our knowledge, is fundamentally different architecturally from Apache HBase. We cannot say more because of the closed source nature of the product. This strikes us as an attempt to attach the credit and good will that the Apache HBase community have all built up over the years through hard work and contributions to a commercial closed source product that is NOT Apache HBase.

Now let us address Jonathan’s “Against” argument.  Some of Jonathan’s claims are no longer true: “RegionServer failover takes 10 to 15 minutes” (see HDFS-3703 and HDFS-3912), or highly subjective: “Developing against HBase is painful.”  (In our opinion, our client API is simpler and easier to use than the commonly used Cassandra client libraries.) Otherwise, we find nothing here that has not been hashed and rehashed out over the years in forums ranging from mailing lists to Quora. Jonathan is mostly listing out provenance: what comes of our being tightly coupled to Apache Hadoop’s HDFS filesystem and our tracing the outline of the Google BigTable Architecture. HBase derives many benefits from HDFS and inspiration from BigTable. As a result, we excel at some use cases that would be problematic for Cassandra. The reverse is also true. Neither we nor HDFS are standing still as the hardware used by our users evolves, or as new users bring experience from new use cases.

He highlights a difference in approach.

We see our adoption of Apache HDFS, our integration with Apache Hadoop MapReduce, and our use of Apache ZooKeeper for coordination as a sound separation of concerns. HBase is built on battle tested components. This is a feature, not a bug.

Where Jonathan might see a built in query language and secondary indexing facility as necessary complications to core Cassandra, we encourage and support projects like Salesforce’s Phoenix as part of a larger ecosystem centered around Apache HBase. The Phoenix guys are able to bring domain expertise to that part of the problem while we (HBase) can focus on providing a stable and performant storage engine core. Part of what has made Apache Hadoop so successful is its ecosystem of supporting and enriching projects - an ecosystem that includes HBase. An ecosystem like that is developing around HBase.

When Jonathan veers off to talk of the HBase community being “fragmented” with divided “[l]eadership”, we think perhaps what is being referred to is the fact that the Apache HBase project is not an “owned” project, a project led by a single vendor.  Rather it is a project where multiple teams from many organizations - Cloudera, Hortonworks, Salesforce, Facebook, Yahoo, Intel, Twitter, and Taobao, to name a few - are all pushing the project forward in collaboration. Most of the Apache HBase community participates in shared effort on two branches - what is about to be our new 0.96 release, and on another branch for our current stable release, 0.94. Facebook also maintains their own branch of Apache HBase in our shared source control repository. This branch has been a source of inspiration and occasional direct contribution to other branches. We make no apologies for finding a convenient and effective collaborative model according to the needs and wishes of our community. To other projects driven by a single vendor this may seem suboptimal or even chaotic (“fragmented”).

We’ll leave it at that.

As always we welcome your participation in and contributions to our community and the Apache HBase project. We have a great product and a no-ego low-friction decision making process not beholden to any particular commercial concern. If you have an itch that needs scratching and wonder if Apache HBase is a solution, or, if you are using Apache HBase but feel it could work better for you, come talk to us.

Monday Jul 29, 2013

Data Types != Schema

by Nick Dimiduk 

My work on adding data types to HBase has come along far enough that ambiguities in the conversation are finally starting to shake out. These were issues I’d hoped to address through initial design documentation and a draft specification. Unfortunately, it’s not until there’s real code implemented that the finer points are addressed in concrete. I’d like to take a step back from the code for a moment to initiate the conversation again and hopefully clarify some points about how I’ve approached this new feature.

If you came here looking for code, you’re out of luck. Go check out the parent ticket, HBASE-8089. For those who don’t care about my personal experiences with information theory, skip down to the TL;DR section at the end. You might also be satisfied by the slides I presented on this same topic at the Hadoop Summit HBase session in June.

A database by any other name

“HBase is a database.” This is the very first statement in HBase in Action. Amandeep and I debated, both between ourselves and with a few of our confidants, as to whether we should make that statement. The concern in my mind was never over the validity of the claim, but rather how it would be interpreted. “Database” has come to encompass a great deal of technologies and features, many of those features HBase doesn’t (yet) support. The confusion is worsened by the recent popularity of non-relational databases under the umbrella title NoSQL, a term which itself is confused [1]. In this post, I hope to tease apart some of these related ideas.

My experience with data persistence systems started with my first position out of university. I worked for a small company whose product, at its core, was a hierarchical database. That database had a very bare concept of tables and there was no SQL interface. Its primary construct was a hierarchy of nodes and its query engine was very good at traversing that hierarchy. The hierarchy was also all it exposed to its consumers, and querying the database was semantically equivalent to walking that hierarchy and executing functions on an individual node and its children. The only way to communicate with it was via a Java API, or later, the C++ interface. For a very long time, the only data type it could persist was a C-style char[]. Yet a client could connect to the database server over the network, persist data into it, and issue queries to retrieve previously persisted data and transformed versions thereof. It didn’t support SQL, it only spoke in Strings, but it was a database.

Under the hood, this data storage system used an open source, embedded database library with which you’re very likely familiar. The API for that database exposed a linear sequence of pages allocated from disk. Each page held a byte[]. You can think of the whole database as a persisted byte[][]. Queries against that database involved requesting a specific page by its ID and it returned to you the raw block of data that resided there. Our database engine delegated persistence responsibilities to that system, using it to manage its own concepts and data structures in a format that could be serialized to bytes on disk. Indeed, that embedded database library delegated much of its own persistence responsibilities to the operating system’s filesystem implementation.

In common usage, the word “database” tends to be shorthand for Relational Database Management System. Neither the hierarchical database, the embedded database, nor the filesystem qualify by this definition. Yet all three persist and retrieve data according to well defined semantics. HBase is also not a Relational Database Management System, but it persists and retrieves data according to well defined semantics. HBase is a database.

Data management as a continuum

Please bear with me as I wander blindly into the world of information theory.

I think of data management as a continuum. On the one extreme, we have the raw physical substrate on which information is expressed as matter. On the far other extreme is our ability to reason and form understandings about a catalog of knowledge. In computer systems, we narrow the scope of that continuum, but it’s still a range from the physical allocation of bits to a structure that can be interpreted by humans.

physical bits                             meaning
     |                                       |
     |                                       |

A database provides an interface over which the persisted data is available for interaction. Those interacting with it are generally technical humans and systems that expose that data to non-technical humans through applications. The humans’ goal is primarily to derive meaning from that persisted data. The RDBMS exposes an interface that’s a lot closer to the human level than HBase or my other examples. That interface is brought closer in large part because the RDBMS includes a system of metadata description and management called a schema. Exposing a schema, a way to describe the data physically persisted, acts as a bridge from database to non-technical humans. It allows the human to describe the information they want to persist in a way that has meaning to both the human and the database.

A schema is metadata. It’s a description of the shape of the data and also provides hints to its intended meaning. Computer systems represent data as sequences of binary data. The schema helps us make sense of those digits. A schema can tell us that 0x01c7c6 represents the numeric value 99.99 which means “the average price of your monthly mobile phone bill.”

In addition to providing data management tools, most RDBMSs provide schema management tools. Managing schema is just as important as managing the data itself. I say that because without schema, how can I begin to understand what this collection of data means? As knowledge and needs change, so too does data. Just as the data management tools provide a method for changing data values, the schema management tools provide a method for tracking the change in meaning of the data.

From here to there

A database does not get a schema for free. In order to describe meaning of persisted data, a schema needs a few building block concepts. Relational systems derive their name from the system of relational algebra upon which they describe their data and its access. A table contains records that all conform to a particular shape and that shape is described by a sequence of labeled columns. Tables often represent something specific and the columns describe attributes of that something. As humans, we often find it helpful to describe the domain of valid values an attribute can take. 99.99 makes sense for the average price described above, while hello world does not. A layer of abstraction is introduced, and we might describe the range of valid values for this average price as a numeric value representing a unit of currency with up to two decimals of precision between the range of 0.00 and 9999.99. We describe that aspect of the schema as a data type.

The “currency” data type we just defined allows us to be more specific about the meaning of the attribute in our schema. Better still, if we can describe the data type to our database, we can let it monitor and constrain the accepted values used in that attribute. That’s helpful for humans because the computer is probably better at managing these constraints than we are. It’s also helpful for the database because it no longer needs to store “any of the things” in this attribute. Instead, it only must store any of the valid values of this data type. That allows it to optimize the way it stores those values and potentially provide other meaningful operations on those values. With a data type defined, it opens the database to be able to answer queries about the data and ranges of data instead of just persisting and retrieving values. “What’s the lowest average price?” “What’s the highest?” “By how much do the average prices deviate?”

The filesystem upon which my hierarchical database sat couldn’t answer questions like those.

Data types bridge the gap between persistence layers and schema, allowing the database to share in the responsibility of value constraint management and allowing it to do more than just persist values. But data types are only half of the story. Just because I’ve declared an attribute to be of type “numeric” doesn’t mean the database can persist it. A data type can be implemented that honors the constraints upon numerical values, but there’s still another step between my value and the sequence of binary digits. That step is the encoding.

An encoding is a way to represent a value in binary digits. The simplest encoding for integer values is the representation of that number in base-2; this is a literal representation in binary digits. Encodings come with limitations, though, and this one is no exception. In this case, it provides no means to represent a negative integer value. The 2’s complement encoding has the advantage of being able to represent negative integers. It also enjoys the convenience that most arithmetic operations on values in this encoding behave naturally.

Binary coded decimal is another encoding for integer values. It has different properties and different advantages and disadvantages than 2’s complement. Both are equally valid ways to represent integers as a sequence of binary data. Thus an integer data type, honoring all the constraints of integer values, can be encoded in multiple ways. Continuing the example, just like there are multiple valid relational schema designs to derive meaning over a data set of mobile subscribers, so too are there multiple valid encodings to represent an integer value [2].
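The two encodings can be compared side by side. Below is a sketch of both: big-endian 2’s complement (Java’s native int layout) and a simplified packed BCD that handles non-negative values only. The class and method names are mine, for illustration.

```java
public class IntEncodings {
    // Big-endian 2's complement: Java's native int representation.
    static byte[] twosComplement(int v) {
        return new byte[] {
            (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
        };
    }

    // Packed binary-coded decimal: one decimal digit per nibble.
    // Simplified sketch: non-negative values only.
    static byte[] bcd(int v) {
        String digits = Integer.toString(v);
        if (digits.length() % 2 == 1) digits = "0" + digits;
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = digits.charAt(2 * i) - '0';
            int lo = digits.charAt(2 * i + 1) - '0';
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // 42 in 2's complement ends in 0x2a; in BCD it reads as 0x42.
        System.out.printf("%02x%n", twosComplement(42)[3]); // 2a
        System.out.printf("%02x%n", bcd(42)[0]);            // 42
    }
}
```

Same integer, two valid byte sequences: the data type is one thing, its encoding another.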

Data types for HBase

Thus far in its lifetime, HBase has provided data persistence. It does so in a rather unique way as compared to other databases. That method of persistence influences the semantics it exposes around persisting and retrieving the data it stores. To date, those semantics have exposed a very simple logical data model, that of a sorted, nested map of maps. That data model is heavily influenced by the physical data model of the database implementation.

Technically this data model is a schema because it defines a logical structure for data, complete with a data type. However, this model is very rudimentary as schemas go. It provides very few facilities for mapping application-level meaning to physical layout. The only data type this logical data model exposes is the humble byte[] and its encoding is a simple no-op [3].

While the byte[] is strictly sufficient, it’s not particularly convenient for application developers. I don’t want to think about my average subscription price as a byte[], but rather as a value conforming to the numeric type described earlier. HBase requires that I accept the burden of both data type constraint maintenance and data value encoding into my application.

HBase does provide a number of data encodings for the Java language’s primitive types. These encodings are implemented in the toXXX methods on the Bytes class. These methods transform the Java types into byte[] and back again. The trouble is that most of them don’t preserve the sort order of the values they represent. This is a problem.

HBase’s semantics of a sorted map of maps is extremely important in designing table layouts for applications. The sort order influences physical layout of data on disk, which has direct impact on data access latency. The practice of HBase “schema design” is the task of laying out your data physically so as to minimize the latency of the access patterns that are important to your application. A major aspect of that is in designing a rowkey that orders well for the application. Because the default encodings do not always honor the natural sorting of the values they represent, it can become difficult to reason about application performance. Application developers are left to devise their own encoding systems that honor the natural sorting of any data types they wish to use.

Doing more for application developers

In HBASE-8089, I proposed that we expand the set of data types that HBase exposes to include a number of new members. The idea being that those other types make it easier for developers to build applications. It includes an initial list of suggested data types and some proposals about how they might be implemented, hinting toward considerations of order-preserving encodings.

HBASE-8201 defines a new utility class for data encodings called OrderedBytes. The encodings implemented there are designed primarily to produce byte[]s that preserve the natural sort order of the values they represent. They are also implemented in such a way as to be self-identifying. Meaning, an encoded value can be inspected to determine which encoding it represents. This last feature makes the encoding scheme at least rudimentarily machine readable and is particularly valuable, in my opinion. It enables reader tools (a raw data inspector, for example) to be encoding aware even in the absence of knowledge about schema or data types.
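One classic order-preserving trick, shown below, is to flip the sign bit before writing the big-endian bytes, which maps the signed range monotonically onto the unsigned range. To be clear, this is only an illustration of the idea; the actual OrderedBytes format is richer (it also prepends self-identifying header bytes, among other things).

```java
import java.nio.ByteBuffer;

public class OrderPreserving {
    // Flipping the sign bit maps Integer.MIN_VALUE..MAX_VALUE onto
    // 0x00000000..0xffffffff, so unsigned byte order matches signed order.
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    // Unsigned lexicographic comparison, as HBase orders rowkeys.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // With the sign bit flipped, -1 now sorts BEFORE 1, as it should.
        System.out.println(compareUnsigned(encode(-1), encode(1)) < 0); // true
    }
}
```

With an encoding like this, rowkeys built from signed integers scan in their natural numeric order.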

HBASE-8693 advocates an extensible data type API, so that application developers can easily introduce new data types. This allows HBase applications to implement their own data types that the HBase community hasn’t thought of or does not think are appropriate to ship with the core system. The working implementation of that DataType API and a number of pre-supported data types are provided. Those data types are built on the two codecs, Bytes and OrderedBytes. That means application developers will have access to basic constructs like number and Strings in addition to byte[]. It also means that sophisticated users can develop highly specialized data types for use as the foundation of extensions to HBase. My hope is this will make extension efforts on par with PostGIS possible in HBase.
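To give a feel for what such an API implies, here is a hypothetical sketch of a pluggable data type interface; the interface shape and names are illustrative, not the actual HBASE-8693 API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an extensible data type API in the spirit of
// HBASE-8693. The method names here are illustrative only.
interface DataType<T> {
    byte[] encode(T value);       // value -> byte[]
    T decode(byte[] bytes);       // byte[] -> value
    boolean isOrderPreserving();  // does encoded sort match logical sort?
}

// A sample user-defined type: strings as UTF-8 bytes. UTF-8 byte order
// matches Unicode code point order, so the encoding is order-preserving
// in that sense.
class Utf8String implements DataType<String> {
    public byte[] encode(String v) {
        return v.getBytes(StandardCharsets.UTF_8);
    }
    public String decode(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }
    public boolean isOrderPreserving() { return true; }
}
```

An application could register types like this alongside the shipped ones, keeping constraint logic and encoding behind one interface instead of scattered through application code.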

Please take note that nothing here approaches the topic of schema or schema management. My personal opinion is that not enough complex applications have been written against HBase for the project to ship with such a system out of the box. Two notable efforts which are imposing schema onto HBase are Phoenix and Kiji. The former seeks to translate a subset of the relational model onto HBase. The latter is devising its own solution, presumably modeled after its authors’ own experiences. In both cases, I hope these projects can benefit from HBase providing some additional data encodings and an API for user-extensible data types.


It’s an exciting time to be building large, data-driven applications. We enjoy a wealth of new tools, not just HBase, to make that easier than ever before. Yet those same tools that make these things possible are still in the infant stages of usability. Hopefully these efforts will move the conversation forward. Please take a moment to review these tickets. Let us know what data types we haven’t thought about and what encoding schemes you fancy. Poke holes in the data type extension API and provide counter examples that you can’t implement due to lack of expression. Take this opportunity to customize your tools to better fit your own hands.


[1] The term NoSQL is used to reference pretty much anything that stores data and didn’t exist a decade ago. This includes quite an array of data tools. Amandeep and I studied and summarized the landscape in a short body of work that was cut from the book. Key-value stores, graph databases, in-memory stores, object stores, and hybrids of the above all make the cut. About all they can agree on is they don’t like using the relational model to describe their data.

[2] For more fun, check out the ZigZag encoding described in the documentation of Google’s protobuf.

[3] That’s almost true. HBase also provides a uint64 data type exposed through the Increment API.
