Entries tagged [hadoop]

Saturday June 06, 2015

Saving CPU! Using Native Hadoop Libraries for CRC computation in HBase

by Apekshit Sharma, HBase contributor and Cloudera Engineer

TL;DR Use Hadoop Native Library calculating CRC and save CPU!

Checksums in HBase

Checksum are used to check data integrity. HDFS computes and stores checksums for all files on write. One checksum is written per chunk of data (size can be configured using bytes.per.checksum) in a separate, companion checksum file. When data is read back, the file with corresponding checksums is read back as well and is used to ensure data integrity. However, having two files results in two disk seeks reading any chunk of data. For HBase, the extra seek while reading HFileBlock results in extra latency. To work around the extra seek, HBase inlines checksums. HBase calculates checksums for the data in a HFileBlock and appends them to the end of the block itself on write to HDFS (HDFS then checksums the HBase data+inline checksums). On read, by default HDFS checksum verification is turned off, and HBase itself verifies data integrity.

Can we then get rid of HDFS checksum altogether? Unfortunately no. While HBase can detect corruptions, it can’t fix them, whereas HDFS uses replication and a background process to detect and *fix* data corruptions if and when they happen. Since HDFS checksums generated at write-time are also available, we fall back to them when HBase verification fails for any reason. If the HDFS check fails too, the data is reported as corrupt.

The related hbase configurations are hbase.hstore.checksum.algorithm, hbase.hstore.bytes.per.checksum and hbase.regionserver.checksum.verify. HBase inline checksums are enabled by default.

Calculating checksums is computationally expensive and requires lots of CPU. When HDFS switched over to JNI + C for computing checksums, they witnessed big gains in CPU usage.

This post is about replicating those gains in HBase by using Native Hadoop Libraries (NHL). See HBASE-11927


We switched to use the Hadoop DataChecksum library which under-the-hood uses NHL if available, else we fall back to use the Java CRC implementation. Another alternative considered was the ‘Circe’ library. The following table highlights the differences with NHL and makes the reasoning for our choice clear.

Hadoop Native Library


Native code supports both crc32 and crc32c

Native code supports only crc32c

Adds dependency on hadoop-common which is reliable and actively developed

Adds dependency on external project

Interface supports taking in stream of data, stream of checksums, chunk size as parameters and compute/verify checksums  considering data in chunks.

Only supports calculation of single checksum for all input data.

Both libraries supported use of the special x86 instruction for hardware calculation of CRC32C if available (defined in SSE4.2 instruction set). In the case of NHL, hadoop-2.6.0 or newer version is required for HBase to get the native checksum benefit.

However, based on the data layout of HFileBlock, which has ‘real data’ followed by checksums on the end, only NHL supported the interface we wanted. Implementing the same in Circe would have been significant effort. So we chose to go with NHL.


Since the metric to be evaluated was CPU usage, a simple configuration of two nodes was used. Node1 was configured to be the NameNode, Zookeeper and HBase master. Node2 was configured to be DataNode and RegionServer. All real computational work was done on Node2 while Node1 remained idle most of the time. This isolation of work on a single node made it easier to measure impact on CPU usage.


Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-24-generic x86_64)

CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz                                          

Socket(s) : 1

Core(s) per socket : 1

Thread(s) per core : 4

Logical CPU(s) : 4

Number of disks : 1

Memory : 8 GB

HBase Version/Distro *: 1.0.0 / CDH 5.4.0

*Since trunk resolves to hadoop-2.5.1 which does not have HDFS-6865, it was easier to use a CDH distro which already has HDFS-6865 backported.


We chose to study the impact on major compactions, mainly because of the presence of CompactionTool which a) can be used offline, b) allowed us to profile only the relevant workload. PerformanceEvaluation (bin/hbase pe) was used to build a test table which was then copied to local disk for reuse.

% ./bin/hbase pe --nomapred --rows=150000 --table="t1" --valueSize=10240 --presplit=10 sequentialWrite 10

Table size: 14.4G

Number of rows: 1.5M

Number of regions: 10

Row size: 10K

Total store files across regions: 67

For profiling, Lightweight-java-profiler was used and FlameGraph was used to generate graphs.

For benchmarking, the linux ‘time’ command was used. Profiling was disabled during these runs. A script repeatedly executed following in order:

  1. delete hdfs:///hbase

  2. copy t1 from local disk hdfs:///hbase/data/default

  3. run compaction tool on t1 and time it



CPU profiling of HBase not using NHL (figure 1) shows that about 22% cpu is used for generating and validating checksums, whereas, while using NHL (figure 2) it takes only about 3%.

Screen Shot 2015-06-02 at 8.14.07 PM.png

Figure 1: CPU profile - HBase not using NHL (svg)

Screen Shot 2015-06-02 at 8.03.39 PM.png

Figure 2: CPU profile - HBase using NHL (svg)


Benchmarking was done for three different cases: (a) neither HBase nor HDFS use NHL, (b) HDFS uses NHL but not HBase, and (c) both HDFS and HBase use NHL. For each case, we did 5 runs. Observations from table 1:

  1. Within a case, while real time fluctuates across runs, user and sys times remain same. This is expected as compactions are IO bound.

  2. Using NHL only for HDFS reduces CPU usage by about 10% (A vs B)

  3. Further, using NHL for HBase checksums reduces CPU usage by about 23% (B vs C).

All times are in seconds. This stackoverflow answer provides a good explaination of real, user and sys times.

run #

no native for HDFS and HBase (A)

no native for HBase (B)

native (C)


real      469.4

user 110.8

sys 30.5

real    422.9

user    95.4

sys     30.5

real 414.6

user 67.5

sys 30.6


real 384.3

user 111.4

sys 30.4

real 400.5

user 96.7

sys 30.5

real 393.8

user     67.6

sys 30.6


real 400.7

user 111.5

sys 30.6

real 398.6

user 95.8

sys 30.6

real    392.0

user    66.9

sys     30.5


real 396.8

user 111.1

sys 30.3

real 379.5

user 96.0

sys 30.4

real    390.8

user    67.2

sys     30.5


real 389.1

user 111.6

sys 30.3

real 377.4

user 96.5

sys 30.4

real    381.3

user    67.6

sys     30.5

Table 1



Native Hadoop Library leverages the special processor instruction (if available) that does pipelining and other low level optimizations when performing CRC calculations. Using NHL in HBase for heavy checksum computation, allows HBase make use of this facility, saving significant amounts of CPU time checksumming.

Wednesday May 01, 2013

Migration to the New Metrics Hotness – Metrics2

by Elliott Clark

HBase Committer and Cloudera Engineer 

NOTE: This blog post describes the server code of HBase. It assumes a general knowledge of the system. You can read more about HBase in the blog posts here.


HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help  spot problems before they arise, debug on-going issues, evaluate new usage patterns, and provide insight into capacity planning.

Since October 2008, version 0.19.0 , (HBASE-625) HBase has been using Hadoop’s metrics system to export metrics to JMX, Ganglia, and other metrics sinks. As the code base grew, more and more metrics were added by different developers. New features got metrics. When users needed more data on issues, they added more metrics. These new metrics were not always consistently named, and some were not well documented.

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started.  I also wanted to clean up the implementation cruft that had accumulated.


The implementation details are pretty dense on terminology so lets make sure everything is defined:

  • Metric: A measurement of a property in the system.

  • Snapshot: A set of metrics at a given point in time.

  • Metrics1: The old Apache Hadoop metrics system.

  • Metrics2: The new overhauled Apache Hadoop Metrics system.

  • Source: A class that exposes metrics to the Hadoop metrics system.

  • Sink: A class that receives metrics snapshots from the Hadoop metrics system.

  • JMX: Java Management Extension. A system built into java that facilitates the management of java processes over a network; it includes the ability to expose metrics.

  • Dynamic Metrics: Metrics that come and go. These metrics are not all known at compile time; instead they are discovered at runtime.


The Hadoop Metrics2 system implementations in branch-1 and branch-2 have diverged pretty drastically. This means that a single implementation of the code to move metrics from HBase to metrics2 sinks would not be performant or easy. As a result I created different hadoop compatibility shims and a system to load a version at runtime. This led to using ServiceLoader to create an instance of any class that touched parts of Hadoop that had changed between branch-1 and branch-2.

Here is an example of how a region server could request a Hadoop 2 version of the shim for exposing metrics about the HRegionServer. (Hadoop 1’s compatibility jar is shown in dotted lines to indicate that it could be swapped in if Hadoop 1 was being used)

This system allows HBase to support both Hadoop 1.x and Hadoop 2.x implementations without using reflection or other tricks to get around differences in API, usage, and naming.

Now that HBase can use either the Hadoop 1 or Hadoop 2 versions of the metrics 2 systems, I set about cleaning up what metrics HBase exposes, how those metrics are exposed, naming, and performance of gathering the data.

Metrics2 uses either annotations or sources to expose metrics. Since HBase can’t require any part of the metrics2 system in the core classes I exposed all metrics from HBase by creating sources. For metrics that are known ahead of time I created wrappers around classes in the core of HBase that the metrics2 shims could interrogate for values. Here is an example on how HRegionServer’s metrics(the non-dynamic metrics) are exposed :

The above pattern can be repeated to expose a great deal of the metrics that HBase has. However metrics about specific regions are still very interesting but can’t be exposed following the above pattern. So a new solution that would allow metrics about regions to be exposed by whichever HRegionServer is hosting that region was needed. To complicate things further Hadoop’s metrics2 system needs one MetricsSource to be responsible for all metrics that are going to be exposed through a JMX mbean. In order for metrics about regions to be well laid out, HBase needs a way to aggregate metrics from multiple regions into one source. This source will then be responsible for knowing what regions are assigned to the regionserver. These requirements led me to have one aggregation source that contains pseudo-sources for each region. These pseudo-sources each contain a wrapper around the region. This leads to something that looks like this:


That’s a lot of work to re-do a previously working metrics system, so what was gained by all this work? The entire system is much easier to test in unit and systems tests. The whole system has been made more regular; that is everything follows the same patterns and naming conventions. Finally everything has been rewritten to be faster.

Since the previous metrics have all been added on as needed they were not all named well. Some metrics were named following the pattern: “metricNameCount” others were named following “numMetricName” while still others were named like “metricName_Count”. This made parsing hard and gave a generally chaotic feel. After the overhaul metrics that are a counter start with the camel cased metric name followed by the suffix “Count.” The mbeans were poorly laid out. Some metrics we spread out between two mbeans. Metrics about a region were under an mbean named Dynamic, not the most descriptive name. Now mbeans are much better organized and have better descriptions.

Tests have found that single threaded scans run as much as 9% faster after HBase’s old metrics system has been replaced. The previous system used lots of ConcurrentHashMap’s to store dynamic metrics. All accesses to mutate these metrics required a lookup into these large hash maps. The new system minimizes the use of maps. Instead every region or server exports metrics to one pseudo source. The only changes to hashmaps in the metrics system occurs on region close or open.


Overall the whole system is just better. The process was long and laborious, but worth it to make sure that HBase’s metrics system is in a good state. HBase 0.95, and later 0.96, will have the new metrics system.  There’s still more work to be completed but great strides have been made.



Hot Blogs (today's hits)

Tag Cloud