Apache HBase

Tuesday May 24, 2016

Why We Use HBase: Recap so far

Just a quick note: If you haven't seen them yet you should check out the first four entries in our "Why We Use HBase" series. More guest posts in the series on deck soon.

  1. Scalable Distributed Transactional Queues on Apache HBase

  2. Medium Data and Universal Data Systems (great read!)

  3. Imgur Notifications: From MySQL to HBase

  4. Investing In Big Data: Apache HBase

 - Andrew

Tuning G1GC For Your HBase Cluster

Graham Baecher is a Senior Software Engineer on HubSpot's infrastructure team and Eric Abbott is a Staff Software Engineer at HubSpot.

An earlier version of this post was published here on the HubSpot Product and Engineering blog.

Tuning G1GC For Your HBase Cluster

HBase is the big data store of choice for engineering at HubSpot. It’s a complicated data store with a multitude of levers and knobs that can be adjusted to tune performance. We’ve put a lot of effort into optimizing the performance and stability of our HBase clusters, and recently discovered that suboptimal G1GC tuning was playing a big part in issues we were seeing, especially with stability.

Each of our HBase clusters is made up of 6 to 40 AWS d2.8xlarge instances serving terabytes of data. Individual instances handle sustained loads over 50k ops/sec with peaks well beyond that. This post will cover the efforts undertaken to iron out G1GC-related performance issues in these clusters. If you haven't already, we suggest getting familiar with the characteristics, quirks, and terminology of G1GC first.

We first discovered that G1GC might be a source of pain while investigating frequent

“...FSHLog: Slow sync cost: ### ms...”

messages in our RegionServer logs. Their occurrence correlated very closely to GC pauses, so we dug further into RegionServer GC. We discovered three issues related to GC:

  • One cluster was losing nodes regularly due to long GC pauses.
  • The overall GC time was between 15-25% during peak hours.
  • Individual GC events were frequently above 500ms, with many over 1s.

Below are the 7 tuning iterations we tried in order to solve these issues, and how each one played out. As a result, we developed a step-by-step summary for tuning HBase clusters.

Original GC Tuning State

The original JVM tuning was based on an Intel blog post, and over time morphed into the following configuration just prior to our major tuning effort.

-Xmx32g -Xms32g

32 GB heap, initial and max should be the same

-XX:G1NewSizePercent=3-9

Minimum size for Eden each epoch, differs by cluster

-XX:MaxGCPauseMillis=50

Optimistic target, most clusters take 100+ ms

-XX:-OmitStackTraceInFastThrow

Better stack traces in some circumstances, traded for a bit more CPU usage

-XX:+ParallelRefProcEnabled

Helps keep a lid on reference processing time issues we were seeing

-XX:+PerfDisableSharedMem

Alleged to protect against a bizarre Linux issue

-XX:-ResizePLAB

Alleged to save some CPU cycles in between GC epochs

GC logging verbosity as shown below was cranked up to a high enough level of detail for our homegrown gc_log_visualizer script. The majority of graphs in this document were created with gc_log_visualizer, while others were snapshotted from SignalFX data collected through our CollectD GC metrics plugin.

Our GC logging params:

-Xloggc:$GC_LOG_PATH -verbosegc -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime
Starting Point: Heap Sizes and Overall Time Spent in GC

With the highly detailed GC logs came the following chart of heap state. Eden size is in red and stays put at its minimum (G1NewSizePercent), 9% of total heap. Tenured size, or Working set + waste, floats in a narrow band between 18-20gb. With Eden a flat line, the total heap line will mirror the Tenured line, just 9% higher.

The black horizontal line just under 15GB marks the InitiatingHeapOccupancyPercent (aka “IHOP”), at its default setting of 45%. The purple squares are the amount of Tenured space reclaimable at the start of a mixed GC event. The floor of the band of purple squares is 5% of heap, the value of G1HeapWastePercent.

The next graph shows a red “+” at each minute boundary, representing the total percentage of wall time the JVM spent stopped in STW pauses, doing no useful work. The overall time spent in GC for this HBase instance over this time period is 15-25%. For reference, an application tier server spending 20%+ time in GC is considered stuck in “GC Hell” and in desperate need of tuning.
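For a quick look at this number without our tooling, the lines emitted by -XX:+PrintGCApplicationStoppedTime (with -XX:+PrintGCDateStamps) are enough to approximate it. The sketch below is not gc_log_visualizer, just a minimal standalone approximation in Python; the expected log-line format is an assumption about typical HotSpot output and may need adjusting for your JVM version.

import re
import sys
from collections import defaultdict

# Matches lines emitted by -XX:+PrintGCApplicationStoppedTime when
# -XX:+PrintGCDateStamps is also enabled, e.g.:
#   2016-05-24T10:00:01.123+0000: 12345.678: Total time for which application threads were stopped: 0.0234567 seconds, ...
STOPPED_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+.*"
    r"Total time for which application threads were stopped: ([\d.]+) seconds")

def stw_percent_per_minute(path):
    # Returns {"YYYY-MM-DDTHH:MM": percent of that minute spent in STW pauses}.
    stopped = defaultdict(float)
    with open(path) as log:
        for line in log:
            match = STOPPED_RE.match(line)
            if match:
                stopped[match.group(1)] += float(match.group(2))
    return {minute: 100.0 * total / 60.0 for minute, total in stopped.items()}

if __name__ == "__main__":
    for minute, pct in sorted(stw_percent_per_minute(sys.argv[1]).items()):
        print("%s  %.1f%% STW" % (minute, pct))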

Tuning #1 Goal: Lower Total Time in GC - Action: Raise IHOP

One aspect that stands out clearly in the previous graph of heap sizes is that the working set is well above the IHOP. Tenured being higher than IHOP generally results in an overabundance of MPCMC runs (wasting CPU) and, consequently, an overabundance of Mixed GC cycles, which raises the ratio of expensive GC events to cheap GC events. By moving IHOP a bit higher than Tenured, we expect fewer Mixed GC events, each reclaiming larger amounts of space, which should translate to less overall time spent in STW.

After raising the IHOP value on an HBase instance, the following before/after (above/below) graphs show that the frequency of Mixed GC events indeed drops dramatically while the reclaimable amount rises.

Considering that at least half of Mixed GC events on this instance took 200-400ms, we expected the reduced amount of Mixed GC events to outweigh any increase in individual Mixed GC times, such that overall GC time would drop. That expectation held true, as overall time spent in GC dropped from 4-12% to 1-8%.

The following graphs show before/after on the STW times for all Mixed GC events. Note the drastic drop in frequency while individual STW times don't seem to change.

Result: Success

This test was considered successful. We made the change across all HBase clusters to use a significantly larger IHOP value than the default of 45%.

Tuning #2 Goal: Lower Total Time in GC - Action: Increase Eden

Fixing the IHOP value to be higher than working set was basically fixing a misconfiguration. There was very little doubt that nonstop MPCMC + Mixed GC events was an inefficient behavior. Increasing Eden size, on the other hand, had a real possibility of increasing all STW times, both Minor and Mixed. GC times are driven by the amount of data being copied (surviving) from one epoch to the next, and databases like HBase are expected to have very large caches. A 10+ GB cache could very well have high churn and thus high object survival rates.

The effective Eden size for our HBase clusters is driven by the minimum Eden value G1NewSizePercent because the MaxGCPauseMillis target of 50ms is never met.

For this test, we raised Eden from 9% to 20% through G1NewSizePercent.

Effects on Overall GC Time

Looking at the following graphs, we see that overall time spent in GC may have dropped a little for this one hour time window from one day to the next.

Individual STW times

Looking at the STW times for just the Minor GC events there is a noticeable jump in the floor of STW times.

To-space Exhaustion Danger

As mentioned in the G1GC Foundational blog post, G1ReservePercent is ignored when the minimum end of the Eden range is used. The working set on a few of our HBase clusters is in the 70-75% range, which combined with a min Eden of 20% would leave only 5-10% of heap free for emergency circumstances. The downside of running out of free space, thus triggering To-space Exhaustion, is a 20+ sec GC pause and the effective death of the HBase instance. While the instance would recover, the other HBase instances in the cluster would consider it long dead before the GC pause completed.

Result: Failure

The overall time spent in GC did drop a little, as theoretically expected; unfortunately, the low end of Minor GC stop-the-world times increased by a similar percentage. In addition, the risk of To-space Exhaustion increased. The approach of increasing G1NewSizePercent to reduce overall time spent in GC didn't look promising and was not rolled out to any clusters.

Tuning #3 Goal: Reduce Individual STW Durations - Action: SurvivorRatio and MaxTenuringThreshold

In the previous tuning approach, we found that STW times increased as Eden size increased. We took some time to dig a little deeper into Survivor space to determine if there was any To-space overflow or if objects could be promoted faster to reduce the amount of object copying being done. To collect the Survivor space tenuring distribution data in the GC logs we enabled PrintTenuringDistribution and restarted a few select instances.

To-space overflow is the phenomenon where the Survivor To space isn't large enough to fit all the live data in Eden at the end of an epoch. Any data collected after Survivor To is full is added to Free regions, which are then added to Tenured. If this overflow is transient data, putting it in Tenured is inefficient as it will be expensive to clean out. If that was the case, we'd need to increase SurvivorRatio.

On the other hand, consider a use case where any object that survives two epochs will also survive ten epochs. In that case, by ensuring that any object that survives a second epoch is immediately promoted to Tenured, we would see a performance improvement since we wouldn’t be copying it around in the GC events that followed.

Here’s some data collected from the PrintTenuringDistribution parameter:

Desired survivor size 192937984 bytes, new threshold 2 (max 15)
- age 1: 152368032 bytes, 152368032 total
- age 2: 147385840 bytes, 299753872 total
[Eden: 2656.0M(2656.0M)->0.0B(2624.0M) Survivors: 288.0M->320.0M Heap: 25.5G(32.0G)->23.1G(32.0G)]

An Eden size of 2656 MB with SurvivorRatio = 8 (default) yields a 2656/8 = 332 MB survivor space. In the example entries we see enough room to hold two ages of survivor objects. The second age here is 5mb smaller than the first age, indicating that in the interval between GC events, only 5/152 = 3.3% of the data was transient. We can reasonably assume the other 97% of the data is some kind of caching. By setting MaxTenuringThreshold = 1, we optimize for the 97% of cached data to be promoted to Tenured after surviving its second epoch and hopefully shave a few ms of object copy time off each GC event.
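For reference, the arithmetic above can be automated against the GC log. The following is a minimal sketch (not part of our tooling) that parses the age lines of a PrintTenuringDistribution block and estimates how much age-1 data failed to reach age 2, mirroring the 5/152 calculation; it assumes the HotSpot log format shown above.

import re

AGE_RE = re.compile(r"- age\s+(\d+):\s+(\d+) bytes")

def transient_fraction(tenuring_block):
    # Compares age 2 against age 1 within a single tenuring distribution,
    # the same simple comparison as the 5/152 estimate above: data that
    # reached age 1 but not age 2 is treated as transient.
    ages = {}
    for line in tenuring_block:
        match = AGE_RE.search(line)
        if match:
            ages[int(match.group(1))] = int(match.group(2))
    if 1 in ages and 2 in ages and ages[1] > 0:
        return (ages[1] - ages[2]) / float(ages[1])
    return None

sample = [
    "Desired survivor size 192937984 bytes, new threshold 2 (max 15)",
    "- age 1: 152368032 bytes, 152368032 total",
    "- age 2: 147385840 bytes, 299753872 total",
]
print("transient between ages: %.1f%%" % (100 * transient_fraction(sample)))  # ~3.3%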

Result: Theoretical Success

Unfortunately we don't have any nice graphs available to show these effects in isolation. We consider the theory sound and rolled out MaxTenuringThreshold = 1 to all our clusters.

Tuning #4 Goal: Eliminate Long STW Events - Action: G1MixedGCCountTarget & G1HeapWastePercent

Next, we wanted to see what we could do about eliminating the high end of Mixed GC pauses. Looking at a 5 minute interval of our Mixed GC STW times, we saw a pattern of sharply increasing pauses across each cycle of 6 mixed GCs:

That in and of itself should not be considered unusual; after all, that behavior is how the G1GC algorithm got its name. Each Mixed GC event will evacuate 1 / G1MixedGCCountTarget of the high-waste regions (regions with the most non-live data). Since it prioritizes regions with the most garbage, each successive Mixed GC event will be evicting regions with more and more live objects. The chart shows the performance effects of clearing out regions with more and more live data: the Mixed event times start at 100ms at the beginning of a mixed GC cycle and range upwards past 600ms by the end.

In our case, we were seeing occasional pauses at the end of some cycles that were several seconds long. Even though they were rare enough that our average pause time was reasonable, pauses that long are still a serious concern.

Two levers can be used together to lessen the “scraping the bottom of the barrel” effect of cleaning up regions with a lot of live data:

G1HeapWastePercent: default (5) → 10. Allow twice as much wasted space in Tenured. Having 5% waste resulted in 6 of the 8 potential Mixed GC events occurring in each Mixed GC cycle. Bumping to 10% waste should chop off 1-2 more of the most expensive events of the Mixed GC cycle.

G1MixedGCCountTarget: default (8) → 16. Double the target number of Mixed GC events each Mixed GC cycle, but halve the work done by each GC event. Though it’s an increase to the number of GC events that are Mixed GC events, STW times of individual Mixed events should drop noticeably.

In combination, we expected doubling the target count to drop the average Mixed GC time, and increasing the allowable waste to eliminate the most expensive Mixed GC time. There should be some synergy, as more heap waste should also mean regions are on average slightly less live when collected.

Waste heap values of 10% and 15% were both examined in a test environment. (Don’t be alarmed by the high average pause times - this test environment was running under heavy load, on less capable hardware than our production servers.)

Above: 10% heap waste; below: 15% heap waste:

The results are very similar. 15% performed slightly better, but in the end we decided that 15% waste was unnecessarily high. 10% was enough to clear out the "scraping the bottom of the barrel" effect such that the 1+ sec Mixed GC STW times all but disappeared in production.

Result: Success

Doubling G1MixedGCCountTarget from 8 to 16 and G1HeapWastePercent from 5 to 10 succeeded in eliminating the frequent 1s+ Mixed GC STW times. We kept these changes and rolled them out across all our clusters.

Tuning #5 Goal: Stop Losing Nodes: Heap Size and HBase Configs

While running load tests to gauge the effects of the parameters above, we also began to dig into what looked like evidence of memory leaks in a production cluster. In the following graph we see the heap usage slowly grow over time until To-space Exhaustion, triggering a Full GC with a long enough pause to get the HBase instance booted from the cluster and killed:

We've got several HBase clusters, and only one cluster occasionally showed this type of behavior. If this issue were a memory leak, we'd expect the issue to arise more broadly, so it looks like HBase is using more memory in this cluster than we expected. To understand why, we looked into the heap breakdown of our RegionServers. The vast majority of an HBase RegionServer’s Tenured space is allocated to three main areas:

  • Memstore: region server’s write cache; default configuration caps this at 40% of heap.
  • Block Cache: region server’s read cache; default config caps at 40% of heap.
  • Overhead: the vast majority of HBase’s in-memory overhead is contained in a “static index”. The size of this index isn’t capped or configurable, but HBase assumes this won’t be an issue since the combined cap for memstore and block cache can’t exceed 80%.

We have metrics for the size of each of these, from the RegionServer’s JMX stats: “memStoreSize,” “blockCacheSize,” and “staticIndexSize.” The stats from our clusters show that HBase will use basically all of the block cache capacity you give it, but memstore and static index sizes depend on cluster usage and tables. Memstore fluctuates over time, while static index size depends on the RegionServer’s StoreFile count and average row key size.
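If you want to pull these numbers ad hoc rather than through a metrics pipeline, the RegionServer's JMX JSON servlet exposes them. The sketch below is only an illustration: the info port (16030) and the bean name are assumptions for a typical HBase 1.x deployment and may differ in yours.

import json
from urllib.request import urlopen

def regionserver_memory_stats(host, port=16030):
    # Queries the RegionServer JMX servlet and picks out the three metrics
    # discussed above. Bean name and port are assumptions; adjust as needed.
    url = ("http://%s:%d/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server"
           % (host, port))
    bean = json.loads(urlopen(url).read().decode())["beans"][0]
    return {key: bean.get(key)
            for key in ("memStoreSize", "blockCacheSize", "staticIndexSize")}

# Example (hypothetical host name):
# print(regionserver_memory_stats("regionserver-01.example.com"))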

It turned out, for the cluster in question, that the HBase caches and overhead combined were actually using more space than our JVM was tuned to handle. Not only were memstore and block cache close to capacity (12 GB block cache, memstore rising past 10 GB), but the static index size was unexpectedly large, at 6 GB. Combined, this put desired Tenured space at 28+ GB, while our IHOP was set at 24 GB, so the upward trend of our Tenured space was just the legitimate memory usage of the RegionServer.

With this in mind, we judged the maximum expected heap use for each cluster’s RegionServers by looking at the cluster maximum memstore size, block cache size, and static index size over the previous month, and assuming max expected usage to be 110% of each value. We then used that number to set the block cache and memstore size caps (hfile.block.cache.size and hbase.regionserver.global.memstore.size) in our HBase configs.

The fourth component of Tenured space is the heap waste, in our case 10% of the heap size. We could now confidently tune our IHOP threshold by summing the max expected memstore, block cache, static index size, 10% heap for heap waste, and finally 10% more heap as a buffer to avoid constant mixed GCs when working set is maxed (as described in Tuning #1).

However, before we went ahead and blindly set this value, we had to consider the uses of heap other than Tenured space. Eden requires a certain amount of heap (determined by G1NewSizePercent), and a certain amount (default 10%) is Reserved free space. IHOP + Eden + Reserved must be ≤ 100% in order for a tuning to make sense; in cases where our now-precisely-calculated IHOP was too large for this to be possible, we had to expand our RegionServer heaps. To determine minimum acceptable heap size, assuming 10% Reserved space, we used this formula:

Heap ≥ (M + B + O + E) ÷ 0.7

    • M = max expected memstore size
    • B = max expected block cache size
    • O = max expected overhead (static index)
    • E = minimum Eden size

When those four components add up to ≤ 70% of the heap, then there will be enough room for 10% Reserved space, 10% heap waste, and 10% buffer between max working set and IHOP.
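As a worked instance of this formula (the cap values below are hypothetical, roughly in line with the problematic cluster described above):

def min_heap_gb(memstore_cap, block_cache_cap, static_index_cap, eden_gb):
    # Heap >= (M + B + O + E) / 0.7, so that 10% Reserved space, 10% heap
    # waste and a 10% buffer between max working set and IHOP still fit.
    return (memstore_cap + block_cache_cap + static_index_cap + eden_gb) / 0.7

# Hypothetical 110% caps (~10 GB memstore, 12 GB block cache, ~6 GB static
# index) plus a 4 GB minimum Eden.
print("minimum heap: %.1f GB" % min_heap_gb(11.0, 13.2, 6.6, 4.0))  # ~49.7 GB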

Result: Success

We reviewed memory usage of each of our clusters and calculated correct heap sizes and IHOP thresholds for each. Rolling out these changes immediately ended the To-space Exhaustion events we were seeing on the problematic cluster.

Tuning #6 Goal: Eliminate Long STW Events - Action: Increase G1 Region Size

We’d rolled out HBase block cache & memstore config changes, changes to G1HeapWastePercent and G1MixedGCCountTarget, and an increase in heap size on a couple clusters (32 GB → 40+ GB) to accommodate larger IHOP. In general things were smooth, but there were still occasional Mixed GC events taking longer than we were comfortable with, especially on the clusters whose heap had increased. Using gc_log_visualizer, we looked into what phase of Mixed GC was the most volatile and noticed that Scan RS times correlated:

A few Google searches indicated that Scan RS time output in the GC logs is the time taken examining all the regions referencing the tenured regions being collected. In our most recent tuning changes, heap size had been bumped up, however the G1HeapRegionSize remained fixed at 16 MB. Increasing the G1HeapRegionSize to 32 MB eliminated those high scan times:

Result: Success

Halving the G1 region count cleared out the high volatility in Scan RS times. According to G1GC documentation, the ideal region count is 2048, so 16 MB regions were perfect for a 32 GB heap. However, this tuning case led us to believe that for HBase heaps without a clear choice of region size, in our case 40+ GB, it’s much better to err on the side of fewer, larger regions.

Tuning #7 Goal: Preventing RegionServer To-space Exhaustion - Action: Extra Heap as Buffer

At this point, our RegionServers were tuned and running much shorter and less frequent GC events. IHOP rested above Tenured while Tenured + Eden remained under the target of 90% total heap. Yet once in a while, a RegionServer would still die from a To-space Exhaustion triggered Full GC event, as shown in the following graph.

It looks like we did everything right: there’s lots of reclaimable space and Tenured space drops well below IHOP with each Mixed GC. But right at the end, heap usage spikes up and we hit To-space Exhaustion. And while it’s likely the HBase client whose requests caused this problem could be improved to avoid this*, we can’t rely on our various HBase clients to behave perfectly all the time.

In the scenario above, very bursty traffic caused Tenured space to fill up the heap before the MPCMC could complete and enable a Mixed GC run. To tune around this, we simply added heap space**, while adjusting IHOP and G1NewSizePercent down to keep them at the same GB values they had been at before. By doing this we increased the buffer of free space in the heap above our original 10% default, for additional protection against spikes in memory usage.

Result: Success

Increasing heap buffer space on clusters whose HBase clients are known to be occasionally bursty has all but eliminated Full GC events on our RegionServers.


* Block cache churn correlates very closely with time spent in Mixed GC events on our clusters (see chart below). A high volume of Get and Scan requests with caching enabled unnecessarily (e.g. requested data isn’t expected to be in cache and isn’t expected to be requested again soon) will increase cache churn as data is evicted from cache to make room for the Get/Scan results. This will raise the RegionServer’s time in GC and could contribute to instability as described in this section.

Chart: % time in Mixed GC is in yellow (left axis); MB/sec cache churn is in blue (right axis):

** Another potential way to tune around this issue is by increasing ConcGCThreads (default is 1/4 * ParallelGCThreads). ConcGCThreads is the number of threads used to do the MPCMC, so increasing it could mean the MPCMC finishes sooner and the RegionServer can start a Mixed GC before Tenured space fills the heap. At HubSpot we’ve been satisfied with the results of increasing our heap size and haven’t tried experimenting with this value.

Overall Results: Goals Achieved!

After these cycles of debugging and tuning G1GC for our HBase clusters, we’ve improved performance in all the areas we were seeing problems originally:

  • Stability: no more To-space Exhaustion events causing Full GCs.
  • 99th percentile performance: greatly reduced frequency of long STW times.
  • Avg. performance: overall time spent in GC STW significantly reduced.

Summary: How to Tune Your HBase Cluster

Based on our findings, here’s how we recommend you tune G1GC for your HBase cluster(s):

Before you start: GC and HBase monitoring

  • Track block cache, memstore and static index size metrics for your clusters in whatever tool you use for charts and monitoring, if you don’t already. These are found in the RegionServer JMX metrics:
    • memStoreSize
    • blockCacheSize
    • staticIndexSize

  • You can use our collectd plugin to track G1GC performance over time, and our gc_log_visualizer for insight on specific GC logs. In order to use these you’ll have to log GC details on your RegionServers:
    • -Xloggc:$GC_LOG_PATH
    • -verbosegc
    • -XX:+PrintGC
    • -XX:+PrintGCDateStamps
    • -XX:+PrintAdaptiveSizePolicy
    • -XX:+PrintGCDetails
    • -XX:+PrintGCApplicationStoppedTime
    • -XX:+PrintTenuringDistribution
    • Also recommended is some kind of GC log rotation, e.g.:
      • -XX:+UseGCLogFileRotation
      • -XX:NumberOfGCLogFiles = 5
      • -XX:GCLogFileSize=20M

Step 0: Recommended Defaults

  • We recommend the following JVM parameters and values as defaults for your HBase RegionServers (as explained in Original GC Tuning State):
    • -XX:+UseG1GC
    • -XX:+UnlockExperimentalVMOptions
    • -XX:MaxGCPauseMillis = 50
      • This value is intentionally very low and doesn’t actually represent a pause time upper bound. We recommend keeping it low to pin Eden to the low end of its range (see Tuning #2).
    • -XX:-OmitStackTraceInFastThrow
    • -XX:ParallelGCThreads = 8 + (logical processors - 8) * (5/8)
    • -XX:+ParallelRefProcEnabled
    • -XX:+PerfDisableSharedMem
    • -XX:-ResizePLAB 

Step 1: Determine maximum expected HBase usage

  • As discussed in the Tuning #5 section, before you can properly tune your cluster you need to know your max expected block cache, memstore, and static index usage.
    • Using the RegionServer JMX metrics mentioned above, look for the maximum value of each metric across the cluster:
      • Maximum block cache size.
      • Maximum memstore size.
      • Maximum static index size.
    • Scale each maximum to 110% of its observed value, to accommodate a slight increase in max usage. This is your usage cap for that metric: e.g. a 10 GB max recorded memstore → 11 GB memstore cap.
      • Ideally, you’ll have these metrics tracked for the past week or month, and you can find the maximum values over that time. If not, be more generous than 110% when calculating memstore and static index caps. Memstore size especially can vary widely over time. 

Step 2: Set Heap size, IHOP, and Eden size

  • Start with Eden size relatively low: 8% of heap is a good initial value if you’re not sure.
    • -XX:G1NewSizePercent = 8
    • See Tuning #2 for more about Eden sizing.
      • Increasing Eden size will increase individual GC pauses, but slightly reduce overall % time spent in GC.
      • Decreasing Eden size will have the opposite effect: shorter pauses, slightly more overall time in GC.

  • Determine necessary heap size, using Eden size and usage caps from Step 1:
    • From Tuning #5: Heap ≥ (M + B + O + E) ÷ 0.7
      • M = memstore cap, GB
      • B = block cache cap, GB
      • O = static index cap, GB
      • E = Eden size, GB
    • Set JVM args for fixed heap size based on the calculated value, e.g:
      • -Xms40960m -Xmx40960m

  • Set IHOP in the JVM, based on usage caps and heap size:
    • IHOP = (memstore cap’s % heap + block cache cap’s % heap + overhead cap’s % heap + 20), where the extra 20 covers the 10% heap waste plus the 10% buffer described in Tuning #5
    • -XX:InitiatingHeapOccupancyPercent = IHOP

Step 3: Adjust HBase configs based on usage caps

  • Set block cache cap and memstore cap ratios in HBase configs, based on usage caps and total heap size. In hbase-site.xml:
    • hfile.block.cache.size → block cache cap ÷ heap size
    • hbase.regionserver.global.memstore.size → memstore cap ÷ heap size 
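Putting Steps 1 through 3 together, here is a small worked example; the metric maxima are hypothetical placeholders, not recommendations:

# Step 1: hypothetical observed maxima (GB); use your own cluster's numbers.
max_memstore_gb, max_block_cache_gb, max_static_index_gb = 8.0, 10.0, 4.0
m_cap = 1.1 * max_memstore_gb       # 8.8 GB usage cap
b_cap = 1.1 * max_block_cache_gb    # 11.0 GB usage cap
o_cap = 1.1 * max_static_index_gb   # 4.4 GB usage cap

# Step 2: heap, Eden and IHOP. Eden at 8% of heap; heap chosen so that
# (M + B + O + E) / 0.7 <= heap, per the Tuning #5 formula.
heap_gb = 44.0
eden_gb = 0.08 * heap_gb
assert (m_cap + b_cap + o_cap + eden_gb) / 0.7 <= heap_gb
ihop = round(100 * (m_cap + b_cap + o_cap) / heap_gb) + 20   # +10% waste, +10% buffer
print("-Xms%dm -Xmx%dm" % (heap_gb * 1024, heap_gb * 1024))  # -Xms45056m -Xmx45056m
print("-XX:G1NewSizePercent = 8")
print("-XX:InitiatingHeapOccupancyPercent = %d" % ihop)      # 55 + 20 = 75 here

# Step 3: HBase cache ratios for hbase-site.xml, from the caps and heap size.
print("hfile.block.cache.size = %.2f" % (b_cap / heap_gb))                   # 0.25
print("hbase.regionserver.global.memstore.size = %.2f" % (m_cap / heap_gb))  # 0.20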

Step 4: Adjust additional recommended JVM flags for GC performance

  • From Tuning #3:
    • -XX:MaxTenuringThreshold = 1

  • From Tuning #4:
    • -XX:G1HeapWastePercent = 10
    • -XX:G1MixedGCCountTarget = 16

  • From Tuning #6:
    • -XX:G1HeapRegionSize = #M
    • # must be a power of 2, in range [1..32].
    • Ideally, # is such that: heap size ÷ # MB = 2048 regions.
    • If your heap size doesn’t provide a clear choice for region size, err on the side of fewer, larger regions. Larger regions reduce GC pause time volatility.
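For example, a small helper like this one (a sketch, not anything we ship) picks the power-of-two region size that keeps the region count at or below the ideal 2048, erring toward fewer, larger regions when the heap doesn't divide evenly:

def g1_region_size_mb(heap_gb):
    # Smallest power-of-two region size (1..32 MB) that keeps the region
    # count at or below 2048; large heaps end up at the 32 MB maximum.
    heap_mb = heap_gb * 1024
    for size in (1, 2, 4, 8, 16, 32):
        if heap_mb / size <= 2048:
            return size
    return 32

for heap_gb in (32, 40, 50, 64):
    size = g1_region_size_mb(heap_gb)
    print("heap %d GB -> -XX:G1HeapRegionSize = %dM (%d regions)"
          % (heap_gb, size, heap_gb * 1024 // size))
# heap 32 GB -> 16M (2048 regions); heap 40 GB -> 32M (1280 regions); ...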

Step 5: Run it!

  •  Restart your RegionServers with these settings and see how they look.

    • Remember that you can adjust Eden size as described above, to optimize either for shorter individual GCs or for less overall time in GC. If you do, make sure to maintain Eden + IHOP ≤ 90%.

    • If your HBase clients can have very bursty traffic, consider adding heap space outside of IHOP and Eden (so that IHOP + Eden adds up to 80%, for example).

      • Remember to update % and ratio configs along with the new heap size.

      • Details and further suggestions about this found in Tuning #7. 

Further reference:

Saturday Apr 23, 2016

HDFS HSM and HBase: Conclusions (Part 7 of 7)

This is part 7 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)  

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions


There are many things to consider when choosing the hardware for a cluster. According to the test results, in the SSD-related cases the network utilization between DataNodes exceeds 10Gbps. If you are using a 10Gbps switch, the network will be the bottleneck and impact the performance. We suggest either extending the network bandwidth by network bonding, or upgrading to a more powerful switch with higher bandwidth. In the 1T_HDD and 1T_RAM_HDD cases, the network utilization is lower than 10Gbps most of the time, so using a 10Gbps switch to connect the DataNodes is fine.

In all 1T dataset tests, 1T_RAM_SSD shows the best performance. An appropriate mix of different types of storage can improve HBase write performance. First, write the latency-sensitive and blocking data to faster storage, and write the data that is rarely compacted and accessed to slower storage. Second, avoid mixing types of storage with a large performance gap, such as in 1T_RAM_HDD.

A hardware design issue limits the total disk bandwidth, so eight SSDs show hardly any advantage over four SSDs. Either enhance the hardware by using HBA cards to eliminate this limitation for eight SSDs, or mix the storage types appropriately. According to the test results, in order to achieve a better balance between performance and cost, using four SSDs and four HDDs achieves good performance (102% throughput and 101% latency of eight SSDs) at a much lower price. The RAMDISK/SSD tiered storage is the winner in both throughput and latency among all the tests, so if cost is not an issue and maximum performance is needed, RAMDISK (an extremely high speed block device, e.g. NVMe PCI-E SSD)/SSD should be chosen.

You should not use a large number of flushers/compactors when most of the data is written to HDD. Reads and writes share the single channel per HDD, and too many flushers and compactors running at the same time can slow down HDD performance.

During the tests, we found some things that can be improved in both HBase and HDFS.

In HBase, the memstore is consumed quickly when the WALs are stored in fast storage; this can lead to regular long GC pauses. It is better to have an offheap memstore for HBase.

In HDFS, each DataNode shares the same lock when creating/finalizing blocks. Any slow operation in one DataXceiver can block the creating/finalizing of blocks in every other DataXceiver on the same DataNode, no matter what storage they are using. We need to eliminate this blocking across storage types; a finer grained lock mechanism to isolate the operations on different blocks is needed (HDFS-9668). It would also be good to implement a latency-aware VolumeChoosingPolicy in HDFS to remove slow volumes from the candidates.

RoundRobinVolumeChoosingPolicy can lead to load imbalance in HDFS with tiered storage (HDFS-9608).

In HDFS, renaming a file to a different storage does not actually move the blocks. We need to move the HDFS blocks asynchronously in such a case.


The authors would like to thank Weihua Jiang – the previous manager of the big data team at Intel – for leading this performance evaluation, and thank Anoop Sam John (Intel), Apekshit Sharma (Cloudera), Jonathan Hsieh (Cloudera), Michael Stack (Cloudera), Ramkrishna S. Vasudevan (Intel), Sean Busbey (Cloudera) and Uma Gangumalla (Intel) for the nice review and guidance.

HDFS HSM and HBase: Issues (Part 6 of 7)

This is part 6 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Issues and Improvements

This section describes the issues we found during the tests.


Long-time BLOCKED threads in DataNode

In the 1T_RAM_HDD test, we can observe substantial periods of zero throughput in the YCSB client. After deep-diving into the thread stacks, we find that many DataXceiver threads in the DataNode are stuck in a BLOCKED state for a long time. We can observe this in other cases too, but it happens most often in 1T_RAM_HDD.

In each DataNode, there is a single instance of FsDatasetImpl with many synchronized methods; DataXceiver threads use this instance to synchronize when creating/finalizing blocks. A slow creating/finalizing operation in one DataXceiver thread can block the creating/finalizing operations in all other DataXceiver threads. The following table shows the time consumed by these operations in 1T_RAM_HDD:

Synchronized methods: maximum execution time (ms) and maximum wait time (ms), under light load and under heavy load (per-method values not reproduced here).

Table 13. DataXceiver threads and time consumed

We can see that both the execution time and the wait time on the synchronized methods increased dramatically as the system load increased. The time spent waiting for locks can reach tens of seconds. Slow operations usually come from slow storage such as HDD, and they hurt the concurrent creating/finalizing of blocks on fast storage, so HDFS/HBase cannot make the best use of tiered storage.

A finer grained lock mechanism in DataNode is needed to fix this issue. We are working on this improvement now (HDFS-9668).
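To illustrate the effect with a toy model (this is not HDFS code): with one shared lock, a slow block finalize on HDD keeps a fast SSD finalize BLOCKED, while per-volume locks would let it proceed.

import threading
import time

def finalize_block(lock, device_seconds, done, name):
    with lock:                      # stands in for a synchronized method
        time.sleep(device_seconds)  # stands in for the device work
    done[name] = time.time()

def run(shared_lock):
    locks = {"HDD": threading.Lock(), "SSD": threading.Lock()}
    if shared_lock:                 # coarse: both volumes share one lock
        locks["SSD"] = locks["HDD"]
    done, start = {}, time.time()
    hdd = threading.Thread(target=finalize_block, args=(locks["HDD"], 2.0, done, "HDD"))
    ssd = threading.Thread(target=finalize_block, args=(locks["SSD"], 0.01, done, "SSD"))
    hdd.start(); time.sleep(0.1); ssd.start()   # the slow HDD operation takes the lock first
    hdd.join(); ssd.join()
    print("shared lock=%s: SSD finalize done after %.2fs" % (shared_lock, done["SSD"] - start))

run(True)    # SSD finalize is blocked behind the 2-second HDD finalize
run(False)   # SSD finalize completes in about 0.1s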

Load Imbalance in HDFS with Tiered Storage

In the tiered storage cases, we find that utilization is not the same among volumes of the same storage type when using RoundRobinVolumeChoosingPolicy. The root cause is that RoundRobinVolumeChoosingPolicy uses a single shared counter to choose volumes for all storage types. This can be unfair when choosing volumes for a certain storage type, so the volumes at the tail of the configured data directories have a lower chance of being written.

The situation becomes even worse when there are different numbers of volumes of different storage types. We have filed this issue in JIRA (HDFS-9608) and provided a patch.
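A toy simulation (not the actual HDFS implementation) shows how one round-robin counter shared across storage types can starve some volumes, while a counter per storage type keeps placement even:

from collections import Counter

volumes = {"SSD": ["ssd0", "ssd1", "ssd2", "ssd3"],
           "HDD": ["hdd0", "hdd1", "hdd2", "hdd3"]}

def place_blocks(requests, shared_counter):
    # shared_counter=True models one counter for every storage type (the
    # unfair behavior described above); False models one counter per type.
    counters = {"shared": 0} if shared_counter else {t: 0 for t in volumes}
    chosen = Counter()
    for storage_type in requests:
        key = "shared" if shared_counter else storage_type
        vols = volumes[storage_type]
        chosen[vols[counters[key] % len(vols)]] += 1
        counters[key] += 1
    return dict(chosen)

requests = [t for _ in range(1000) for t in ("SSD", "HDD")]  # alternating allocations
print("shared counter:  ", place_blocks(requests, True))   # half the volumes are never chosen
print("per-type counter:", place_blocks(requests, False))  # an even 250 blocks per volume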

Asynchronous File Movement Across Storage When Renaming in HDFS

Currently, data blocks in HDFS are stored on the storage media specified by a pre-defined storage policy at creation time. After that, data blocks remain where they are until the external HDFS tool Mover is used. Mover scans the whole namespace and moves the data blocks that are not stored on the storage media the policy specifies.

In tiered storage, when we rename a file/directory from one storage to a different one, we have to move the blocks of that file, or of all the files under that directory, to the right storage. This is not currently provided in HDFS.

Non-configurable Storage Type and Policy

Currently in HDFS both storage type and storage policy are predefined in source code. This makes it inconvenient to add implementations for new devices and policies. It is better to make them configurable.

No Optimization for Certain Storage Type

Currently there is no difference in the execution path for different storage types. As more and more high performance storage devices are adopted, the performance gap between storage types will become larger, and the optimization for certain types of storage will be needed.

Take writing a certain amount of data into HDFS as an example. If users want to minimize the total write time, the optimal way for HDD may be to use compression to save disk I/O, while for RAMDISK writing directly is more suitable as it eliminates the overhead of compression. This scenario requires per-storage-type configuration, but that is not supported in the current implementation.


Disk Bandwidth Limitation

In the section 50GB Dataset in a Single Storage, the performance difference between four SSDs and eight SSDs is very small. The root cause is that the total bandwidth available to the eight SSDs is limited by the upper-level hardware controllers. Figure 16 illustrates the motherboard design. The eight disk slots connect to two different SATA controllers (Ports 0:3 - SATA and Port 0:3 - sSATA). As highlighted in the red rectangle, the maximum bandwidth available to the two controllers in the server is 2 * 6 Gbps = 1536 MB/s.

Figure 16. Hardware design of server motherboard

Maximum throughput for single disk is measured with FIO tool.

Read BW (MB/s) and Write BW (MB/s) per storage type (RAMDISK, SSD, HDD); per-disk values are not reproduced here, but the write bandwidth referenced below is roughly 447 MB/s per SSD and 127 MB/s per HDD.

Table 14. Maximum throughput of storage media

Note: RAMDISK is essentially memory and does not go through the same controllers as the SSDs and HDDs, so it does not have the 2*6Gbps limitation. RAMDISK data is listed in the table for convenience of comparison.

According to Table 14, the write bandwidth of eight SSDs is 447 x 8 = 3576 MB/s. This exceeds the controllers’ 1536 MB/s physical limitation, so only 1536 MB/s is available to all eight SSDs. HDD is not affected by this limitation, as the total HDD bandwidth (127 x 8 = 1016 MB/s) is below the limit. This fact greatly impacts the performance of the storage system.
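The arithmetic can be spelled out as a quick back-of-the-envelope check, using the per-disk write bandwidth figures above:

controller_limit = 12 * 1024 // 8           # two 6 Gbps SATA controllers: 12 Gbps ~= 1536 MB/s
per_disk_write = {"SSD": 447, "HDD": 127}   # MB/s per disk

for media, bw in per_disk_write.items():
    raw = 8 * bw                            # eight disks per node
    usable = min(raw, controller_limit)
    limited = " (controller-limited)" if raw > usable else ""
    print("8 x %s: raw %d MB/s, usable %d MB/s%s" % (media, raw, usable, limited))
# 8 x SSD: raw 3576 MB/s, usable 1536 MB/s (controller-limited)
# 8 x HDD: raw 1016 MB/s, usable 1016 MB/s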

We suggest one of the following:

  • Enhance hardware by using more HBA cards to eliminate the limitation of the design issue.

  • Use SSD and HDD together with an appropriate ratio (for example four SSDs and four HDDs) to achieve a better balance between performance and cost.

Disk I/O Bandwidth and Latency Varies for Ports

As described in the section Disk Bandwidth Limitation, four of the eight disks connect to Ports 0:3 - SATA and the rest connect to Port 0:3 - sSATA; the total bandwidth of the two controllers is 12Gbps. We find that this bandwidth is not evenly divided among the disk channels.

We ran a test in which each SSD (sda, sdb, sdc and sdd connect to Port 0:3 - sSATA; sde, sdf, sdg and sdh connect to Ports 0:3 - SATA) is written by an individual FIO process. We expected all eight disks to be written at the same speed, but according to the IOSTAT output the 1536 MB/s of bandwidth is not evenly divided between the two controllers and the eight disk channels. As shown in Figure 17, the four SSDs connected to Ports 0:3 - SATA obtain more I/O bandwidth (213.5MB/s*4) than the others (107MB/s*4).

We suggest that you consider the controller limitation and storage bandwidth when setting up a cluster. Using four SSDs and four HDDs in a node is a reasonable choice, and it is better to install the four SSDs to Port 0:3 - SATA.

Additionally, a disk with higher latency might be given the same workload as the disks with lower latency under the existing VolumeChoosingPolicy, which would slow down the performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS.

Figure 17. Different write speed and await time of disks

Go to part 7, Conclusions

HDFS HSM and HBase: Experiment (continued) (Part 5 of 7)

This is part 5 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

1TB Dataset in a Single Storage

The performance for the 1TB dataset on HDD and SSD is shown in Figure 6 and Figure 7. Due to the limited memory capacity, the 1TB dataset is not tested on RAMDISK.

Figure 6. YCSB throughput of a single storage type with 1TB dataset

Figure 7. YCSB latency of a single storage type with 1TB dataset

The throughput and latency on SSD are both better than HDD (134% throughput and 35% latency). This is consistent with 50GB data test.

The throughput benefit gained by using SSD differs between 50GB and 1TB (from 128% to 134%); SSD gains more benefit in the 1TB test. This is because many more I/O intensive events, such as compactions, occur in the 1TB dataset test than in the 50GB test, which shows the superiority of SSD in huge data scenarios. Figure 8 shows the changes in network throughput during the tests.

Figure 8. Network throughput measured for case 1T_HDD and 1T_SSD

In the 1T_HDD case the network throughput is lower than 10Gbps, while in the 1T_SSD case the network throughput can be much larger than 10Gbps. This means that if we used a 10Gbps switch in the 1T_SSD case, the network would be the bottleneck.

Figure 9. Disk throughput measured for case 1T_HDD and 1T_SSD

In Figure 9, we can see the bottleneck for these two cases is disk bandwidth.

  • In 1T_HDD, at the beginning of the test the throughput is almost 1000 MB/s, but after a while the throughput drops because regions hit their memstore limits due to slow flushes.

  • In the 1T_SSD case, the throughput seems to be limited by a ceiling of around 1300 MB/s, nearly the same as the bandwidth limitation of the SATA controllers. To further improve the throughput, more SATA controllers (e.g. using an HBA card) are needed rather than more SSDs.

During the 1T_SSD test, we observe that the operation latencies of the eight SSDs per node are very different, as shown in the following chart. In Figure 10 we only include the latency of two disks: sdb represents the disks with high latency and sdf represents the disks with low latency.

Figure 10. I/O await time measured for different disks

Four of them have better latency than the others. This is caused by the hardware design issue; you can find the details in Disk I/O Bandwidth and Latency Varies for Ports. A disk with higher latency might be given the same workload as the disks with lower latency under the existing VolumeChoosingPolicy, which would slow down the performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS.

Performance Estimation for RAMDISK with 1TB Dataset

We cannot measure the performance of RAMDISK with the 1TB dataset due to RAMDISK’s limited capacity. Instead we estimate its performance by analyzing the results of the HDD and SSD cases.

The performance with the 1TB and 50GB datasets is pretty close for both HDD and SSD.

The throughput differences between the 50GB and 1TB datasets are 2.89% for HDD and 1.41% for SSD. If we take the average of these two values as the throughput change for RAMDISK between the 50GB and 1TB datasets, it is around 2.15% ((2.89% + 1.41%) / 2 = 2.15%), thus the estimated throughput for RAMDISK with the 1TB dataset is

406577 × (1 + 2.15%) = 415318 (ops/sec)

Figure 11.  YCSB throughput estimation for RAMDISK with 1TB dataset

Please note: the throughput doesn’t drop much in 1 TB dataset cases compared to 50 GB dataset cases because they do not use the same number of pre-split regions. The table is pre-split to 18 regions in 50 GB dataset cases, and it is pre-split to 210 regions in the 1 TB dataset.

Performance for Tiered Storage

In this section, we study the HBase write performance on tiered storage (i.e. different storage types mixed together in one test). This shows what performance can be achieved by mixing fast and slow storage together, and helps us conclude the best balance of storage between performance and cost.

Figure 12 and Figure 13 show the performance for tiered storage. You can find the description of each case in Table 1.

Most of the cases that introduce fast storage have better throughput and latency. With no surprise, 1T_RAM_SSD has the best performance among them. The real surprise is that, after introducing RAMDISK, the throughput of 1T_RAM_HDD is worse than 1T_HDD (-11%) and 1T_RAM_SSD_All_HDD is worse than 1T_SSD_All_HDD (-2%); also, 1T_SSD is worse than 1T_SSD_HDD (-2%).

Figure 12.  YCSB throughput data for tiered storage

Figure 13.  YCSB latency data for tiered storage

We also investigate how much data is written to different storage types by collecting information from one DataNode.

Figure 14. Distribution of data blocks on each storage of HDFS in one DataNode

As shown in Figure 14, generally more data is written to disk in the test cases with higher throughput. Fast storage can accelerate flushes and compactions, which leads to more flushes and compactions, and thus more data written to disk. In some RAMDISK-related cases, only the WAL can be written to RAMDISK, and there are 1216 GB of WALs written to one DataNode.

For the tests without SSD (1T_HDD and 1T_RAM_HDD), we deliberately limit the number of flush and compaction actions by using fewer flushers and compactors. This is due to the limited IOPS capability of HDD: too many concurrent reads and writes hurt HDD performance, which eventually slows down the overall performance.

DataNode threads can stay BLOCKED for up to tens of seconds in 1T_RAM_HDD. We observe this in other cases as well, but it happens most often in 1T_RAM_HDD. This is because each DataNode holds one big lock when creating/finalizing HDFS blocks, and these methods can sometimes take tens of seconds (see Long-time BLOCKED threads in DataNode). The more these methods are used (in HBase they are used by the flusher, the compactor, and the WAL), the more often the blocking occurs. Writing the WAL in HBase needs to create/finalize blocks, which can be blocked, and consequently users’ writes are blocked. Multiple WALs with a large number of groups, or a WAL per region, might also encounter this problem, especially on HDD.

With the written data distribution in mind, let’s look back at the performance result in Figure 12 and Figure 13. According to them, we have following observations:

  1. Mixing SSD and HDD can greatly improve performance (136% throughput and 35% latency) compared to pure HDD. But fully replacing HDD with SSD doesn’t show an improvement (98% throughput and 99% latency) over mixing SSD/HDD. This is because the hardware design cannot evenly split the I/O bandwidth across all eight disks, and 94% of the data is written to SSD while only 6% is written to HDD in the SSD/HDD mixing case. This strongly hints that a mixed use of SSD/HDD can achieve the best balance between performance and cost. More information is in Disk Bandwidth Limitation and Disk I/O Bandwidth and Latency Varies for Ports.

  2. Including RAMDISK in SSD/HDD tiered storage gives different results in 1T_RAM_SSD_All_HDD and 1T_RAM_SSD_HDD. The 1T_RAM_SSD_HDD case, where only a small amount of data is written to HDD, improves on the SSD/HDD mixing cases. The 1T_RAM_SSD_All_HDD case, where a large amount of data is written to HDD, is worse than the SSD/HDD mixing cases. This means that if we distribute the data appropriately to SSD and HDD in HBase, we can get good performance when mixing RAMDISK/SSD/HDD.

  3. The RAMDISK/SSD tiered storage is the winner of both throughput and latency (109% throughput and 67% latency of pure SSD case). So, if cost is not an issue and maximum performance is needed, RAMDISK/SSD should be chosen.

The throughput decreases by 11% when comparing 1T_RAM_HDD to 1T_HDD. This is initially because 1T_RAM_HDD uses RAMDISK, which consumes part of the RAM and leaves the OS buffer less memory to cache the data.

Further, with 1T_RAM_HDD the YCSB client can push data at a very high speed, so cells accumulate very quickly in the memstore while flushes and compactions on HDD are slow. RegionTooBusyException occurs more often (the figure below shows a much larger memstore in 1T_RAM_HDD than in 1T_HDD), and we observe much longer GC pauses in 1T_RAM_HDD than in 1T_HDD, up to 20 seconds in a minute.

Figure 15. Memstore size in 1T_RAM_HDD and 1T_HDD

Finally, when we try to increase the number of flushers and compactors, the performance gets even worse, for the reasons mentioned when explaining why we use fewer flushers and compactors in the HDD-related tests (see Long-time BLOCKED threads in DataNode).

The performance reduction of 1T_RAM_SSD_All_HDD compared to 1T_SSD_All_HDD (-2%) is due to the same reasons mentioned above.

We suggest:

  1. Implement a finer grained lock mechanism in DataNode.

  2. Use reasonable configurations for flushers and compactors, especially in HDD-related cases.

  3. Don’t mix storage types that have a large performance gap, such as RAMDISK and HDD directly.

  4. In many cases, we observe long GC pauses of around 10 seconds per minute. We need to implement an off-heap memstore in HBase to solve the long GC pause issues.

Go to part 6, Issues

HDFS HSM and HBase: Experiment (Part 4 of 7)

This is part 4 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel) 

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions


Performance for Single Storage Type

First, we will study the HBase write performance on each single storage type (i.e. no storage type mix). This test is to show the maximum performance each storage type can achieve and provide a guide to following hierarchy storage analysis.

We study the single-storage-type performance with two datasets: 50GB and 1TB of inserted data. We believe 1TB is a reasonable data size in practice, and 50GB is used to evaluate the performance upper limit, as HBase performance is typically higher when the data size is small. The 50GB size was chosen because we need to avoid running out of space during the test. Further, due to RAMDISK’s limited capacity, we have to use a small data size when storing all data in RAMDISK.

50GB Dataset in a Single Storage

The throughput and latency by YCSB for 50GB dataset are listed in Figure 2 and Figure 3. As expected, storing 50GB data in RAMDISK has the best performance whereas storing data in HDD has the worst.

Figure 2. YCSB throughput of a single storage with 50GB dataset

Note: Performance data for SSD and HDD may be better than its real capability as OS can buffer/cache data in memory.

Figure 3. YCSB latency of a single storage with 50GB dataset

Note: Performance data for SSD and HDD may be better than its real capability as OS can buffer/cache data in memory.

For throughput, RAMDISK and SSD are higher than HDD (163% and 128% of HDD throughput, respectively), and their average latencies are dramatically lower than HDD’s (only 11% and 32% of HDD latency). This is expected, as HDD has a long seek latency. The latency improvements of RAMDISK and SSD over HDD are bigger than the throughput improvements. This is caused by the huge access latency gap between the storage types: the latencies we measured for accessing 4KB of raw data from RAM, SSD and HDD are 89ns, 80µs and 6.7ms respectively (RAM and SSD are about 75000x and 84x faster than HDD).

Now consider the data of hardware (disk, network and CPU in each DataNode) collected during the tests.

Disk Throughput

Theoretical throughput (MB/s): 50G_HDD: 127 x 8 = 1016; 50G_SSD: 447 x 8 = 3576. Measured throughput values are not reproduced here.

Table 10. Disk utilization of test cases

Note: The data listed in the table for 50G_HDD and 50G_SSD are the total throughput of 8 disks.

The disk throughput is decided by factors such as access model, block size and I/O depth. The theoretical disk throughput is measured by using performance-friendly factors; in real cases they won’t be that friendly and would limit the data to lower values. In fact we observe that the disk utilization usually goes up to 100% for HDD.

Network Throughput

Each DataNode is connected by a 20Gbps (2560MB/s) full duplex network, which means both the receive and transmit speed can reach 2560MB/s simultaneously. In our tests, the receive and transmit throughput are almost identical, so only the receive throughput data is listed in Table 11.

Theoretical vs. measured network throughput (MB/s): values are not reproduced here.

Table 11. Network utilization of test cases

CPU Utilization

Table 12. CPU utilization of test cases

Clearly, the performance on HDD (50G_HDD) is bottlenecked on disk throughput. However, we can see that for SSD and RAMDISK, neither disk, network nor CPU is the bottleneck. So the bottlenecks must be somewhere else: either in software (e.g. HBase compaction, memstores in regions, etc.) or in hardware other than disk, network and CPU.

To further understand why the utilization of SSD is so low and the throughput is not as high as expected (only 132% of HDD, even though SSD has much higher theoretical throughput: 447 MB/s vs. 127 MB/s per disk), we ran an additional test writing 50GB of data to four SSDs per node instead of eight SSDs per node.

Figure 4. YCSB throughput of a single storage type with 50 GB dataset stored in different number of SSDs

Figure 5. YCSB latency of a single storage type with 50 GB dataset stored in different number of SSDs

We can see that while the number of SSDs doubled, the throughput and latency of the eight-SSDs-per-node case (50G_8SSD) improve only to a much lesser extent (104% throughput and 79% latency) compared to the four-SSDs-per-node case (50G_4SSD). This means that the capability of the SSDs is far from fully used.

We dug further and found that this is caused by the current mainstream server hardware design. A mainstream server currently has two SATA controllers which can support up to 12 Gbps of bandwidth. This means that the total disk bandwidth is limited to around 1.5GB/s, which is the bandwidth of approximately 3.4 SSDs. You can find the details in the section Disk Bandwidth Limitation. The SATA controllers have become the bottleneck for eight SSDs per node, which explains why there is almost no improvement in YCSB throughput with eight SSDs per node. In the 50G_8SSD test, the disk is the bottleneck.

Go to part 5, Experiment (continued)

HDFS HSM and HBase: Cluster setup (Part 2 of 7)

This is part 2 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Cluster Setup

In all, five nodes are used in the testing. Figure 1 shows the topology of these nodes; the services on each node are listed in Table 3.

Figure 1. Cluster topology

  1. For HDFS, one node serves as NameNode, three nodes as DataNodes.

  2. For HBase, one HMaster is collocated together with NameNode. Three RegionServers are collocated with DataNodes.

  3. All the nodes in the cluster are connected to the full duplex 10Gbps DELL N4032 Switch. Network bandwidth for client-NameNode, client-DataNode and NameNode-DataNode is 10Gbps. Network bandwidth between DataNodes is 20Gbps (20Gbps is achieved by network bonding).







Table 3. Services run on each node (NameNode/HMaster on one node, DataNode/RegionServer on three nodes, plus the YCSB client)


Hardware configurations of the nodes in the cluster are listed in the following tables.

CPU: Intel® Xeon® CPU E5-2695 v3 @ 2.3GHz, dual sockets

Memory: Micron 16GB DDR3-2133MHz, 384GB in total

NIC: Intel 10-Gigabit X540-AT2

SSD: Intel S3500 800G

HDD: Seagate Constellation™ ES ST2000NM0011 2T 7200RPM

Table 4. Hardware for DataNode/RegionServer

Note: The OS is stored on an independent SSD (Intel® SSD DC S3500 240GB) for both the DataNodes and the NameNode. The number of SSDs or HDDs (OS SSD not included) in a DataNode varies between test cases; see the section ‘Methodology’ for details.




CPU: Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz, dual sockets

Memory: Micron 16GB DDR3-2133MHz, 260GB in total

NIC: Intel 10-Gigabit X540-AT2

SSD: Intel S3500 800G

Table 5. Hardware for NameNode/HBase MASTER



Processor features: Intel Hyper-Threading Tech, Intel Virtualization, Intel Turbo Boost Technology, Energy Efficient Turbo (settings not reproduced here).

Table 6. Processor configuration





OS: CentOS release 6.5

JVM Heap:

NameNode: 32GB
DataNode: 4GB
HMaster: 4GB
RegionServer: 64GB

Table 7. Software stack version and configuration

NOTE: as mentioned in Methodology, HDFS and HBase have been enhanced to support this test


We use YCSB 0.3.0 as the benchmark and use one YCSB client in the tests.

This is the workload configuration for 1T dataset:

# cat ../workloads/1T_workload












And we use the following command to start the YCSB client:

./ycsb load hbase-10 -P ../workloads/1T_workload -threads 200 -p columnfamily=family -p clientbuffering=true -s > 1T_workload.dat

Go to part 3, Tuning

HDFS HSM and HBase: Introduction (Part 1 of 7)

This is part 1 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions


As more and more fast storage types (SSD, NVMe SSD, etc.) emerge, a methodology is needed to get better throughput and latency out of big data systems. However, these fast storage types are still expensive and capacity limited. This study provides a guide for cluster setup with different storage media.

In general, this guide considers the following questions:

  1. What is the maximum performance user can achieve by using fast storage?

  2. Where are the bottlenecks?

  3. How to achieve the best balance between performance and cost?

  4. How to predict what kind of performance a cluster can have with different storage combinations?

In this study, we evaluate the HBase write performance on different storage media. We leverage the hierarchy storage management support in HDFS to store different categories of HBase data on different media.

Three different types of storage (HDD, SSD and RAMDISK) are evaluated. HDD is the most popular storage in current use; SATA SSD is a faster storage type that is becoming more and more popular. RAMDISK is used to emulate extremely high performance PCI-e NVMe based SSDs and coming faster storage (e.g. Intel 3D XPoint® based SSD). Due to hardware unavailability, we have to use RAMDISK to perform this emulation, and we believe our results hold for PCI-e SSD and other fast storage types.

Note: RAMDISK is a logical device emulated with memory. Files stored in RAMDISK will only be cached in memory.


We test the write performance in HBase with a tiered storage in HDFS and compare the performance when storing different HBase data into different storages. YCSB (Yahoo! Cloud Serving Benchmark, a widely used open source framework for evaluating the performance of data-serving systems) is used as the test workload.

Eleven test cases are evaluated in this study. We pre-split the table into 210 regions for the 1 TB dataset cases and into 18 regions for the 50 GB dataset cases to avoid region splits at runtime.

The format of case names is <dataset size>_<storage type>.

  1. 50 GB dataset, RAMDISK: store all the files in RAMDISK. We have to limit the data size to 50 GB in this case due to the capacity limitation of RAMDISK.

  2. 50 GB dataset, 8 SSDs: store all the files in SATA SSD. Compare the performance on 50 GB of data with the 1st case.

  3. 50 GB dataset, 8 HDDs: store all the files in HDD.

  4. 1 TB dataset, 8 SSDs: store all files in SATA SSD. Compare the performance on 1 TB of data with the tiered storage cases.

  5. 1 TB dataset, 8 HDDs: store all files in HDD.

  6. 1 TB dataset, 8 SSDs: tiered storage (i.e. different storage types mixed together in one test); the WAL is stored in RAMDISK, and all other files are stored in SSD.

  7. 1 TB dataset, 8 HDDs: tiered storage; the WAL is stored in RAMDISK, and all other files are stored in HDD.

  8. 1 TB dataset, 4 SSDs and 4 HDDs: tiered storage; the WAL is stored in SSD, smaller files (not larger than 1.5 GB) are stored in SSD, and all other files are stored in HDD, including all archived files.

  9. 1 TB dataset, 4 SSDs and 4 HDDs: tiered storage; the WAL and flushed HFiles are stored in SSD, and all other files are stored in HDD, including all archived files and compacted files.

  10. 1 TB dataset, 4 SSDs and 4 HDDs: tiered storage; the WAL is stored in RAMDISK, smaller files (not larger than 1.5 GB) are stored in SSD, and all other files are stored in HDD, including all archived files.

  11. 1 TB dataset, 4 SSDs and 4 HDDs: tiered storage; the WAL is stored in RAMDISK, flushed HFiles are stored in SSD, and all other files are stored in HDD, including all archived files and compacted files.

Table 1. Test Cases

NOTE: In all 1TB test cases, we pre-split the HBase table into 210 regions to avoid the region split at runtime.

The metrics in Table 2 are collected during the test for performance evaluation.



Storage media level: IOPS, throughput (sequential/random R/W)

OS level: CPU usage, network IO, disk IO, memory usage

YCSB benchmark level: throughput, latency

Table 2. Metrics collected during the tests

Go to Part 2, Cluster Setup

Friday Apr 22, 2016

HDFS HSM and HBase: Tuning (Part 3 of 7)

This is part 3 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel) 

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Stack Enhancement and Parameter Tuning

Stack Enhancement

To perform the study, we made a set of enhancements in the software stack:

  • HDFS:

    • Support a new storage RAMDISK

    • Add file-level mover support, so that a user can move the blocks of a single file without scanning all of the metadata in the NameNode

  • HBase:

    • The WAL, flushed HFiles, HFiles generated in compactions, and archived HFiles can each be stored on a different storage type (a sketch of applying directory-level storage policies with the stock HDFS tooling follows this list)

    • When an HFile is renamed across storage types, the blocks of that file are moved to the target storage asynchronously
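For context, stock HDFS (2.6 and later) already supports directory-level storage policies via the storagepolicies tool; the enhancements above add file-level moves and a RAMDISK storage type on top of that. A minimal sketch, with illustrative paths under the HBase root directory:

hdfs storagepolicies -listPolicies

# Keep WAL blocks on SSD only (path shown is the default HBase WAL directory)
hdfs storagepolicies -setStoragePolicy -path /hbase/WALs -policy ALL_SSD

# Leave archived HFiles on spinning disk
hdfs storagepolicies -setStoragePolicy -path /hbase/archive -policy HOT

hdfs storagepolicies -getStoragePolicy -path /hbase/WALs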

HDFS/HBase Tuning

This step is to find the best configurations for HDFS and HBase.

Known Key Performance Factors in HBase

These are the key performance factors in HBase:

  1. WAL: the write-ahead log guarantees the durability and consistency of data. Every record inserted into HBase must first be written to the WAL, which can slow down user operations, so it is latency-sensitive.

  2. Memstore and Flush: records inserted into HBase are first cached in the memstore, and when the memstore reaches a threshold it is flushed to a store file. A slow flush can lead to long GC (Garbage Collection) pauses and push memory usage up to the region and region server thresholds, which can block user operations.

  3. Compaction and Number of Store Files: HBase compaction merges small store files into a larger one, which reduces the number of store files and speeds up reads, but it generates heavy I/O and consumes disk bandwidth at runtime. Less compaction speeds up writes but leaves too many store files, which slows down reads. Too many store files can also slow the memstore flush, leading to a large memstore that further slows user operations. (The sketch after this list shows the kind of hbase-site.xml knobs these factors map to.)
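As a rough illustration, the factors above map to a handful of well-known hbase-site.xml parameters. The values shown here are illustrative defaults, not the tuned values from Tables 8 and 9 below:

<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.4</value> <!-- fraction of the heap shared by all memstores -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value> <!-- flush a memstore once it reaches 128 MB -->
</property>
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value> <!-- start a minor compaction once a store has this many files -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>10</value> <!-- block writes when a store accumulates this many files -->
</property>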

Based on this understanding, the following are the tuned parameters we finally used.

Table 8. HDFS configuration




3 for non-SSD test cases; 8 for all SSD related test cases.

5 for non-SSD test cases; 15 for all SSD related test cases.

Table 9. HBase configuration

Go to part 4, Experiment

Wednesday Apr 13, 2016

Investing In Big Data: Apache HBase

This is the fourth in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Lars Hofhansl and Andrew Purtell are HBase Committers and PMC members. At Salesforce.com, Lars is a Vice President and Software Engineering Principal Architect, and Andrew is a Cloud Storage Architect.

An earlier version of the discussion in this post was published here on the Salesforce + Open Source = ❤ blog.

- Andrew Purtell

Investing In Big Data: Apache HBase

(Image: "Earth at Night" / CC BY 3.0)

The world is good at making data. You can see it in every corner of commerce and industry: everything we interact with is getting smarter, and producing a massive stream of readings, geolocations, images, and more. From medical devices to jet engines, it’s transforming every part of the modern world. And it’s accelerating!

Salesforce’s products are all about helping our customers connect with their customers. So obviously, a big part of that is equipping them to interact with all of the data a customer’s interactions might generate, in whatever quantities it shows up in. We do that in many ways, across all of our products. 

In this post, we’d like to zoom in on one particular Open Source data system we use and contribute heavily to: Apache HBase. If you’re not familiar with HBase, this post will start with a few high level concepts about the system, and then go into how it fits in at Salesforce. 

What IS HBase? 

HBase is an open source distributed database. It’s designed to store record-oriented data across a scalable cluster of machines. We sometimes refer to it as a “sparse, consistent, distributed, multi-dimensional, persistent, sorted map”. This usually makes people say “Wat??”. So, to break that down a little bit:

  • distributed :  rows are spread over many machines;
  • consistent :  it’s strongly consistent (at a per-row level);
  • persistent :  it stores data on disk with a log, so it sticks around;
  • sorted : rows are stored in sorted order, making seeks very fast;
  • sparse :  if a row has no value in a column, it doesn’t use any space;
  • multi-dimensional :  data is addressable in several dimensions: tables, rows, columns, versions, etc.

Think of it as a giant list of key / value pairs, spread over a large cluster, sorted for easy access.

People often refer to HBase as a “NoSQL” store–a term coined back in 2009 to refer to a big cohort of similar systems that were doing data storage without SQL (Structured Query Language). Contributors to the HBase project will tell you they have always disliked that term, though; a toaster is also NoSQL. It’s better to say HBase is architected differently than a typical relational database engine, and can scale out better for some use cases.

Google was among the first companies to move in this direction, because they were operating at the scale of the entire web. So, they long ago built their infrastructure on top of a new kind of system (BigTable, which is the direct intellectual ancestor of HBase). It grew from there, with dozens of Open Source variants on the theme emerging in the space of a few years: Cassandra, MongoDB, Riak, Redis, CouchDB, etc. 

Why use a NoSQL store, if you already have a relational database?

Let’s be clear: relational databases are terrific. They’ve been the dominant form of data storage on Earth for nearly three decades, and that’s not by accident. In particular, they give you one really killer feature: the ability to decompose the physical storage of data into different conceptual buckets (entities, aka tables), with relationships to each other … and to modify the state of many related values atomically (transactions). This incurs a cost (for finding and re-assembling the decomposed data when you want to read it) but relational database query planners have gotten so shockingly good that this is actually a perfectly good trade-off in most cases.

Salesforce is deeply dependent on relational databases. A majority of our row-oriented data still lives there, and they are integral to the basic functionality of the system. We’ve been able to scale them to massive load, both via our unique data storage architecture, and also by sharding our entire infrastructure (more on that in The Crystal Shard, a post by Ian Varley on the Salesforce blog). They’re not going anywhere. On the contrary, we do constant and active research into new designs for scaling and running relational databases. 

But, it turns out there are a subset of use cases that have very different requirements from relational data. In particular, less emphasis on webs of relationships that require complex transactions for correctness, and more emphasis on large streams of data that accrue over time, and need linear access characteristics. You certainly can store these in an RDBMS. But, when you do, you’re essentially paying a penalty (performance and scale limitations) for features you don’t need; a simpler design can scale more powerfully.

It’s for those new use cases–and all the customer value they unlock–that we’ve added HBase to our toolkit.

Of all the NoSQL stores, why HBase?

This question comes up a lot: people say, “I heard XYZ-database is web scale, you should use that!”. The world of “polyglot persistence” has produced a boatload of choices, and they all have their merits. In fact, we use almost all of them somewhere in the Salesforce product suite: Cassandra, Redis, CouchDB, MongoDB, etc.

To choose HBase as a key area of investment for Salesforce Core, we went through a pretty intense “bake-off” process that included evaluations of several different systems, with experiments, spikes, POCs, etc. The decision ultimately came down to three big points for us:

  1. HBase is a strongly consistent store. In the CAP Theorem, that means it’s a (CP) store, not an (AP) store. Eventual consistency is great when used for the right purpose, but it can tend to push challenges up to the application developer. We didn’t think we’d be able to absorb that extra complexity, for general use in a product with such a large surface area. 

  2. It’s a high quality project. It did well in our benchmarks and tests, and is well respected in the community. Facebook built their entire Messaging infrastructure on HBase (as well as many other things), and the Open Source community is active and friendly.

  3. The Hadoop ecosystem already had an operational presence at Salesforce. We’ve been using Hadoop in the product for ages, and we already had a pretty good handle on how to deploy and operate it. HBase can use Hadoop’s distributed filesystem for persistence and offers first class integration with MapReduce (and, coming soon, Spark), so is a way to level up existing Hadoop deployments with modest incremental effort.

Our experience was (and still is) that HBase wasn’t the easiest to get started with, but it was the most dependable at scale; we sometimes refer to it as the “industrial strength” scalable store, ready for use in demanding enterprise situations, and taking issues like data durability and security extremely seriously. 

HBase (and its API) is also broadly used in the industry. HBase is an option on Amazon’s EMR, and is also available as part of Microsoft’s Azure offerings. Google Cloud includes a hosted BigTable service sporting the de-facto industry standard HBase client API. (It’s fair to say the HBase client API has widespread if not universal adoption for Hadoop and Cloud storage options, and will likely live on beyond the HBase engine proper.) 

When does Salesforce use HBase?

There are three indicators we look at when deciding to run some functionality on HBase. We want things that are big, record-oriented, and transactionally independent.

By “big”, we mean things in the ballpark of hundreds of millions of rows per tenant, or more. HBase clusters in the wild have been known to grow to thousands of compute nodes, and hundreds of Petabytes, which is substantially bigger than we’d ever grow a single one of our instance relational databases to. (None of our individual clusters are that big yet, mind you.) Data size is not only not a challenge for HBase, it’s a desired feature!

By “record-oriented”, we mean that it “looks like” a database, not a file store. If your data can be modeled as big BLOBs, and you don’t need to independently read or write small bits of data in the middle of those big blobs, then you should use a file store.

Why does it matter if things are record-oriented? After all, HBase is built on top of HDFS, which is … a file store. The difference is that the essential function that HBase plays is specifically to let you treat big immutable files as if they were a mutable database. To do this magic, it uses something called a Log Structured Merge Tree, which provides for both fast reads and fast writes. (If that seems impossible, read up on LSMs; they’re legitimately impressive.)

And, what do we mean by “transactionally independent”? Above, we described that a key feature of relational databases is their transactional consistency: you can modify records in many different tables and have those modifications either commit or rollback as a unit. HBase, as a separate data store, doesn’t participate in these transactions, so for any data that spans both stores, it requires application developers to reason about consistency on their own. This is doable, but it is tricky. So, we prefer to emphasize those cases where that reasoning is simple by design.

(For the hair-splitters out there, note that HBase does offer consistent “transactions”, but only on a single-row basis, not across different rows, objects, or (most importantly) different databases.)

One criterion for using HBase that you’ll notice we didn’t mention is that as a “NoSQL” store, you wouldn’t use it for something that called for using SQL. The reason we didn’t mention that is that it isn’t true! We built and open sourced a library called “Phoenix” (which later became Apache Phoenix) which brings a SQL access layer to HBase. So, as they say, “We put the SQL back in NoSQL”.

We treat HBase as a “System Of Record”, which means that we depend on it to be every bit as secure, durable and available as our other data stores. Getting it there required a lot of work: accounting for site switching, data movement, authenticated access between subsystems, and more. But we’re glad we did!

What features does Salesforce use HBase for?

To give you a concrete sense of some of what HBase is used for at Salesforce, we’ll mention a couple of use cases briefly. We’ll also talk about more in future posts.

The main place you may have seen HBase in action in Salesforce to date is what we call Salesforce Shield. Shield is a connected suite of product features that enable enterprise businesses to encrypt data and track user actions. This is a requirement in certain protected sectors of business, like Health and Government, where compliance laws dictate the level of retention and oversight of access to data.

One dimension of this feature is called Field Audit Trail (FAT). In Salesforce, there’s an option that allows you to track changes made to fields (either directly by users through a web UI or mobile device, or via other applications through the API). This historical data is composed of “before” and “after” values for every tracked field of an object. These stick around for a long time, as they’re a matter of historical record. That means that if you have data that changes very frequently, this data set can grow rapidly, and without any particular bound. So we use HBase as a destination for moving older sets of audit data over, so it’s still accessible via the API, but doesn’t have any cost to the relational database optimizer. The same principle applies to other data that behaves the same way; we can archive this data into HBase as a cold store.

Another part of Shield that uses HBase is the Event Monitoring feature. The initial deployment was based on Hadoop, and culls a subset of application log lines, making them available to customers directly as downloadable files. New work is in progress to capture some event data into HBase in real time, like Login Forensics, so it can be queried interactively. This could be useful for security and audit, limit monitoring, and general usage tracking. We’re also introducing an asynchronous way to query this data so you can work with the potentially large resulting data sets in sensible ways.

More generally, the Platform teams at Salesforce have been cooking up a general way for customers to use HBase. It’s called BigObjects (more here). BigObjects behave in most (but not all) ways like standard relational-database-backed objects, and are ideally suited for use cases that require ingesting large amounts of read-only data from external systems: log files, POS data, event data, clickstreams, etc.

We’re adding more features that use HBase in every release, like improving mention relevancy in Chatter, tracking Apex unit test results, caching report data, and more. We’ll post follow-ups here and on the Salesforce Engineering Medium Channel about these as they evolve.

Contributing To HBase

So, we’ve got HBase clusters running in data centers all over the world, powering new features in the product, and running on commodity hardware that can scale linearly.

But perhaps more importantly, Salesforce has put a premium on not just being a user of HBase, but also being a contributor. The company employs multiple committers on the Apache project, including both of us (Lars is a committer and PMC member, and Andrew is an Apache VP and PMC Chair). We’ve written and committed hundreds of patches, including some major features. As committers, we’ve reviewed and committed thousands of patches, and served as release managers for the 0.94 release (Lars) and 0.98 release (Andrew).

We’ve also presented dozens of talks at conferences about HBase, including HBaseCon and OSCON.

We’re committed to doing our work in the community, not in a private fork except where temporarily needed for critical issues. That’s a key tenet of the team’s OSS philosophy — no forking! — that we’ll talk about more in an upcoming post on the Salesforce Engineering Medium Channel.

By the way, outside of the Core deployment of HBase that we’ve described here, there are actually a number of other places in the larger Salesforce universe where HBase is deployed as well, including Data.com, the Marketing Cloud (more here), and Argus, an internal time-series monitoring service. More on these uses soon, too!


We’re happy to have been a contributing part of the HBase community these last few years, and we’re looking forward to even more good stuff to come.

Thursday Dec 17, 2015

Offheaping the Read Path in Apache HBase: Part 1 of 2

Detail on the work involved in making the Apache HBase read path work against off-heap memory (without copy).

Offheaping the Read Path in Apache HBase: Part 2 of 2

by HBase Committers Anoop Sam John, Ramkrishna S Vasudevan, and Michael Stack

This is part two of a two part blog. Herein we compare before and after off heaping. See part one for preamble and detail on work done.

Performance Results

There were two parts to our performance measurement.  

  1. Using HBase’s built-in Performance Evaluation (PE) tool.  

  2. Using YCSB to measure the throughput.

The PE test was conducted on a single node machine. A table was created and loaded with 100 GB of data. The table has one column family and one column per row; each cell value is 1 KB. Configuration of the node:

System configuration

CPU : Intel(R) Xeon(R) CPU with 8 cores.
RAM : 150 GB
JDK : 1.8

HBase configuration

20% of the heap is allocated to the L1 cache (LRU cache); when L2 is enabled, L1 holds no data blocks, just index and bloom filter blocks. 104448 MB (roughly 102 GB) of off-heap space is allocated to the L2 cache (BucketCache).
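For reference, a minimal hbase-site.xml sketch of an off-heap BucketCache of roughly this size (the values are illustrative; the off-heap allocation itself also needs HBASE_OFFHEAPSIZE set in hbase-env.sh):

<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>104448</value> <!-- BucketCache capacity in MB, about 102 GB -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.2</value> <!-- 20% of the heap for the on-heap L1 (index and bloom blocks) -->
</property>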

Before performing the read tests, we made sure that all data was loaded into the BucketCache so there is no disk I/O. The read workloads of the PE tool run with multiple threads, and we consider the average completion time for each thread to do the required reads.

1. Each thread performs multi-gets of 100 rows, 1,000,000 times. We see a 55 – 82% reduction in average run time. See the graph below for the test with 5 to 75 reading threads. The Y axis shows the average completion time, in seconds, for one thread.

In total, each thread performs 100,000,000 row gets. Converting this to throughput numbers, we see:

[Table: throughput without the patch, throughput with HBASE-11425, and throughput gain, at 5, 10, 20, 25, 50, and 75 threads]
So in the case without the patch, the system reaches peak load at around the 20-thread level and throughput starts to fall off. With HBASE-11425 this is not the case even with 75 threads; scaling is mostly linear as load increases. The major factor that helps us here is reduced GC activity.

2. Each thread performs a range scan of 10,000 rows with all of the data filtered out on the server side. The filtering is done to isolate the server-side gain and avoid any impact from network or client-side bottlenecks. Each thread performs the scan operation 10,000 times. We see a 55 – 74% reduction in the average run time of each thread. See the graph below for the test with 5 to 75 reading threads. The Y axis shows the average completion time, in seconds, for one thread.

3. Another range scan test is performed with part of the data returned to the client. The test returns 10% of the total rows to the client; the remaining rows are filtered out at the server. The graph below is for a test with 10, 20, 25 and 50 threads. Again, the Y axis gives the average thread completion time, in seconds. The latency gain is 28 – 39%.

The YCSB test is done on the same cluster with 90 GB of data, with system and HBase configuration similar to the PE tests.

The YCSB setup involves creating a table with a single column family and around 90 GB of data. There are 10 columns, each with 100 bytes of data (each row holding 1 KB of data). All readings are taken after ensuring that the entire data set is in the BucketCache. The multi-get test has each thread doing 100-row gets, 5,000,000 times. The range scan test does random range scans, 1,000,000 times.


[Table: YCSB throughput in ops/sec, without and with HBASE-11425, for the multi-get and range scan workloads]
For multi-get there is a 20 – 160% throughput gain, whereas for range scan it is 20 – 80%.

With HBASE-11425 we see a linear throughput increase with more threads, whereas the old code starts performing badly at higher thread counts (see the 50- and 75-thread results).

GC graphs

With HBASE-11425, we serve data directly from the off heap cache rather than copy each of the blocks on heap for each of the reads.  So we should be doing much better with respect to GC on the  RegionServer side.  Below are the GC graph samples taken on one RegionServer during the PE test run.  We can clearly notice that with the changes in HBASE-11425, we are doing much less (better) GC.

MultiGets – without HBASE-11425 (25 threads)


Multigets – with HBASE-11425(25 threads)


ScanRange10000 – without HBASE-11425 (20 threads)


ScanRange10000 – with HBASE-11425 (20 threads)


Future Work

One implication of the findings above is that we should always run with the off-heap cache on. We were reluctant to do this in the past, when reads from the off-heap cache took longer; this is no longer the case. We will look into making this option on by default. Also, this post has described our conversion of the read pipeline to run against off-heap buffers. Next up, naturally, is making the write pipeline off heap.


Some parts of this work made it into branch-1, but to run with a fully off-heap read path, you will have to wait for the HBase 2.0 release, which should be available early next year (2016). Enabling L2 (BucketCache) in off-heap mode will automatically turn on the off-heap mechanisms described above. We hope you enjoy these improvements to the Apache HBase read path.

Thursday Oct 08, 2015

Imgur Notifications: From MySQL to HBase

This is the third in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Carlos J. Espinoza is an engineer at Imgur.

An earlier version of the discussion in this post was published here on the Imgur Engineering blog.

- Andrew Purtell

Imgur Notifications: From MySQL to HBase

Imgur is a heavy user of MySQL. It has been a part of our stack since our beginning. However, with our scale it has become increasingly difficult to throw more features at it. For our latest feature upgrade, we re-implemented our notifications system and migrated it over from MySQL to HBase. In this post, I will talk about how HBase solved our use case and the features we exploited.

To add some context, previously we supported two types of notifications: messages and comment replies, all stored in MySQL. For this upgrade, we decided to support several additional notification types. We also introduced rules around when a notification should be delivered. This change in spec made it challenging to continue with our previous model, so we started from scratch.

Early in the design phase, we envisioned a world where MySQL remained the primary store. We put together some schemas, some queries and, for the better, stopped ourselves from making a huge mistake. We would have had to create a couple of columns for every type of notification, and creating a new notification type afterwards would mean a schema change. Our select queries would require us to join against other application tables. We designed an architecture that could work, but we would sacrifice decoupling, simplicity, scalability, and extensibility.

Some of our notifications require that they only be delivered to the user once a milestone is crossed. For instance, if a post reaches 100 points, Imgur will notify the user at the 100 point mark. We don’t want to bother users with 99 notifications in between. So, at scale, how do we know that the post has reached 100 points?

A notification could have multiple milestones. We only want to deliver once the milestone is hit.

Considering MySQL is our primary store for posts, one way to do it is to increment the points counter in the posts table, then execute another query fetching the row and check if the points reached the threshold of 100. This approach has a few issues. Increment and then fetch is a race condition. Two clients could think they reached 100 points, delivering two notifications for the same event. Another problem is the extra query. For every increment, we must now fetch the volatile column, adding more stress to MySQL.

Though it is technically possible to do this in MySQL using transactions and consistent read locks, lock contention would make it potentially very expensive for votes, since voting is one of the most frequent operations on our site. Seeing as we already use HBase in other parts of our stack, we switched gears and built our system on top of it. Here is how we use it to power notifications in real time and at scale.

Sparse Columns

At Imgur, each notification is composed of one or more events. The following image of our notifications dropdown illustrates how our notifications are modeled.

As illustrated, each notification is composed of one or more events. A notification maps to a row in HBase, and each event maps to multiple columns, one of which is a counter. This model makes our columns very sparse as different types of notifications have different types of events. In MySQL, this would mean a lot of NULL values. We are not tied to a strict table schema, so we can easily add other notification types using the same model.

Atomic Increments

HBase has an atomic increment operation that returns the new value in the same call. It’s a simple feature, but our implementation depends on this operation. It allows our notification delivery logic on the client to be lightweight: increment, and deliver the notification if and only if a milestone is crossed. No extra calls. In some cases, this means we now keep two counters: for instance, the points count in the MySQL table for posts, and the points count in HBase for notifications. Technically, they could get out of sync, but this is an edge case we optimize for.
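A minimal sketch of that delivery check using the HBase client API (the column family, qualifier, and milestone here are hypothetical, not Imgur's actual schema):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MilestoneNotifier {
    // Hypothetical column family and counter qualifier.
    private static final byte[] EVENTS_CF = Bytes.toBytes("e");
    private static final byte[] POINTS_COL = Bytes.toBytes("points");

    /** Atomically bump the counter and report whether this call crossed the milestone. */
    public static boolean incrementAndCheck(Table notifications, byte[] notificationRow,
                                            long milestone) throws IOException {
        long newValue =
            notifications.incrementColumnValue(notificationRow, EVENTS_CF, POINTS_COL, 1L);
        // Only one increment can observe newValue == milestone, so delivery happens exactly once.
        return newValue == milestone;
    }
}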

Another benefit of doing increments in HBase is that it allows us to decouple the notifications logic from the application logic. All we need to know to deliver a notification is whether its counter has crossed a pre-defined threshold. For instance, we no longer need to know how to fetch a post and get its point count. HBase has allowed us to solve our problem in a more generic fashion.

Fast Table Scans

We also maintain a secondary order table. It stores notification references ordered by when they were last delivered using reversed timestamps. When users open their notifications dropdown, we fetch their most recent notifications by performing a table scan limited by their user ID. We can also support scanning for different types of notifications by using scan filters.
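A rough sketch of such a scan, assuming row keys of the form userId + (Long.MAX_VALUE - deliveredAt) so the newest notifications sort first (the key layout and table are illustrative, not Imgur's actual schema):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RecentNotifications {
    /** Print up to 'limit' of the user's most recently delivered notification references. */
    public static void printRecent(Table orderTable, long userId, int limit) throws IOException {
        // All of a user's rows share the userId prefix; reversed timestamps put newest rows first.
        Scan scan = new Scan();
        scan.setRowPrefixFilter(Bytes.toBytes(userId));
        scan.setCaching(limit);
        int seen = 0;
        try (ResultScanner scanner = orderTable.getScanner(scan)) {
            for (Result row : scanner) {
                if (seen++ >= limit) {
                    break;
                }
                System.out.println(Bytes.toStringBinary(row.getRow()));
            }
        }
    }
}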

Other Benefits

With HBase we gain many other benefits, like linear scalability and replication. We can forward data to another cluster and run batch jobs on that data. We also get strong consistency. It is extremely important for notifications to be delivered exactly once when a milestone is crossed. We can make that guarantee knowing that HBase has strong row level consistency. It’s likely that we’ll use versioning in the future, but even without use of it, HBase is a great choice for our use case.

Imgur notifications is a young project, and we’ll continue to make improvements to it. As it matures and we learn more from it, I hope to share what we’ve built with the open source community.

Thursday Sep 03, 2015

Medium Data and Universal Data Systems

This is the second in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Matthew Hunt is an architect at Bloomberg who works on Portfolio Analytics and Bloomberg infrastructure.

An earlier version of the discussion in this post was published here on the High Scalability blog.

- Andrew Purtell


Why Bloomberg uses HBase

The Bloomberg terminal is the dominant news and analytics platform in finance, found on nearly every trader's desk around the world. We are actively working to consolidate around fewer, faster, and simpler systems and see HBase as part of a broader whole, along with computation frameworks such as Spark, resource schedulers like Mesos, virtualization and containerization in OpenStack and Docker, and storage with Ceph and HDFS. This is a substantial undertaking: Bloomberg has tens of thousands of databases and applications built over 30 years, and a team of 4000 engineers.

While HBase was designed for a specific purpose originally, we believe it will grow to become a general purpose database - in fact, a universal one.

What is a universal database?

The great Michael Stonebraker wrote a few years ago that the era of "one size fits all" databases was over. Given that relational, time series, and hierarchical systems have long co-existed, it is questionable whether there ever was a one-size-fits-all period. In fact, the era of one size fits all is, if anything, dawning. Stonebraker's central point - purpose built databases are often significantly better at their purpose than a general alternative - is true. But in the broader landscape, it doesn't matter. The universal database is one that is good enough to subsume nearly all tasks in practice. The distinction between data warehouse, offline batch reporting, and OLTP is already blurry and will soon vanish for most use cases. Slightly more formally, the universal database is one that is fast and low latency for small and medium problems, scales elastically to larger sizes transparently, has clear well understood access semantics as in MPP SQL systems, and connects to standard computation engines such as Spark. Many systems are converging towards this already, and we think HBase is one of the future contenders in this space.

In a strong vote of confidence, HBase has seen increasing adoption and endorsement from major players such as Apple and Google, and Spark integration from Huawei, Cloudera, and Hortonworks, continuing to build and foster the large community that success requires. At the end of the day, the technology projects that succeed are the ones that attract a critical mass. Longevity and future direction are essential considerations in large scale design decisions.

But the future can be cloudy. We also use HBase because, with minor modifications, it is a good fit for practical applications we had initially, once it received some buffing for what I call the medium data problem.

Why don't we use HBase?

No honest engineering writeup is complete without a perusal of alternatives. Many fine and more mature commercial systems exist, along with good open alternatives such as Cassandra. While we think HBase has critical mass and a broad and growing community, other systems have strengths too. HBase can be cumbersome to set up and configure, lacks a standard high level access language incorporated by default, and still has work to do on efficient interfaces to Spark. But the pace of development is swift, and we believe this is just scratching the surface of what is to come as it matures. I look forward to a subsequent blog post detailing these future improvements and our next generation of systems.

To the community

All real projects stem from the combined efforts of many. I would like to thank the incredible community of people who have participated in moving things forward, from the strong leadership of Michael Stack and Andrew Purtell to the fine people at Cloudera and Hortonworks and Databricks. Strong managerial support inside of Bloomberg has impressively cleared the path. Carter Page at Google and Andrew Morse at Apple offered early encouragement. And in particular, Sudarshan Kadambi has been a true partner in crime for shared dreams, vision and execution. Thank you all.

What follows is a deep dive into our original HBase projects and the thinking that lay behind them.

MBH, 8/2015

Search engines and large web vendors serve hundreds of millions of requests for straightforward web lookups. Bloomberg serves hundreds of thousands of sophisticated financial users where each request has tight tolerances and may be computationally expensive. Can the systems originally designed for the former work for the latter?

"Big Data" systems continue to attract substantial funding, attention, and excitement. As with many new technologies, they are neither a panacea, nor even a good fit for many common uses. Yet they also hold great promise, and Bloomberg has been contributing behind the scenes to strengthen their use for what we term medium data – high terabytes on modest clusters - and real world scenarios as we build towards simpler universal systems. The easiest way to understand these changes is to start with a bit of background on the origin of both these technologies and our company.

[This was taken from a talk called "Hadoop at Bloomberg", but it’s important to understand that we see this as part of a broader landscape that encompasses tools such as Spark, Mesos, and Docker which are working to provide frameworks for making many commodity machines work together.]

Bloomberg Products and Challenges

Bloomberg is the world’s premier provider of information and analytics to the financial world. Walk into nearly any trading floor in the world, and you will find "Bloomberg Terminals" in use. Simply put, data is our business, and our 4,000 engineers have built many systems over the last 30 years to address the stringent performance and reliability needs of the financial industry. We are always looking at ways to improve our core infrastructure, including consolidating many purpose built disparate systems to reduce complexity.

Our main product is the Bloomberg Professional service, which many refer to simply as “The Terminal”, shown below.


While the name of the Bloomberg Terminal is carried over from when we built hardware, it is analogous to a browser in the cloud: there is an "address bar" where a function name can be typed in, display areas where the results are shown, and the logic and heavy lifting is done server side in our data centers. Of course we don’t refer to them as browsers and clouds because we’ve been doing it since long before those words existed, but since the browser is ubiquitous, this should give everyone a good idea of the basics. There are also mobile versions available on the iPhone, iPad, and Android. There are perhaps 100,000 functions and tens of thousands of databases. Below is a sample of a few of them.


Chart of a single security
Storm track showing oil wells and shipping and companies likely to be affected
News and sentiment analysis of twitter feed
Portfolio and risk analytics

Bloomberg’s origins were in providing electronic analytics for bonds and other securities. "Security" is just a fancy word for instruments you can trade like stocks or bonds. It transformed a world that was paper driven and information poor into one in which prices were available at the touch of a button. As nearly every school, road, airport, and other piece of modern infrastructure is financed by bond issuance, this was an important and useful innovation that reduced the cost of financing. Prior to that time, the cost of a bond was more or less what your sales guy said it was.

In the early 1980s when the company began, there were few off the shelf computer technologies adequate to the task. Commercial databases were weak, and personal business computers nonexistent. Even the IBM PC itself was only announced two months before Bloomberg LP was founded. The name of our main product, the Bloomberg Terminal, reflects our history of needing to build hardware at the outset, along with developing the software and database systems in house. And so out of necessity we did.

As the company expanded, its infrastructure grew around powerful Unix machines and fast reliable services designed to furnish almost any kind of analysis there is about a company or security. The company grew and prospered, becoming one of the largest private companies in the United States, and entering news and other lines of business. Owning our own systems has conferred substantial benefits aside from simply making our existence possible. Independent of the flexibility, not paying vendors such as Oracle billions of dollars for our tens of thousands of databases is good business.

As time has passed, the sophistication of our usage has grown. Interactive functions are frequently geared towards complicated risk and other multi-security analytics, which require far more data and computation per user and request. This puts a strain on systems architected when interactive requests were for charting the price of a small number of instruments. Many groups within Bloomberg have addressed this by copying systems and caching large quantities of data. There are also different data systems for different kinds of usage – one for streaming pricing data, another for historical data, and so on. Complexity kills. The more pieces and moving parts there are, the harder it is to build and maintain. Can we consolidate around a small number of open platforms? We believe we can.

Big Data Origins

Modern era big data technologies are a solution to an economics problem faced by Google and others a decade ago. Storing, indexing, and responding to searches against all web pages required tremendous amounts of disk space and computer power. Very powerful machines, fast SAN storage, and data center space were prohibitively expensive. The solution was to pack cheap commodity machines as tightly together as possible with local disks.

This addressed the space and hardware cost problem, but introduced a software challenge. Writing distributed code is hard, and with many machines comes many failures, and so a framework was also required to take care of such problems automatically for the system to be viable. Google pioneered one of the first viable commodity machine platforms to do this, which is part of why they went on to dominance rather than bankruptcy. Fortunately for the rest of us, they published how they did it, and while there are more sophisticated means available today, it’s a great place to start an understanding of concepts.

The essential elements of this system are a place to put files, and a way to perform computations. The landmark Google papers introduced GFS for storage and MapReduce for computation. MapReduce had high reliability and throughput, owning records for benchmarks like sorting a petabyte of data at the time. However, its response time is far too long for interactive tasks, and so BigTable, a low latency database, was also introduced. Google published papers detailing these systems, but did not release their software. This in a nutshell explains Hadoop, which began as the open source implementation of these Google systems, supported by Yahoo and other large web vendors facing the same challenges at the time.

Understanding the why and how these systems were created also offers insight into some of their weaknesses. Their goal was to make crunching hundreds of petabytes of data circa 2004 on thousands of machines both cost effective and reliable, especially for batch computations. A system that ropes together thousands of machines for an overnight computation is no good if it gets most of the way there but never finishes.

And therein lies the problem. Most people don’t have petabytes of data to crunch at a time. The number of companies who have faced this has been restricted to a handful of internet giants. Similarly, these solutions are architected around the limitations of commodity machines of the day, which had single core processors, spinning rust disks, and far less memory. A decade later, high core counts, SSDs or better, and large RAM footprints are common. And Hadoop in particular is also written in Java, which creates its own challenges for low latency performance.

Why Hadoop? It’s not just about Hadoop

Choosing the right system is more than a strictly technical decision. Is it well supported? Will it be around for the future? Does it address enough problems to be worthwhile? X86 chips didn’t begin as the best server CPU. Linux was a toylike operating system at its outset. What has made them dominant is sustained effort, talent, and money from an array of companies and people. We believe the same will happen with some variant of these tools, immature though they may be. Intel has presumably invested for the same reason – if systems of the future are built around a distributed framework, they want to be in on the game.

Big data systems are widely used and money and talent have been pouring in. Recently, Cloudera raised $950 million dollars, of which $750 million came from Intel. Hortonworks has also had some impressive fundraising. More than a billion dollars into just two companies goes a long way. Skilled talent such as Marcel Kornacker, who architected the F1 system at Google, Jeff Hammerbacher, the founder of the data science team at Facebook, Mike Olson, an old database hand, and members of the Yahoo data sciences team such as Arun Murthy help the ecosystem.

It’s not just about Hadoop for us. We group Spark, Mesos, Docker, Hadoop, and other tools into the broader group of "tools for making many machines work together seamlessly". Hadoop has some useful properties that make it a good starting point for discussion in addition to being useful in its own right.

We’d like to consolidate our systems around an open commodity framework for storage and computation that has broad support. Hadoop and its ilk provide potential but are immature and have weaknesses when it comes to taking advantage of modern hardware, consistently fast performance, high resiliency, and in the number of people who know it well.

So that’s the background. Now on to a specific use case and how we’ve tackled it with HBase.

HBase Overview


HBase is short for the "Hadoop database", and it’s the implementation of Google’s BigTable that was designed for interactive requests that MapReduce wasn’t suited for. Many of us think of databases as things with many tables and indexes that support SQL and relational semantics. HBase has a simpler model designed to spread across many machines.

The easiest analogy is with printed encyclopedias. Each volume of the encyclopedia lists the range of words that can be found inside, such as those beginning with the letters A to C. In HBase, the equivalent of a volume is called a region. A region server is a process that hosts some number of regions, much as a bookcase can hold several books. There is typically one region server process per physical computer. The diagram above shows three machines, each with one region server process, each of which has two regions.

A table in HBase can have many columns and rows but only one primary key. The table is always physically sorted by that key, just as the encyclopedia is, and many of its characteristics follow from this. For example, writes are naturally parallel: A-C words go to one machine, and C-F to another. Elastic scalability follows too: drop another machine in, and you can move some of the regions to it. And if a region gets too big, it splits into two new ones, so very large datasets can be accommodated. These are all positive properties.

But there are downsides, including the loss of things taken for granted in a relational database. HBase is less efficient per machine than a single-server database. Updates of individual rows are atomic, but there aren’t all-or-nothing complex transactions that can be rolled back. There are no secondary indexes, and lookups not based on the primary key are much slower. Joins between multiple tables are cumbersome and inefficient. These systems are simply much less mature than relational databases, and understood by far fewer people. While HBase is growing in sophistication and maturity, if your problem fits in a relational database on a single machine, then HBase is almost certainly a poor choice.

When considering broad architectural decisions, it is important to consider alternate solutions. There are many strong alternatives beyond known relational systems. Products such as Greenplum, Vertica, SAP Hana, Sybase IQ and many others are robust, commercially supported, and considerably more efficient per node, with distributed MPP SQL or pure columnar stores. Cassandra offers a similar model to HBase and is completely open, so why not use it?

Our answer is that we have big picture concerns. We prefer open systems rather than being locked in to a single vendor. Hadoop and other systems are part of an expanding ecosystem of other services, such as distributed machine learning and batch processing, that we also want to benefit from. While the alternatives are excellent data stores, they only offer a smaller piece of the puzzle. And finally, we have actual use cases that happen to fit reasonably well and are able to address many of the downsides.

So on to one of our specific use cases.

Time series data and the Portfolio analytics challenge



Much of financial data is time series data. While the systems to manage this data have often been highly sophisticated and tuned, the data itself is very simple, and the problem has been one of speed and volume. Time series data for a security is generally just a security, a field, a date, and a value. The security could be a stock such as IBM, the field the price, and the date and value self-explanatory.

The reason there is a speed and volume issue is that there are thousands of possible fields – volume traded, for example – and tens of millions of securities. The universe of bonds alone is far larger than that of stocks, which we call equities. These systems have also historically been separated into intraday and historical data – intraday captures all the ticks during the day, while historical systems have end of day values that go back much farther in time. Unifying them was impractical in the past owing to different performance tradeoffs: intraday must manage a very high volume of incoming writes, while end of day has batch updates on market close but more data and searches in total.

The Bloomberg end of day historical system is called PriceHistory. This is a slight misnomer since price is just one of thousands of fields, but price is obviously an important one. This system was designed for the high volume of requests for single securities and quite impressive for its day. The response time for a year’s data for a field on a security is 5ms, and it receives billions of hits a day, around half a million a second at peak. It is a critical system for Bloomberg; failures there have the potential to disrupt the capital markets.

The challenge comes from applications like Portfolio Analytics that request far more data at a time. A bond portfolio can easily have tens of thousands of bonds in it, and a common calculation like attribution requires 40 fields per day per instrument. Performing attribution for a bond portfolio over a few years of data requires tens of millions of datapoints instead of a handful. When requesting tens of millions of datapoints, even a high cache hit rate of 99.9% will still have thousands of cache misses, and if the underlying system is backed by magnetic media, this means thousands of disk seeks. Multiply across a large user base making many requests and it’s clear to see why this created a problem for the existing price history system.

The solution a few years back was to move to all memory and solid state machines with highly compacted blobs split into separate databases. This approach worked and was 2-3 orders of magnitude faster – a giant leap – but less than ideal. Compacted blobs split across many databases is cumbersome.

Enter HBase.

HBase Suitability and Issues

Time series data is both very simple and lends itself to embarrassing parallelism. When fetching tens of millions of datapoints for a portfolio, the work can be arbitrarily divided among tens of millions of machines if need be. It doesn’t get any better than that in terms of potential parallelism. With a single main table trivially distributed, HBase should be a good fit – in theory.

Of course, in theory, theory and practice agree, but in practice, they don’t always. The data sets may work, but what are the obstacles? Performance, efficiency, maturity, and resiliency. What follows is how we went about tackling them.

Failure and recovery


The first and most critical problem was failure behavior. When an individual machine fails, data is not lost – it is replicated on several machines via HDFS – but there is still a serious drawback. At any given time, all reads and writes to any given region are managed by one and only one region server. If a region server dies, the failure will be detected and a replacement automatically brought back up, but until it is, no reads or writes are possible for the regions it managed.

The HBase community has made great strides in speeding up recovery time after a failure is detected, but this still leaves a crucial hole. The crux of the problem is that failure detection in a distributed system is ultimately accomplished by timeout – some period of time that elapses where a heartbeat or other mechanism fails. If this timeout is made too short, there will be false positives, which are also problematic.

In Bloomberg’s case, our SLAs are on the order of milliseconds. The shortest that timeouts can be made are on the order of many seconds. This is a fundamental mismatch even if recovery after failure detection becomes infinitely fast.

Discussions with Hadoop vendors resulted in several possible solutions, most of which involved running multiple copies of each database. Since Bloomberg’s operational policies already dictate that systems can survive the loss of entire datacenters and then several additional machines, this would have meant many running copies of each database, which we deemed too operationally complex. Ultimately, we were unhappy with the alternatives, and so we worked to change them.

If failover detection and recovery can never be fast enough, then the alternative is there must be standby nodes that can be queried in the event a region server goes down. The insight that made this practical in the short term is that the data already exists in multiple places, albeit with a possible slight delay in propagation. This can be used for standby region servers that at worst lag slightly behind but receive events in the same order, known as timeline consistency. This works for end of day systems like PriceHistory that are updated in batch.

Timeline consistent standby region servers were added to HBase as part of HBASE-10070.
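On the client side, opting into these standby (replica) reads is a per-request setting. A minimal sketch, with a hypothetical table and row:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class TimelineReads {
    /** Read a row, allowing a possibly slightly stale answer from a replica region server. */
    public static Result readWithFallback(Table table, byte[] row) throws IOException {
        Get get = new Get(row);
        get.setConsistency(Consistency.TIMELINE); // permit secondary replicas to answer
        Result result = table.get(get);
        // result.isStale() reports whether a secondary replica (possibly lagging) served the read.
        return result;
    }
}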

Performance I: Distribution and parallelism


With failures addressed, though, there are still issues of raw performance and consistency. Here we’ll detail three experiments and outcomes on performance. Tests were run on a modest cluster of 11 machines with SSDs, dual Xeon E5’s, and 128 GB of RAM each. In the top right hand corner of the slide, two numbers are shown. The first is the average response time at the outset of the experiment, and the second the new time after improvements were made. The average request is for around 200,000 records that are random key-value lookups.

Part of the premise of HBase is that large Portfolio requests can be divided up and tackled among existing machines in a cluster. This is part of why standby region servers are mandated to address failures: if a request hits every server, and one doesn’t respond for a minute or more, it’s bad news. The failure of any one server would stall every such request.

So the next thing to tackle was parallelism and distribution. The first issue was in dividing work evenly: given a large request, did each machine in the cluster get the same size chunk of work? Ideally, requesting 1000 rows from ten machines would get 100 rows from each. With a naive schema, the answer is no: if the row key is security + something else, then everything from IBM will be in one region, and IBM is hit more often than other securities. This phenomenon is known as hotspotting.

The solution is to take advantage of the properties of HBase itself. Data in HBase is always kept physically sorted by the row key. If the row key itself is hashed, and that hash is used as a prefix, then the rows will be distributed differently. Imagine the row key is security+year+month+field. Take an MD5 hash of that, use it as a prefix, and any one record for IBM is equally likely to be on any server. This resulted in an almost perfect distribution of work load. We don’t use the MD5 algorithm exactly, because it’s too slow and large, but the concept is the same.
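A minimal sketch of this kind of hash-prefixed row key (as noted above, the production system does not use MD5 exactly; the key layout here is illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SaltedKeys {
    /** Build a row key of the form hashPrefix + security + year + month + field. */
    public static byte[] rowKey(String security, int year, int month, String field)
            throws NoSuchAlgorithmException {
        String logicalKey = security + "|" + year + "|" + month + "|" + field;
        byte[] logical = logicalKey.getBytes(StandardCharsets.UTF_8);
        // Hash the logical key and keep a short prefix so rows spread evenly across regions.
        byte[] digest = MessageDigest.getInstance("MD5").digest(logical);
        byte[] key = new byte[2 + logical.length];
        System.arraycopy(digest, 0, key, 0, 2);            // 2-byte hash prefix
        System.arraycopy(logical, 0, key, 2, logical.length);
        return key;
    }
}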

With perfect distribution, we have perfect parallelism: With 11 servers each one was doing the same amount of work not just on average, but for every request. And if parallelism is good, then more parallelism is better, right? Yes - up to a point.

As mentioned earlier a general Hadoop weakness is that it was designed for older and less powerful machines. One of the ways this manifests itself is through low resource consumption on the server. At maximum throughput, none of the hardware was breathing hard – low CPU and disk and network, no matter how many configuration options were changed.

One way to address this is to run more region servers per machine, provided the machine has the capacity to do so. This will increase the number of total region servers and hence the level of fan out for parallelism, and if more parallelism is better, then response times will drop.

The chart from the slide shows the results. With 11 machines and 1 region server each – 11 in total – average response time was 260 ms. Going to 3 per machine, for 33 region servers in total, dropped the average time to 185 ms. 5 region servers per machine further improved to 160 ms, but at 10 per machine times got worse. Why?

The first answer that will probably spring to mind is that the capacity of the machine was exceeded; with 10 region server processes on a box, each with many threads, perhaps there weren’t enough cores. As it turns out, this is not the cause, and the actual answer is more subtle and interesting and resulted in another leap in performance that is applicable to many systems of this type.

But before we get into that, let’s talk about in place computation.

Performance II: In place computation


One of the tenets of big data is that the computation should be moved to the data and not the other way around. The idea here is straightforward. Imagine you need to know how many rows are in a large database table. You could fetch every row to your local client, and then iterate through and count – or you could issue a “select count(*) from ... “ if it’s a relational database. The latter is much more efficient.

It’s also much more easily said than done. Many problems can’t be solved this way at all, and many frameworks fall short even where the problem is known to be amenable. In its 2013 report on massive data analysis (a relevant and interesting read), the National Research Council proposed seven "computational giants" in an attempt to categorize the capabilities of a distributed system, and Hadoop can’t handle all of them. Fortunately for us, we have a simple use case that is amenable to in-place computation, and it let us again cut response times in half.

The situation is that there are multiple "sources" for data. Barclays and Merrill Lynch and many other companies provide data for bond pricing. The same bond for the same dates is often in more than one source, but the values don’t agree. Customers differ on the relative precedence of each, and so each request includes a waterfall, which is just an ordering of the sources to try: get all the data you can from the first source, then for anything that’s missing try the next source, and so on. On average, a request must go through 5 such sources until all the data is found.

In the world of separate databases, different sources are in different physical locations, and so this means trying the first database, pulling all the records back, seeing what’s missing, constructing a new request, and issuing that.

For HBase, we can again take advantage of physical sorting, support for millions of columns, and in-place computation through coprocessors, which are the HBase equivalent of classic stored procedures. By making the source and field into columns, with the security and date still in the row key, all the candidate values are collocated on the same region server. The coprocessor logic then becomes "find the first source for this row that matches for this security on this date, and return that." Instead of five round trips, there will always be exactly one. A small coprocessor of a few lines worked as expected, and average response time dropped to 85 ms.
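Coprocessor deployment details aside, the per-row selection logic described above might look roughly like the following sketch. The column family "d" and the "source:field" qualifier convention are our assumptions for illustration, not the actual schema:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch of the per-row selection a coprocessor could run server-side.
// The waterfall is the client-supplied source precedence; for each requested field,
// the first source in the waterfall that has a value wins.
public class WaterfallSelector {
    private static final byte[] FAMILY = Bytes.toBytes("d");

    static Map<String, byte[]> select(Result row, List<String> waterfall, List<String> fields) {
        Map<String, byte[]> chosen = new HashMap<>();
        for (String field : fields) {
            for (String source : waterfall) {
                Cell cell = row.getColumnLatestCell(FAMILY, Bytes.toBytes(source + ":" + field));
                if (cell != null) {
                    chosen.put(field, CellUtil.cloneValue(cell));
                    break; // first matching source wins; no extra round trips to other databases
                }
            }
        }
        return chosen;
    }
}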

Now, on to the mystery from before.

Performance III: Synchronized disruption


The results from increasing the number of region servers left a mystery: Why did response times improve at first, and then worsen? The answer involves both Java and an important property of any high fan out distributed system.

A request that goes to more than one machine "fans out". The more machines hit, the higher the degree of fan out. In a request that fans out, how long does it take to get a response? The answer is that it takes as long as the slowest responder. With ten region servers per machine and 11 machines, each request fanned out to 110 processes. If 109 respond in a microsecond and 1 in 170 ms, the response time is 170 ms.

The immediate culprit turned out to be Java’s garbage collection, which freezes a process until it completes. The higher the degree of fan-out, the greater the likelihood that at least one region server process was garbage collecting at the time. At 110 region server processes, the odds were close to 1.
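To see why the odds approach 1, here is a back-of-the-envelope calculation with an assumed, purely illustrative pause fraction: if each region server process spends just 2% of its time paused for GC, the probability that at least one of 110 independent processes is paused at any given instant is

1 - (1 - 0.02)^110 ≈ 0.89

and at a 5% pause fraction it exceeds 0.99.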

The solution here is devilishly simple. As long as the penalty must be paid if any one Java process garbage collects, why not just make them do it all at the same time? That way, more time will elapse where requests are issued and no garbage collection takes place. We wrote the simplest, dumbest test for this: a coprocessor with one line – System.gc() – and a timer that called it every 30 seconds. This dropped our average response time from 85 ms to 60 ms on the first go.
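A minimal sketch of that experiment (not the actual code) follows, assuming the HBase 1.x Coprocessor start/stop hooks; the class and field names are ours:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.CoprocessorEnvironment;

// Force a full GC on a fixed 30-second schedule from inside the region server,
// so that all region servers pause at roughly the same time.
public class SynchronizedGcCoprocessor implements Coprocessor {
    private ScheduledExecutorService timer;

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        timer = Executors.newSingleThreadScheduledExecutor();
        // The one line that matters: ask the JVM to collect every 30 seconds.
        timer.scheduleAtFixedRate(System::gc, 30, 30, TimeUnit.SECONDS);
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        timer.shutdownNow();
    }
}

For the pauses to actually coincide across region servers the schedule would also need to be anchored to wall-clock time, and the JVMs must not run with -XX:+DisableExplicitGC, which turns System.gc() into a no-op.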

Many systems developers’ and C programmers’ eyeballs will roll at this, since it looks like a Java-specific problem. While Java GC is the issue here, the point has wider applicability. It is a general case of what Google’s Jeff Dean calls Synchronized Disruption. Requests in any parallel system that fan out suffer from the laggard, and so things that slow down individual machines should be done at the same time. For more details, see the excellent writeup "Achieving Rapid Response Times in Large Online Services", which covers this and other properties of high fan out systems. That said, Java is now the basis of quite a number of high fan out computing systems, including Hadoop, HBase, Spark, and Solr, and synchronizing garbage collection holds potential for them as well.

Conclusions And Next Steps

As these experiments have shown, performance from HBase can be improved. The changes described here alone cut average response times to a quarter of what they were, with plenty of room left for further improvements. Faster machines are available that would halve response times again. (Incidentally, there is a myth that the speed of an individual machine is irrelevant because it’s better to scale horizontally with more machines. Individual machine speed still matters immensely: the length of a garbage collection cycle, for example, is linearly proportional to single-threaded CPU performance, and even if your system doesn’t use Java, the laggard responder may well be limited by machine speed.) HBase produces far more garbage than it ought to; this too can be fixed. And reissuing requests to an alternate server after a delay – a technique also mentioned in Dean’s paper – can not only lower average response times but also reduce or eliminate outliers, which is equally important.

We’ll do some of these, but we are nearing the point of diminishing returns. Below a threshold of some tens of milliseconds, the difference is imperceptible for a person.

On the flip side, there are applications for which this would never make sense. High frequency trading, where the goal is to shave microseconds away, is a poor fit.

More importantly, our goal is more than performance alone. While this setup with HBase provides a thousandfold improvement on write performance and 3x faster reads, it is a connector to a broader distributed platform of tools and ideas.

Wednesday Jun 10, 2015

Scalable Distributed Transactional Queues on Apache HBase

This is the first in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Today's entry is a guest post by Terence Yim, a Software Engineer at Cask, responsible for designing and building realtime processing systems on Hadoop/HBase, originally published here on the Cask engineering blog.

- Andrew Purtell

A real time stream processing framework usually involves two fundamental constructs: processors and queues. A processor reads events from a queue, executes user code to process them, and optionally writes events to another queue for additional downstream processors to consume. Queues, which are provided and managed by the framework, transfer data and act as a buffer between processors, so that the processors can operate and scale independently. For example, a web server access log analytics application can look like this:


One key differentiator among various frameworks is the queue semantics, which commonly varies along these lines:

  • Delivery Guarantee: At least once, at most once, or exactly once.
  • Fault Tolerance: Failures are transparent to the user, with automatic recovery.
  • Durability: Data can survive failures and restarts.
  • Scalability: Characteristics and limitations when adding more producers/consumers.
  • Performance: Throughput and latency of queue operations.

In the open-source Cask Data Application Platform (CDAP), we wanted to provide a real-time stream processing framework that is dynamically scalable, strongly consistent, and has an exactly-once delivery guarantee. With such strong guarantees, developers are free to perform any form of data manipulation without worrying about inconsistency, potential reprocessing, or failure. This helps developers build big data applications even if they do not have a strong distributed systems background. Moreover, it is always possible to relax these strong guarantees in exchange for higher performance if needed; that is much easier than going the other way around.

Queue Scalability

The basic operations that can be performed on a queue are enqueue and dequeue. Producers write data to the head of the queue (enqueue) and consumers read data from the tail of the queue (dequeue). We say a queue is scalable when the queue as a whole can enqueue faster by adding more producers and dequeue faster by adding more consumers. Ideally, the scaling is linear, meaning that doubling the number of producers/consumers doubles the rate of enqueue/dequeue, bounded only by the size of the cluster. In order to support linear scalability for producers, the queue needs to be backed by a storage system that scales linearly with the number of concurrent writers. For consumers to be linearly scalable, the queue can be partitioned such that each consumer only processes a subset of queue data.

Another aspect of queue scalability is that it should scale horizontally. This means the upper bound of the queue performance can be increased by adding more nodes to the cluster. It is important because it makes sure that the queue can keep working regardless of cluster size and can keep up with the growth in data volume.

Partitioned HBase Queue

We chose Apache HBase as the storage layer for the queue. It is designed and optimized for strong row-level consistency and horizontal scalability. It provides very good concurrent write performance, and its support for ordered scans fits well with partitioned consumers. We use HBase Coprocessors for efficient scan filtering and queue cleanup. In order to have exactly-once semantics on the queue, we use Tephra’s transaction support for HBase.

Producers and consumers operate independently. Each producer enqueues by performing batch HBase Puts and each consumer dequeues by performing HBase Scans. There is no link between the number of producers and consumers and they can scale separately.
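In client API terms, the pattern is roughly the following sketch (names and schema are illustrative, not CDAP's actual code): producers batch their writes into a single Put call, and consumers scan only their own partition of the table.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Rough sketch of the client-side pattern only: enqueue with a batch of Puts,
// dequeue with a prefix Scan over the consumer's partition.
public class QueueClientSketch {
    private static final byte[] FAMILY = Bytes.toBytes("q");
    private static final byte[] DATA = Bytes.toBytes("data");

    static void enqueue(Table table, List<byte[]> rowKeys, List<byte[]> payloads) throws Exception {
        List<Put> puts = new ArrayList<>();
        for (int i = 0; i < rowKeys.size(); i++) {
            puts.add(new Put(rowKeys.get(i)).addColumn(FAMILY, DATA, payloads.get(i)));
        }
        table.put(puts); // one batched round trip per group of events
    }

    static ResultScanner dequeue(Table table, byte[] partitionIdPrefix) throws Exception {
        Scan scan = new Scan().setRowPrefixFilter(partitionIdPrefix); // read only this consumer's rows
        return table.getScanner(scan);
    }
}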

The queue has a notion of consumer groups. A consumer group is a collection of consumers partitioned by the same key such that each event published to the queue is consumed by exactly one consumer within the group. The use of consumer groups allows you to partition the same queue with different keys and to scale independently based on the operational characteristics of the data. Using the access log analytics example above, the producer and consumer groups might look like this:


There are two producers running for the Log Parser and they are writing to the queue concurrently. On the consuming side, there are two consumer groups. The Unique User Counter group has two consumers, using UserID as the partitioning key and the Page View Counter group contains three consumers, using PageID as the partitioning key.

Queue rowkey format

Since an event emitted by a producer can be consumed by one or more consumer groups, we write the event to one or more rows in an HBase table, with one row designated for each consumer group. The event payload and metadata are stored in separate columns, while the row key follows this format:


The two interesting parts of the row key are the Partition ID and the Entry ID. The Partition ID determines the row key prefix for a given consumer. This allows a consumer to read only the data it needs to process, using a prefix scan on the table during dequeue. The Partition ID consists of two parts: a Consumer Group ID and a Consumer ID. The producer computes one partition ID per consumer group and writes to those rows on enqueue.

The Entry ID in the row key contains the transaction information. It consists of the producer transaction write pointer issued by Tephra and a monotonically increasing counter. The counter is generated locally by the producer and is needed to make the row key unique for each event, since a producer can enqueue more than one event within the same transaction.

On dequeue, the consumer uses the transaction write pointer to determine whether a queue entry has been committed and hence can be consumed. The row key is always unique because of the inclusion of the transaction write pointer and counter, so producers operate independently and never have write conflicts.
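Putting the pieces described above together, the row key assembly might look roughly like this. This is illustrative only; the real layout (shown in the omitted figure) may include additional parts such as the queue name:

import org.apache.hadoop.hbase.util.Bytes;

// Sketch: Partition ID (consumer group + consumer) followed by
// Entry ID (transaction write pointer + local counter).
public class QueueRowKeySketch {
    static byte[] rowKey(long consumerGroupId, int consumerId, long txWritePointer, int counter) {
        byte[] partitionId = Bytes.add(Bytes.toBytes(consumerGroupId), Bytes.toBytes(consumerId));
        byte[] entryId = Bytes.add(Bytes.toBytes(txWritePointer), Bytes.toBytes(counter));
        return Bytes.add(partitionId, entryId);
    }
}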

In order to generate the Partition ID, a producer needs to know the size and the partitioning key of each consumer group. The consumer groups information is recorded transactionally when the application starts as well as when there are any changes in group size.

Changing producers and consumers

It is straightforward to increase or decrease the number of producers, since each producer operates independently: adding or removing producer processes will do the job. However, when the size of a consumer group needs to change, coordination is needed to update the consumer group information correctly. The steps can be summarized by this diagram:


Pausing and resuming are fast operations as they are coordinated using Apache ZooKeeper and executed in parallel. For example, with the web access log analytics application we mentioned above, changing the consumer groups information may look something like this:


With this queue design, the enqueue and dequeue performance is on par with batch HBase Puts and HBase Scans respectively, with some overhead for talking to the Tephra server. That overhead can be greatly reduced by batching multiple events in the same transaction.

Finally, to prevent "hotspotting", we pre-split the HBase table based on the cluster size and apply salting on the row key to better distribute writes that would otherwise be sequential due to the monotonically increasing transaction write pointer.
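A sketch of what salting plus pre-splitting might look like with the standard Admin API (illustrative only, not CDAP's actual code; table and family names are ours):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

// Pre-split the queue table into N buckets on a one-byte salt, then prepend
// salt = hash(row key) % N so that writes spread across all the splits.
public class SaltedQueueTable {
    static void createQueueTable(Admin admin, int numSplits) throws Exception {
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("queue"));
        desc.addFamily(new HColumnDescriptor("q"));
        byte[][] splitKeys = new byte[numSplits - 1][];
        for (int i = 1; i < numSplits; i++) {
            splitKeys[i - 1] = new byte[] { (byte) i }; // region boundaries on the salt byte
        }
        admin.createTable(desc, splitKeys);
    }

    static byte[] saltedKey(byte[] rowKey, int numSplits) {
        byte salt = (byte) ((Bytes.hashCode(rowKey) & 0x7fffffff) % numSplits);
        return Bytes.add(new byte[] { salt }, rowKey);
    }
}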

Performance Numbers

We’ve tested the performance on a small ten-node HBase cluster and the results are impressive. Using a 1 KB payload with a batch size of 500 events, we achieved a throughput of 100K events per second produced and consumed, running with three producers and ten consumers. We also observed that throughput increases linearly as we add more producers and consumers: for example, it increased to 200K events per second when we doubled the number of producers and consumers.

With the help of HBase and a combination of best practices, we successfully built a linearly scalable, distributed transactional queue system and used it to provide a real time stream processing framework in CDAP: dynamically scalable, strongly consistent, and with an exactly-once delivery guarantee.


