Apache HBase

Thursday March 09, 2017

Offheap Read-Path in Production - The Alibaba story

By Yu Li (HBase Committer/Alibaba), Yu Sun (Alibaba), Anoop Sam John (HBase PMC/Intel), and Ramkrishna S Vasudevan (HBase PMC/Intel)

Introduction

HBase is the core storage system in Alibaba's Search Infrastructure. Critical e-commerce data about products, sellers, promotions, etc. is synced into HBase from various online databases. We query HBase to build and provide real-time updates on the search index. In addition, user behavior data, such as impressions, clicks and transactions, is also streamed into HBase. It serves as feature data for our online machine learning system, which optimizes the personalized search results in real time. The whole system produces a mixed workload on HBase that includes bulkload/snapshot for full index building, batch mutation for real-time index updates, and streaming/continuous queries for online machine learning. Our biggest HBase cluster has reached more than 1500 nodes and 200,000 regions. It routinely serves tens of millions of queries per second.


Both latency and throughput are important for our HBase deployment. From the latency perspective, it directly affects how quickly users can search an item after it has been posted, as well as how 'real-time' we can run our inventory accounting. From the throughput perspective, it determines how fast the machine learning programs can process data, and thus the accuracy of the recommendations made. What's more, since data is distributed through the cluster and accesses are balanced, applications are sensitive to latency spikes on a single node, which makes GC a critical factor in our system's servicing capability.


By caching more data in memory, the read latency (and throughput) can be greatly improved. If we can get our data from local cache, we save having to make a trip to HDFS. Apache HBase has two layers of data caching. There is what we call “L1” caching, our first caching tier – which caches data in an on heap Least Recently Used (LRU) cache -- and then there is an optional, “L2” second cache tier (aka Bucket Cache).


Bucket Cache can be configured to keep its data in a file -- i.e. caching data in a local file on disk -- or in memory. File mode usually is able to cache more data, but there is more attendant latency reading from a file vs reading from memory. Bucket Cache can also be configured to use memory outside of the Java heap space ('offheap'), so users generally configure a large L2 cache with offheap memory along with a smaller onheap L1 cache.
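
For illustration, here is a minimal hbase-site.xml sketch of the two modes (the path and size below are placeholders for illustration, not recommendations):

<property>
  <name>hbase.bucketcache.ioengine</name>
  <!-- 'offheap' keeps the cache in memory outside the Java heap;
       'file:/path/to/bucketcache.data' would back it with a local file instead -->
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- cache size in MB -->
  <value>12288</value>
</property>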


At Alibaba we use an offheap L2 cache, dedicating 12GB to Bucket Cache on each node. We also backported a patch currently in the master branch only (to be shipped in the coming hbase-2.0.0) which makes the HBase read path run offheap end-to-end. This combination improved our average throughput significantly. In the below sections, we'll first talk about why the off-heaping has to be end-to-end, then introduce how we backported the feature from the master branch to our customized 1.1.2, and finally show the performance with the end-to-end read-path offheap in an A/B test and on Singles' Day (11/11/2016).


Necessity of End-to-end Off-heaping

Before offheap, the QPS curve from our A/B test cluster looked like this:


[Figure: Throughput without offheap (A/B testing, 450 nodes)]


We could see that there were dips in average throughput. Concurrently, the average latency would be high during these times.


Checking RegionServer logs, we could see that there were long GC pauses happening. Further analysis indicated that when disk IO is fast enough, as on PCIe-SSD, blocks would be evicted from cache quite frequently even when there was a high cache hit ratio. The eviction rate was so high that the GC couldn't keep up, bringing on frequent long GC pauses that impacted throughput.


Looking to improve throughput, we tried the existing Bucket Cache in 1.1.2 but found GC was still heavy. In other words, although Bucket Cache in branch-1 (the branch for current stable releases) already supports using offheap memory for Bucket Cache, it tends to generate lots of garbage. To understand why end-to-end off-heaping is necessary, let's see how reads from Bucket Cache work in branch-1. But before we do this, let's understand how the Bucket Cache itself is organized.


The allocated offheap memory is reserved as DirectByteBuffers, each of size 4 MB. So we can say that physically the entire memory area is split into many buffers, each of size 4 MB. Now on top of this physical layout, we impose a logical division. Each logical area is sized to accommodate HFile blocks of a particular size (remember that HFiles are read as blocks, and it is block by block that data gets cached in the L1 or L2 cache). The logical splits accommodate HFile blocks from 4 KB to 512 KB (this is the default; sizes are configurable). In each of the splits, there is more than one slot into which we can insert a block. When caching, we find an appropriately sized split and then an empty slot within it, and there we insert the block. Remember all slots are offheap. For more details on Bucket Cache, refer here [4]. Refer to the HBase Reference Guide [5] for how to set up Bucket Cache.
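
To make the layout concrete, here is an illustrative walk-through (the exact split sizes are version dependent and configurable via hbase.bucketcache.bucket.sizes, so treat the numbers as assumptions): a data block from an HFile written with the default 64 KB block size is cached into the logical split whose slots are just big enough to hold it, roughly a 65 KB slot, since each split leaves a little headroom above its nominal block size. A 4 MB physical buffer assigned to that split holds around 63 such slots, and the block is written into one free slot, possibly straddling the boundary between two adjacent 4 MB buffers. A block from a table configured with a 128 KB block size would go to the next larger split instead, leaving the unused remainder of its slot as internal fragmentation.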


In branch-1, when a read happens out of the L2 cache, we have to copy the entire block into a temporary onheap area. This is because the HBase read path assumes block data is backed by an onheap byte array. Also, per the above-mentioned physical and logical split, there is a chance that one HFile block's data is spread across two physical ByteBuffers.


When a random row read happens in our system, even if the data is available in L2 cache, we will end up reading the entire block -- usually ~64k in size -- into a temporary onheap allocation for every row read. This creates lots of garbage (and please note that without the HBASE-14463 fix, this copy from offheap to onheap reduced read performance a lot). Our read workload is so high that this copy produces lots of GCs, so we had to find a way to avoid the need of copying block data from offheap cache into temp onheap arrays.

How was it achieved? - Our Story

The HBASE-11425 Cell/DBB end-to-end read-path work in the master branch avoids the need to copy offheap block data back onheap when reading. The entire read path is changed to work directly on the offheap Bucket Cache area and serve data from there straight to clients (see the details of this work and its performance improvements in [1] and [2]). So we decided to try this work in our custom HBase version based on 1.1.2, backporting it from the master branch.


The backport cost us about two person-months, including getting familiar with and analyzing the JIRAs to port, fixing UT failures, fixing problems found in functional testing (HBASE-16609/16704), and resolving compatibility issues (HBASE-16626). We have listed the full to-backport JIRA list here [3]; please refer to it for more details if interested.


Regarding configuration: since tables of different applications use different block sizes -- from 4KB to 512KB -- the default bucket splits just worked for our use case. We also kept the default values for the other configurations after careful testing and tuning in production. Our configs are listed below:


Alibaba’s Bucket Cache related configuration

<property>
  <name>hbase.bucketcache.combinedcache.enabled</name>
  <value>true</value>
</property>

<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>

<property>
  <name>hbase.bucketcache.size</name>
  <value>12288</value>
</property>

<property>
  <name>hbase.bucketcache.writer.queuelength</name>
  <value>64</value>
</property>

<property>
  <name>hbase.bucketcache.writer.threads</name>
  <value>3</value>
</property>


How it works? - A/B Test and Singles’ Day

We tested the performance on our A/B test cluster (450 physical machines, each with 256 GB memory and 64 cores) after backporting, and got better throughput as illustrated below:

[Figure: Throughput with offheap (A/B testing, 450 nodes)]


Note that the average throughput curve is now much smoother, and there are no more dips in throughput over time.


The version with the offheap read path feature was released on October 10th and has been online ever since (more than 4 months). Together with the NettyRpcServer patch (HBASE-15756), we successfully made it through our 2016 Singles’ Day, with peaks at 100K QPS on a single RS.





[1] https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in

[2] http://www.slideshare.net/HBaseCon/offheaping-the-apache-hbase-read-path

[3] https://issues.apache.org/jira/browse/HBASE-17138

[4] https://issues.apache.org/jira/secure/attachment/12562209/Introduction%20of%20Bucket%20Cache.pdf

[5] http://hbase.apache.org/book.html#offheap.blockcache

Thursday December 29, 2016

HGraphDB: Apache HBase As An Apache TinkerPop Graph Database

Robert Yokota is a Software Engineer at Yammer.

An earlier version of this post was published here on Robert's blog.

Be sure to also check out the excellent follow on post Graph Analytics on HBase with HGraphDB and Giraph.


HGraphDB: Apache HBase As An Apache TinkerPop Graph Database

The use of graph databases is common among social networking companies. A social network can easily be represented as a graph model, so a graph database is a natural fit. For instance, Facebook has a graph database called Tao, Twitter has FlockDB, and Pinterest has Zen. At Yammer, an enterprise social network, we rely on Apache HBase for much of our messaging infrastructure, so I decided to see if HBase could also be used for some graph modelling and analysis.

Below I put together a wish list of what I wanted to see in a graph database.

  • It should be implemented directly on top of HBase.
  • It should support the TinkerPop 3 API.
  • It should allow the user to supply IDs for both vertices and edges.
  • It should allow user-supplied IDs to be either strings or numbers.
  • It should allow property values to be of arbitrary type, including maps, arrays, and serializable objects.
  • It should support indexing vertices by label and property.
  • It should support indexing edges by label and property, specific to a given vertex.
  • It should support range queries and pagination with both vertex indices and edge indices.

I did not find a graph database that met all of the above criteria. For instance, Titan is a graph database that supports the TinkerPop API, but it is not implemented directly on HBase. Rather, it is implemented on top of an abstraction layer that can be integrated with Apache HBase, Apache Cassandra, or Berkeley DB as its underlying store. Also, Titan does not support user-supplied IDs. Apache S2Graph Incubating is a graph database that is implemented directly on HBase, and it supports both user-supplied IDs and indices on edges, but it does not yet support the TinkerPop API nor does it support indices on vertices.

This led me to create HGraphDB, a TinkerPop 3 layer for HBase. It provides support for all of the above bullet points. Feel free to try it out if you are interested in using HBase as a graph database.

Tuesday May 24, 2016

Why We Use HBase: Recap so far

Just a quick note: If you haven't seen them yet you should check out the first four entries in our "Why We Use HBase" series. More guest posts in the series on deck soon.

  1. Scalable Distributed Transactional Queues on Apache HBase

  2. Medium Data and Universal Data Systems (great read!)

  3. Imgur Notifications: From MySQL to HBase

  4. Investing In Big Data: Apache HBase

 - Andrew


Tuning G1GC For Your HBase Cluster

Graham Baecher is a Senior Software Engineer on HubSpot's infrastructure team and Eric Abbott is a Staff Software Engineer at HubSpot.

An earlier version of this post was published here on the HubSpot Product and Engineering blog.


Tuning G1GC For Your HBase Cluster

HBase is the big data store of choice for engineering at HubSpot. It’s a complicated data store with a multitude of levers and knobs that can be adjusted to tune performance. We’ve put a lot of effort into optimizing the performance and stability of our HBase clusters, and recently discovered that suboptimal G1GC tuning was playing a big part in issues we were seeing, especially with stability.

Each of our HBase clusters is made up of 6 to 40 AWS d2.8xlarge instances serving terabytes of data. Individual instances handle sustained loads over 50k ops/sec with peaks well beyond that. This post will cover the efforts undertaken to iron out G1GC-related performance issues in these clusters. If you haven't already, we suggest getting familiar with the characteristics, quirks, and terminology of G1GC first.

We first discovered that G1GC might be a source of pain while investigating frequent

“...FSHLog: Slow sync cost: ### ms...”

messages in our RegionServer logs. Their occurrence correlated very closely to GC pauses, so we dug further into RegionServer GC. We discovered three issues related to GC:

  • One cluster was losing nodes regularly due to long GC pauses.
  • The overall GC time was between 15-25% during peak hours.
  • Individual GC events were frequently above 500ms, with many over 1s.

Below are the 7 tuning iterations we tried in order to solve these issues, and how each one played out. As a result, we developed a step-by-step summary for tuning HBase clusters.

Original GC Tuning State

The original JVM tuning was based on an Intel blog post, and over time morphed into the following configuration just prior to our major tuning effort.

-Xmx32g -Xms32g

32 GB heap, initial and max should be the same

-XX:G1NewSizePercent= 3-9

Minimum size for Eden each epoch, differs by cluster

-XX:MaxGCPauseMillis=50

Optimistic target, most clusters take 100+ ms

-XX:-OmitStackTraceInFastThrow

Better stack traces in some circumstances, traded for a bit more CPU usage

-XX:+ParallelRefProcEnabled

Helps keep a lid on reference processing time issues we were seeing

-XX:+PerfDisableSharedMem

Alleged to protect against a bizarre Linux issue

-XX:-ResizePLAB

Alleged to save some CPU cycles in between GC epochs

GC logging verbosity as shown below was cranked up to a high enough level of detail for our homegrown gc_log_visualizer script. The majority of graphs in this document were created with gc_log_visualizer, while others were snapshotted from SignalFX data collected through our CollectD GC metrics plugin.

Our GC logging params:

-verbosegc
-XX:+PrintGC
-XX:+PrintGCDateStamps
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintTenuringDistribution

Starting Point: Heap Sizes and Overall Time Spent in GC

With the highly detailed GC logs came the following chart of heap state. Eden size is in red and stays put at its minimum (G1NewSizePercent), 9% of total heap. Tenured size, or working set + waste, floats in a narrow band between 18 and 20 GB. With Eden a flat line, the total heap line will mirror the Tenured line, just 9% higher.

The black horizontal line just under 15GB marks the InitiatingHeapOccupancyPercent (aka “IHOP”), at its default setting of 45%. The purple squares are the amount of Tenured space reclaimable at the start of a mixed GC event. The floor of the band of purple squares is 5% of heap, the value of G1HeapWastePercent.

The next graph shows a red “+” on each minute boundary and stands for the total percent of wall time the JVM was in STW and doing no useful work. The overall time spent in GC for this HBase instance for this time period is 15-25%. For reference, an application tier server spending 20%+ time in GC is considered stuck in “GC Hell” and in desperate need of tuning.

Tuning #1 Goal: Lower Total Time in GC - Action: Raise IHOP

One aspect that stands out clearly in the previous graph of heap sizes is that the working set is well above the IHOP. Tenured being higher than IHOP generally results in an overabundance of MPCMC (multi-phase concurrent marking cycle) runs (wasting CPU) and consequently an overabundance of Mixed GC cycles, resulting in a higher ratio of expensive GC events vs cheap GC events. By moving IHOP a bit higher than Tenured, we expect fewer Mixed GC events to reclaim larger amounts of space, which should translate to less overall time spent in STW.

Raising the IHOP value on an HBase instance, the following before/after (above/below) graphs show that indeed the frequency of Mixed GC events drops dramatically while the reclaimable amount rises.

Considering that at least half of Mixed GC events on this instance took 200-400ms, we expected the reduced amount of Mixed GC events to outweigh any increase in individual Mixed GC times, such that overall GC time would drop. That expectation held true, as overall time spent in GC dropped from 4-12% to 1-8%.

The following graphs show before/after on the STW times for all Mixed GC events. Note the drastic drop in frequency while individual STW times don't seem to change.

Result: Success

This test was considered successful. We made the change across all HBase clusters to use a significantly larger IHOP value than the default of 45%.
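
In JVM-argument terms this is a single flag; we don't state the exact value each cluster ended up with here, so the number below is purely illustrative:

-XX:InitiatingHeapOccupancyPercent=70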

Tuning #2 Goal: Lower Total Time in GC - Action: Increase Eden

Fixing the IHOP value to be higher than working set was basically fixing a misconfiguration. There was very little doubt that nonstop MPCMC + Mixed GC events was an inefficient behavior. Increasing Eden size, on the other hand, had a real possibility of increasing all STW times, both Minor and Mixed. GC times are driven by the amount of data being copied (surviving) from one epoch to the next, and databases like HBase are expected to have very large caches. A 10+ GB cache could very well have high churn and thus high object survival rates.

The effective Eden size for our HBase clusters is driven by the minimum Eden value G1NewSizePercent because the MaxGCPauseMillis target of 50ms is never met.

For this test, we raised Eden from 9% to 20% through G1NewSizePercent.

Effects on Overall GC Time

Looking at the following graphs, we see that overall time spent in GC may have dropped a little for this one hour time window from one day to the next.

Individual STW times

Looking at the STW times for just the Minor GC events there is a noticeable jump in the floor of STW times.

To-space Exhaustion Danger

As mentioned in the G1GC Foundational blog post, G1ReservePercent is ignored when the minimum end of the Eden range is used. The working set on a few of our HBase clusters is in the 70-75% range, which combined with a min Eden of 20% would leave only 5-10% of heap free for emergency circumstances. The downside of running out of free space, thus triggering To-space Exhaustion, is a 20+ sec GC pause and the effective death of the HBase instance. While the instance would recover, the other HBase instances in the cluster would consider it long dead before the GC pause completed.

Result: Failure

The overall time spent in GC did drop a little, as theoretically expected; unfortunately, the low end of Minor GC stop-the-world times increased by a similar percentage. In addition, the risk of To-space exhaustion increased. The approach of increasing G1NewSizePercent to reduce overall time spent in GC didn't look promising and was not rolled out to any clusters.

Tuning #3 Goal: Reduce Individual STW Durations - Action: SurvivorRatio and MaxTenuringThreshold

In the previous tuning approach, we found that STW times increased as Eden size increased. We took some time to dig a little deeper into Survivor space to determine if there was any To-space overflow or if objects could be promoted faster to reduce the amount of object copying being done. To collect the Survivor space tenuring distribution data in the GC logs we enabled PrintTenuringDistribution and restarted a few select instances.

To-space overflow is the phenomenon where the Survivor To space isn't large enough to fit all the live data in Eden at the end of an epoch. Any data collected after Survivor To is full is added to Free regions, which are then added to Tenured. If this overflow is transient data, putting it in Tenured is inefficient as it will be expensive to clean out. If that was the case, we'd need to increase SurvivorRatio.

On the other hand, consider a use case where any object that survives two epochs will also survive ten epochs. In that case, by ensuring that any object that survives a second epoch is immediately promoted to Tenured, we would see a performance improvement since we wouldn’t be copying it around in the GC events that followed.

Here’s some data collected from the PrintTenuringDistribution parameter:

Desired survivor size 192937984 bytes, new threshold 2 (max 15)
- age 1: 152368032 bytes, 152368032 total
- age 2: 147385840 bytes, 299753872 total
[Eden: 2656.0M(2656.0M)->0.0B(2624.0M) Survivors: 288.0M->320.0M Heap: 25.5G(32.0G)->23.1G(32.0G)]

An Eden size of 2656 MB with SurvivorRatio = 8 (default) yields a 2656/8 = 332 MB survivor space. In the example entries we see enough room to hold two ages of survivor objects. The second age here is 5mb smaller than the first age, indicating that in the interval between GC events, only 5/152 = 3.3% of the data was transient. We can reasonably assume the other 97% of the data is some kind of caching. By setting MaxTenuringThreshold = 1, we optimize for the 97% of cached data to be promoted to Tenured after surviving its second epoch and hopefully shave a few ms of object copy time off each GC event.

Result: Theoretical Success

Unfortunately we don't have any nice graphs available to show these effects in isolation. We consider the theory sound and rolled out MaxTenuringThreshold = 1 to all our clusters.
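
In JVM-argument terms, the change rolled out here is just the following flag (SurvivorRatio stayed at its default of 8):

-XX:MaxTenuringThreshold=1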

Tuning #4 Goal: Eliminate Long STW Events - Action: G1MixedGCCountTarget & G1HeapWastePercent

Next, we wanted to see what we could do about eliminating the high end of Mixed GC pauses. Looking at a 5 minute interval of our Mixed GC STW times, we saw a pattern of sharply increasing pauses across each cycle of 6 mixed GCs:

That in and of itself should not be considered unusual; after all, that behavior is how the G1GC algorithm got its name. Each Mixed GC event will evacuate 1 / G1MixedGCCountTarget of the high-waste regions (regions with the most non-live data). Since it prioritizes regions with the most garbage, each successive Mixed GC event will be evicting regions with more and more live objects. The chart shows the performance effects of clearing out regions with more and more live data: the Mixed event times start at 100ms at the beginning of a mixed GC cycle and range upwards past 600ms by the end.

In our case, we were seeing occasional pauses at the end of some cycles that were several seconds long. Even though they were rare enough that our average pause time was reasonable, pauses that long are still a serious concern.

Two levers in combination can be used together to lessen the “scraping the bottom of the barrel” effect of cleaning up regions with a lot of live data:

G1HeapWastePercent: default (5) → 10. Allow twice as much wasted space in Tenured. Having 5% waste resulted in 6 of the 8 potential Mixed GC events occurring in each Mixed GC cycle. Bumping to 10% waste should chop off 1-2 more of the most expensive events of the Mixed GC cycle.

G1MixedGCCountTarget: default (8) → 16. Double the target number of Mixed GC events each Mixed GC cycle, but halve the work done by each GC event. Though it’s an increase to the number of GC events that are Mixed GC events, STW times of individual Mixed events should drop noticeably.

In combination, we expected doubling the target count to drop the average Mixed GC time, and increasing the allowable waste to eliminate the most expensive Mixed GC time. There should be some synergy, as more heap waste should also mean regions are on average slightly less live when collected.

Waste heap values of 10% and 15% were both examined in a test environment. (Don’t be alarmed by the high average pause times - this test environment was running under heavy load, on less capable hardware than our production servers.)

Above: 10% heap waste; below: 15% heap waste:

The results are very similar. 15% performed slightly better, but in the end we decided that 15% waste was more than necessary. 10% was enough to clear out the "scraping the bottom of the barrel" effect such that the 1+ sec Mixed GC STW times all but disappeared in production.

Result: Success

Doubling G1MixedGCCountTarget from 8 to 16 and G1HeapWastePercent from 5 to 10 succeeded in eliminating the frequent 1s+ Mixed GC STW times. We kept these changes and rolled them out across all our clusters.
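
In JVM-argument terms, the two settings we kept are:

-XX:G1HeapWastePercent=10
-XX:G1MixedGCCountTarget=16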

Tuning #5 Goal: Stop Losing Nodes: Heap Size and HBase Configs

While running load tests to gauge the effects of the parameters above, we also began to dig into what looked like evidence of memory leaks in a production cluster. In the following graph we see the heap usage slowly grow over time until To-space Exhaustion, triggering a Full GC with a long enough pause to get the HBase instance booted from the cluster and killed:

We've got several HBase clusters, and only one cluster occasionally showed this type of behavior. If this issue were a memory leak, we'd expect the issue to arise more broadly, so it looks like HBase is using more memory in this cluster than we expected. To understand why, we looked into the heap breakdown of our RegionServers. The vast majority of an HBase RegionServer’s Tenured space is allocated to three main areas:

  • Memstore: region server’s write cache; default configuration caps this at 40% of heap.
  • Block Cache: region server’s read cache; default config caps at 40% of heap.
  • Overhead: the vast majority of HBase’s in-memory overhead is contained in a “static index”. The size of this index isn’t capped or configurable, but HBase assumes this won’t be an issue since the combined cap for memstore and block cache can’t exceed 80%.

We have metrics for the size of each of these, from the RegionServer’s JMX stats: “memStoreSize,” “blockCacheSize,” and “staticIndexSize.” The stats from our clusters show that HBase will use basically all of the block cache capacity you give it, but memstore and static index sizes depend on cluster usage and tables. Memstore fluctuates over time, while static index size depends on the RegionServer’s StoreFile count and average row key size.

It turned out, for the cluster in question, that the HBase caches and overhead combined were actually using more space than our JVM was tuned to handle. Not only were memstore and block cache close to capacity (12 GB block cache, memstore rising past 10 GB), but the static index size was unexpectedly large, at 6 GB. Combined, this put desired Tenured space at 28+ GB, while our IHOP was set at 24 GB, so the upward trend of our Tenured space was just the legitimate memory usage of the RegionServer.

With this in mind, we judged the maximum expected heap use for each cluster’s RegionServers by looking at the cluster maximum memstore size, block cache size, and static index size over the previous month, and assuming max expected usage to be 110% of each value. We then used that number to set the block cache and memstore size caps (hfile.block.cache.size and hbase.regionserver.global.memstore.size) in our HBase configs.

The fourth component of Tenured space is the heap waste, in our case 10% of the heap size. We could now confidently tune our IHOP threshold by summing the max expected memstore, block cache, static index size, 10% heap for heap waste, and finally 10% more heap as a buffer to avoid constant mixed GCs when working set is maxed (as described in Tuning #1).

However, before we went ahead and blindly set this value, we had to consider the uses of heap other than Tenured space. Eden requires a certain amount of heap (determined by G1NewSizePercent), and a certain amount (default 10%) is Reserved free space. IHOP + Eden + Reserved must be ≤ 100% in order for a tuning to make sense; in cases where our now-precisely-calculated IHOP was too large for this to be possible, we had to expand our RegionServer heaps. To determine minimum acceptable heap size, assuming 10% Reserved space, we used this formula:

Heap ≥ (M + B + O + E) ÷ 0.7

    • M = max expected memstore size
    • B = max expected block cache size
    • O = max expected overhead (static index)
    • E = minimum Eden size

When those four components add up to ≤ 70% of the heap, then there will be enough room for 10% Reserved space, 10% heap waste, and 10% buffer between max working set and IHOP.
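
As a worked example with hypothetical caps (not our actual numbers): say the scaled maxima come out to an 11 GB memstore cap, a 13 GB block cache cap and a 7 GB static index cap, and Eden is pinned at about 4 GB. Then Heap ≥ (11 + 13 + 7 + 4) ÷ 0.7 = 50 GB, so a 50 GB (or larger) heap is needed to leave room for the 10% Reserved space, 10% heap waste, and 10% buffer.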

Result: Success

We reviewed memory usage of each of our clusters and calculated correct heap sizes and IHOP thresholds for each. Rolling out these changes immediately ended the To-space Exhaustion events we were seeing on the problematic cluster.

Tuning #6 Goal: Eliminate Long STW Events - Action: Increase G1 Region Size

We’d rolled out HBase block cache & memstore config changes, changes to G1HeapWastePercent and G1MixedGCCountTarget, and an increase in heap size on a couple clusters (32 GB → 40+ GB) to accommodate larger IHOP. In general things were smooth, but there were still occasional Mixed GC events taking longer than we were comfortable with, especially on the clusters whose heap had increased. Using gc_log_visualizer, we looked into what phase of Mixed GC was the most volatile and noticed that Scan RS times correlated:

A few Google searches indicated that Scan RS time output in the GC logs is the time taken examining all the regions referencing the tenured regions being collected. In our most recent tuning changes, heap size had been bumped up, however the G1HeapRegionSize remained fixed at 16 MB. Increasing the G1HeapRegionSize to 32 MB eliminated those high scan times:

Result: Success

Halving the G1 region count cleared out the high volatility in Scan RS times. According to G1GC documentation, the ideal region count is 2048, so 16 MB regions were perfect for a 32 GB heap. However, this tuning case led us to believe that for HBase heaps without a clear choice of region size, in our case 40+ GB, it’s much better to err on the side of fewer, larger regions.
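
In JVM-argument terms, the change for these larger heaps was:

-XX:G1HeapRegionSize=32m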

Tuning #7 Goal: Preventing RegionServer To-space Exhaustion - Action: Extra Heap as Buffer

At this point, our RegionServers were tuned and running much shorter and less frequent GC events. IHOP rested above Tenured while Tenured + Eden remained under the target of 90% of total heap. Yet once in a while, a RegionServer would still die from a To-space exhaustion triggered Full GC event, as shown in the following graph.

It looks like we did everything right—there's lots of reclaimable space and Tenured space drops well below IHOP with each Mixed GC. But right at the end, heap usage spikes up and we hit To-space Exhaustion. And while it's likely the HBase client whose requests caused this problem could be improved to avoid this*, we can't rely on our various HBase clients to behave perfectly all the time.

In the scenario above, very bursty traffic caused Tenured space to fill up the heap before the MPCMC could complete and enable a Mixed GC run. To tune around this, we simply added heap space**, while adjusting IHOP and G1NewSizePercent down to keep them at the same GB values they had been at before. By doing this we increased the buffer of free space in the heap above our original 10% default, for additional protection against spikes in memory usage.

Result: Success

Increasing heap buffer space on clusters whose HBase clients are known to be occasionally bursty has all but eliminated Full GC events on our RegionServers.

Notes:

* Block cache churn correlates very closely with time spent in Mixed GC events on our clusters (see chart below). A high volume of Get and Scan requests with caching enabled unnecessarily (e.g. requested data isn’t expected to be in cache and isn’t expected to be requested again soon) will increase cache churn as data is evicted from cache to make room for the Get/Scan results. This will raise the RegionServer’s time in GC and could contribute to instability as described in this section.

Chart: % time in Mixed GC is in yellow (left axis); MB/sec cache churn is in blue (right axis):

** Another potential way to tune around this issue is by increasing ConcGCThreads (default is 1/4 * ParallelGCThreads). ConcGCThreads is the number of threads used to do the MPCMC, so increasing it could mean the MPCMC finishes sooner and the RegionServer can start a Mixed GC before Tenured space fills the heap. At HubSpot we’ve been satisfied with the results of increasing our heap size and haven’t tried experimenting with this value.

Overall Results: Goals Achieved!

After these cycles of debugging and tuning G1GC for our HBase clusters, we’ve improved performance in all the areas we were seeing problems originally:

  • Stability: no more To-space Exhaustion events causing Full GCs.
  • 99th percentile performance: greatly reduced frequency of long STW times.
  • Avg. performance: overall time spent in GC STW significantly reduced.

Summary: How to Tune Your HBase Cluster

Based on our findings, here’s how we recommend you tune G1GC for your HBase cluster(s):

Before you start: GC and HBase monitoring

  • Track block cache, memstore and static index size metrics for your clusters in whatever tool you use for charts and monitoring, if you don’t already. These are found in the RegionServer JMX metrics:
    • memStoreSize
    • blockCacheSize
    • staticIndexSize

  • You can use our collectd plugin to track G1GC performance over time, and our gc_log_visualizer for insight on specific GC logs. In order to use these you’ll have to log GC details on your RegionServers:
    • -Xloggc:$GC_LOG_PATH
    • -verbosegc
    • -XX:+PrintGC
    • -XX:+PrintGCDateStamps
    • -XX:+PrintAdaptiveSizePolicy
    • -XX:+PrintGCDetails
    • -XX:+PrintGCApplicationStoppedTime
    • -XX:+PrintTenuringDistribution
    • Also recommended is some kind of GC log rotation, e.g.:
      • -XX:+UseGCLogFileRotation
      • -XX:NumberOfGCLogFiles=5
      • -XX:GCLogFileSize=20M

Step 0: Recommended Defaults

  • We recommend the following JVM parameters and values as defaults for your HBase RegionServers (as explained in Original GC Tuning State):
    • -XX:+UseG1GC
    • -XX:+UnlockExperimentalVMOptions
    • -XX:MaxGCPauseMillis=50
      • This value is intentionally very low and doesn’t actually represent a pause time upper bound. We recommend keeping it low to pin Eden to the low end of its range (see Tuning #2).
    • -XX:-OmitStackTraceInFastThrow
    • -XX:ParallelGCThreads=N, where N = 8 + (logical processors - 8) * (5/8)
    • -XX:+ParallelRefProcEnabled
    • -XX:+PerfDisableSharedMem
    • -XX:-ResizePLAB 

Step 1: Determine maximum expected HBase usage

  • As discussed in the Tuning #5 section, before you can properly tune your cluster you need to know your max expected block cache, memstore, and static index usage.
    • Using the RegionServer JMX metrics mentioned above, look for the maximum value of each metric across the cluster:
      • Maximum block cache size.
      • Maximum memstore size.
      • Maximum static index size.
    • Scale each maximum by 110%, to accommodate even a slight increase in max usage. This is your usage cap for that metric: e.g. a 10 GB max recorded memstore → 11 GB memstore cap.
      • Ideally, you’ll have these metrics tracked for the past week or month, and you can find the maximum values over that time. If not, be more generous than 110% when calculating memstore and static index caps. Memstore size especially can vary widely over time. 

Step 2: Set Heap size, IHOP, and Eden size

  • Start with Eden size relatively low: 8% of heap is a good initial value if you’re not sure.
    • -XX:G1NewSizePercent=8
    • See Tuning #2 for more about Eden sizing.
      • Increasing Eden size will increase individual GC pauses, but slightly reduce overall % time spent in GC.
      • Decreasing Eden size will have the opposite effect: shorter pauses, slightly more overall time in GC.

  • Determine necessary heap size, using Eden size and usage caps from Step 1:
    • From Tuning #5: Heap ≥ (M + B + O + E) ÷ 0.7
      • M = memstore cap, GB
      • B = block cache cap, GB
      • O = static index cap, GB
      • E = Eden size, GB
    • Set JVM args for fixed heap size based on the calculated value, e.g:
      • -Xms40960m -Xmx40960m

  • Set IHOP in the JVM, based on usage caps and heap size:
    • IHOP = (memstore cap as % of heap) + (block cache cap as % of heap) + (overhead cap as % of heap) + 20
    • -XX:InitiatingHeapOccupancyPercent=<IHOP>
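
Continuing the hypothetical caps from the Tuning #5 example (11 GB memstore, 13 GB block cache, 7 GB static index on a 50 GB heap): IHOP = 22 + 26 + 14 + 20 = 82, i.e. -XX:InitiatingHeapOccupancyPercent=82. With Eden at 8%, Eden + IHOP comes to exactly the 90% guideline mentioned in Step 5.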

Step 3: Adjust HBase configs based on usage caps

  • Set block cache cap and memstore cap ratios in HBase configs, based on usage caps and total heap size. In hbase-site.xml:
    • hfile.block.cache.size → block cache cap ÷ heap size
    • hbase.regionserver.global.memstore.size → memstore cap ÷ heap size 
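
With the same hypothetical caps on a 50 GB heap (13 ÷ 50 = 0.26 and 11 ÷ 50 = 0.22), the hbase-site.xml entries would look like:

<property>
  <name>hfile.block.cache.size</name>
  <value>0.26</value>
</property>

<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.22</value>
</property>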

Step 4: Adjust additional recommended JVM flags for GC performance

  • From Tuning #3:
    • -XX:MaxTenuringThreshold=1

  • From Tuning #4:
    • -XX:G1HeapWastePercent=10
    • -XX:G1MixedGCCountTarget=16

  • From Tuning #6:
    • -XX:G1HeapRegionSize=#m
    • # must be a power of 2, in range [1..32].
    • Ideally, # is such that: heap size ÷ # MB = 2048 regions.
    • If your heap size doesn’t provide a clear choice for region size, err on the side of fewer, larger regions. Larger regions reduce GC pause time volatility.
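
For example, with the hypothetical 50 GB heap from Step 2: 51200 MB ÷ 2048 = 25 MB, which is not a power of two, so following the advice above you would round up to -XX:G1HeapRegionSize=32m, giving 1600 regions.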

Step 5: Run it!

  •  Restart your RegionServers with these settings and see how they look.

    • Remember that you can adjust Eden size as described above, to optimize either for shorter individual GCs or for less overall time in GC. If you do, make sure to maintain Eden + IHOP ≤ 90%.

    • If your HBase clients can have very bursty traffic, consider adding heap space outside of IHOP and Eden (so that IHOP + Eden adds up to 80%, for example).

      • Remember to update % and ratio configs along with the new heap size.

      • Details and further suggestions about this found in Tuning #7. 


Saturday April 23, 2016

HDFS HSM and HBase: Conclusions (Part 7 of 7)

This is part 7 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)  

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Conclusions

There are many things to consider when choosing the hardware for a cluster. According to the test results, in the SSD-related cases the network traffic between DataNodes exceeds 10 Gbps. If you are using a 10 Gbps switch, the network will be the bottleneck and impact the performance. We suggest either extending the network bandwidth by network bonding, or upgrading to a more powerful switch with higher bandwidth. In cases 1T_HDD and 1T_RAM_HDD, the network traffic is lower than 10 Gbps most of the time, so using a 10 Gbps switch to connect the DataNodes is fine.

In all 1T dataset tests, 1T_RAM_SSD shows the best performance. An appropriate mix of different types of storage can improve HBase write performance. First, write the latency-sensitive and blocking data to faster storage, and write the data that is rarely compacted and accessed to slower storage. Second, avoid mixing types of storage with a large performance gap, such as in 1T_RAM_HDD.

The hardware design issue limits the total disk bandwidth, so eight SSDs show hardly any advantage over four SSDs. Either enhance the hardware by using HBA cards to eliminate this limitation for eight SSDs, or mix the storage types appropriately. According to the test results, in order to achieve a better balance between performance and cost, using four SSDs and four HDDs can achieve good performance (102% throughput and 101% latency of eight SSDs) at a much lower price. The RAMDISK/SSD tiered storage is the winner in both throughput and latency among all the tests, so if cost is not an issue and maximum performance is needed, RAMDISK (an extremely high speed block device, e.g. NVMe PCI-E SSD)/SSD should be chosen.

You should not use a large number of flushers/compactors when most of the data is written to HDD. Reads and writes share the single channel per HDD, and too many flushers and compactors running at the same time can slow down HDD performance.

During the tests, we found some things that can be improved in both HBase and HDFS.

In HBase, the memstore is consumed quickly when the WALs are stored in fast storage; this can lead to regular long GC pauses. It is better to have an offheap memstore for HBase.

In HDFS, each DataNode shares the same lock when creating/finalizing blocks. Any such slow operation in one DataXceiver can block the creating/finalizing of blocks in all other DataXceivers on the same DataNode, no matter what storage they are using. We need to eliminate this blocking across storage types; a finer grained lock mechanism to isolate the operations on different blocks is needed (HDFS-9668). It would also be good to implement a latency-aware VolumeChoosingPolicy in HDFS to remove the slow volumes from the candidates.

RoundRobinVolumeChoosingPolicy can lead to load imbalance in HDFS with tiered storage (HDFS-9608).

In HDFS, renaming a file to a different storage does not actually move the blocks. We need to move the HDFS blocks asynchronously in such a case.

Acknowledgements

The authors would like to thank Weihua Jiang – the former manager of the big data team at Intel – for leading this performance evaluation, and thank Anoop Sam John (Intel), Apekshit Sharma (Cloudera), Jonathan Hsieh (Cloudera), Michael Stack (Cloudera), Ramkrishna S. Vasudevan (Intel), Sean Busbey (Cloudera) and Uma Gangumalla (Intel) for the nice review and guidance.

HDFS HSM and HBase: Issues (Part 6 of 7)

This is part 6 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Issues and Improvements

This section describes the issues we found during the tests.

Software

Long-time BLOCKED threads in DataNode

In the 1T_RAM_HDD test, we can observe substantial periods of zero throughput in the YCSB client. After deep-diving into the thread stacks, we find many DataXceiver threads stuck in a BLOCKED state for a long time in the DataNode. We can observe the same thing in other cases too, but it happens most often in 1T_RAM_HDD.

In each DataNode, there is a single instance of FsDatasetImpl where there are many synchronized methods, DataXceiver threads use this instance to achieve synchronization when creating/finalizing blocks. A slow creating/finalizing operation in one DataXceiver thread can block other creating/finalizing operations in all other DataXceiver threads. The following table shows the time consumed by these operations in 1T_RAM_HDD:

Synchronized methods | Max exec time (ms) in light load | Max wait time (ms) in light load | Max exec time (ms) in heavy load | Max wait time (ms) in heavy load
finalizeBlock | 0.0493 | 0.1397 | 17652.9761 | 21249.0423
addBlock | 0.0089 | 0.4238 | 15685.7575 | 57420.0587
createRbw | 0.0480 | 1.7422 | 21145.8033 | 56918.6570

Table 13. DataXceiver threads and time consumed

We can see that both the execution time and the wait time of the synchronized methods increased dramatically as the system load increased. The time spent waiting for locks can be up to tens of seconds. Slow operations usually come from slow storage such as HDD. They can hurt the concurrent creating/finalizing of blocks on fast storage, so HDFS/HBase cannot make full use of tiered storage.

A finer grained lock mechanism in DataNode is needed to fix this issue. We are working on this improvement now (HDFS-9668).

Load Imbalance in HDFS with Tiered Storage

In the tiered storage cases, we find that the utilization is not the same among volumes of the same storage type when using the policy RoundRobinVolumeChoosingPolicy. The root cause is that RoundRobinVolumeChoosingPolicy uses a shared counter to choose volumes for all storage types. This can be unfair when choosing volumes for a certain storage type, so the volumes at the tail of the configured data directories have a lower chance of being written.

The situation becomes even worse when there are different numbers of volumes of different storage types. We have filed this issue in JIRA (HDFS-9608) and provided a patch.

Asynchronous File Movement Across Storage When Renaming in HDFS

Currently, data blocks in HDFS are stored in different types of storage media according to pre-specified storage policies at creation time. After that, data blocks remain where they are until the external HDFS tool Mover is used. Mover scans the whole namespace and moves the data blocks that are not stored in the right storage media as the policy specifies.

In a tiered storage, when we rename a file/directory from one storage to another different one, we have to move the blocks of that file or all files under that directory to the right storage. This is not currently provided in HDFS.

Non-configurable Storage Type and Policy

Currently in HDFS both storage type and storage policy are predefined in source code. This makes it inconvenient to add implementations for new devices and policies. It is better to make them configurable.

No Optimization for Certain Storage Type

Currently there is no difference in the execution path for different storage types. As more and more high performance storage devices are adopted, the performance gap between storage types will become larger, and the optimization for certain types of storage will be needed.

Take writing certain numbers of data into HDFS as an example. If users want to minimize the total time to write, the optimal way for HDD may be using compression to save disk I/O, while for RAMDISK writing directly is more suitable as it eliminates the overheads of compression. This scenario requires configurations per storage type, but it is not supported in the current implementation.

Hardware

Disk Bandwidth Limitation

In the section 50GB Dataset in a Single Storage, the performance difference between four SSDs and eight SSDs is very small. The root cause is that the total bandwidth available for the eight SSDs is limited by upper level hardware controllers. Figure 16 illustrates the motherboard design. The eight disk slots connect to two different SATA controllers (Ports 0:3 - SATA and Port 0:3 - sSATA). As highlighted in the red rectangle, the maximum bandwidth available for the two controllers in the server is 2 * 6 Gbps = 1536 MB/s.

Figure 16. Hardware design of server motherboard

The maximum throughput for a single disk is measured with the FIO tool.


Storage | Read BW (MB/s) | Write BW (MB/s)
HDD | 140 | 127
SSD | 481 | 447
RAMDISK | >12059 | >11194

Table 14. Maximum throughput of storage media

Note: RAMDISK is memory essentially and it does not go through the same controller as SSDs and HDDs, so it does not have the 2*6Gbps limitation. Data of RAMDISK is listed in the table for convenience of comparison.

According to Table 14, the write bandwidth of eight SSDs is 447 x 8 = 3576 MB/s. This exceeds the controllers' 1536 MB/s physical limitation, so only 1536 MB/s is available for all eight SSDs. HDDs are not affected by this limitation, as their total bandwidth (127 x 8 = 1016 MB/s) is below it. This fact greatly impacts the performance of the storage system.

We suggest one of the following:

  • Enhance hardware by using more HBA cards to eliminate the limitation of the design issue.

  • Use SSD and HDD together with an appropriate ratio (for example four SSDs and four HDDs) to achieve a better balance between performance and cost.

Disk I/O Bandwidth and Latency Varies for Ports

As described in the section Disk Bandwidth Limitation, four of the eight disks connect to Ports 0:3 - SATA and the rest connect to Port 0:3 - sSATA; the total bandwidth of the two controllers is 12 Gbps. We find that this bandwidth is not evenly divided among the disk channels.

We ran a test in which each SSD (sda, sdb, sdc and sdd connect to Port 0:3 - sSATA; sde, sdf, sdg and sdh connect to Ports 0:3 - SATA) is written by an individual FIO process. We expected all eight disks to be written at the same speed, but according to the IOSTAT output the 1536 MB/s bandwidth is not evenly divided between the two controllers and the eight disk channels. As shown in Figure 17, the four SSDs connected to Ports 0:3 - SATA obtain more I/O bandwidth (213.5 MB/s * 4) than the others (107 MB/s * 4).

We suggest that you consider the controller limitation and storage bandwidth when setting up a cluster. Using four SSDs and four HDDs in a node is a reasonable choice, and it is better to install the four SSDs to Port 0:3 - SATA.

Additionally, a disk with higher latency might take the same workload as the disks with lower latency under the existing VolumeChoosingPolicy. This would slow down the performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS.

Figure 17. Different write speed and await time of disks

Go to part 7, Conclusions

HDFS HSM and HBase: Experiment (continued) (Part 5 of 7)

This is part 5 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

1TB Dataset in a Single Storage

The performance for the 1TB dataset on HDD and SSD is shown in Figure 6 and Figure 7. Due to limited memory capacity, the 1TB dataset is not tested on RAMDISK.

Figure 6. YCSB throughput of a single storage type with 1TB dataset

Figure 7. YCSB latency of a single storage type with 1TB dataset

The throughput and latency on SSD are both better than on HDD (134% throughput and 35% latency). This is consistent with the 50GB data test.

The throughput benefit gained by using SSD differs between 50GB and 1TB (from 128% to 134%); SSD gains more benefit in the 1TB test. This is because many more I/O intensive events such as compactions occur in the 1TB dataset test than in the 50GB test, which shows the superiority of SSD in huge data scenarios. Figure 8 shows the changes in network throughput during the tests.

Figure 8. Network throughput measured for case 1T_HDD and 1T_SSD

In the 1T_HDD case the network throughput is lower than 10 Gbps, while in the 1T_SSD case the network throughput can be much larger than 10 Gbps. This means that if we used a 10 Gbps switch in the 1T_SSD case, the network would be the bottleneck.

Figure 9. Disk throughput measured for case 1T_HDD and 1T_SSD

In Figure 9, we can see the bottleneck for these two cases is disk bandwidth.

  • In 1T_HDD, at the beginning of the test the throughput is almost 1000 MB/s, but after a while it drops because regions hit their memstore limits due to slow flushes.

  • In the 1T_SSD case, the throughput seems to be limited by a ceiling of around 1300 MB/s, nearly the same as the bandwidth limitation of the SATA controllers. To further improve throughput, more SATA controllers (e.g. using an HBA card) are needed rather than more SSDs.

During 1T_SSD test, we observe that the operation latencies on eight SSDs per node are very different as shown in the following chart. In Figure 10, we only include latency of two disks, sdb represents disks with a high latency and sdf represents disks with a low latency.

Figure 10. I/O await time measured for different disks

Four of them have better latency than the others. This is caused by the hardware design issue; you can find the details in Disk I/O Bandwidth and Latency Varies for Ports. A disk with higher latency might take the same workload as the disks with lower latency under the existing VolumeChoosingPolicy, which would slow down the performance. We suggest implementing a latency-aware VolumeChoosingPolicy in HDFS.

Performance Estimation for RAMDISK with 1TB Dataset

We cannot measure the performance of RAMDISK with the 1TB dataset due to RAMDISK's limited capacity. Instead we have to estimate its performance by analyzing the results of the HDD and SSD cases.

The performance with the 1TB and 50GB datasets is pretty close on both HDD and SSD.

The throughput difference between 50GB and 1TB dataset for HDD is

|24280.1 / 25003.4 - 1| × 100% = 2.89%

While for SSD the value is

|32514.8 / 32061.6 - 1| × 100% = 1.41%

If we make an average of the above values as the throughput decrease in RAMDISK between 50GB and 1TB dataset, it is around 2.15% ((2.89%+1.41%)/2=2.15%), thus the throughput for RAMDISK with 1T dataset should be

40657.7 × (1 + 2.15%) = 41531.8 (ops/sec)


Figure 11.  YCSB throughput estimation for RAMDISK with 1TB dataset

Please note: the throughput doesn’t drop much in 1 TB dataset cases compared to 50 GB dataset cases because they do not use the same number of pre-split regions. The table is pre-split to 18 regions in 50 GB dataset cases, and it is pre-split to 210 regions in the 1 TB dataset.

Performance for Tiered Storage

In this section, we will study the HBase write performance on tiered storage (i.e. different storage mixed together in one test). This would show what performance it can achieve by mixing fast and slow storage together, and help us to conclude the best balance of storage between performance and cost.

Figure 12 and Figure 13 show the performance for tiered storage. You can find the description of each case in Table 1.

Most of the cases that introduce fast storage have better throughput and latency. With no surprise, 1T_RAM_SSD has the best performance among them. The real surprise is that the throughput of 1T_RAM_HDD is worse than 1T_HDD (-11%) and 1T_RAM_SSD_All_HDD is worse than 1T_SSD_All_HDD (-2%) after introducing RAMDISK, and 1T_SSD is worse than 1T_SSD_HDD (-2%).

Figure 12.  YCSB throughput data for tiered storage

Figure 13.  YCSB latency data for tiered storage

We also investigate how much data is written to different storage types by collecting information from one DataNode.

Figure 14. Distribution of data blocks on each storage of HDFS in one DataNode

As shown in Figure 14, generally, more data is written to disk in the test cases with higher throughput. Fast storage can accelerate flushes and compactions, which leads to more flushes and compactions, and thus more data written to disk. In some RAMDISK-related cases, only the WAL can be written to RAMDISK, and there are 1216 GB of WALs written to one DataNode.

For the tests without SSD (1T_HDD and 1T_RAM_HDD), we purposely limit the number of flush and compaction actions by using fewer flushers and compactors. This is due to the limited IOPS capability of HDD, which leads to fewer flushes and compactions. Too many concurrent reads and writes can hurt HDD performance, which eventually slows down the overall performance.

DataNode threads can be BLOCKED for up to tens of seconds in 1T_RAM_HDD. We observe this in other cases as well, but it happens most often in 1T_RAM_HDD. This is because each DataNode holds one big lock when creating/finalizing HDFS blocks; these methods can sometimes take tens of seconds (see Long-time BLOCKED threads in DataNode), and the more they are used (in HBase they are used by the flusher, the compactor, and the WAL), the more often the blocking occurs. Writing the WAL in HBase needs to create/finalize blocks, which can be blocked, and consequently users' inputs are blocked. Multiple WALs with a large number of groups, or a WAL per region, might also encounter this problem, especially on HDD.

With the written data distribution in mind, let’s look back at the performance result in Figure 12 and Figure 13. According to them, we have following observations:

  1. Mixing SSD and HDD can greatly improve the performance (136% throughput and 35% latency) compared to pure HDD. But fully replacing HDD with SSD doesn’t show an improvement (98% throughput and 99% latency) over mixing SSD/HDD. This is because the hardware design cannot evenly split the I/O bandwidth to all eight disks, and 94% data are written in SSD while only 6% data are written to HDD in SSD/HDD mixing case. This strongly hints a mix use of SSD/HDD can achieve the best balance between performance and cost. More information is in Disk Bandwidth Limitation and Disk I/O Bandwidth and Latency Varies for Ports.

  2. Including RAMDISK in the SSD/HDD tiered storage gives different results for 1T_RAM_SSD_All_HDD and 1T_RAM_SSD_HDD. The case 1T_RAM_SSD_HDD, where only a small amount of data is written to HDD, improves on the SSD/HDD mixing cases. The result of 1T_RAM_SSD_All_HDD, where a large amount of data is written to HDD, is worse than the SSD/HDD mixing cases. This means that if we distribute the data appropriately to SSD and HDD in HBase, we can get good performance when mixing RAMDISK/SSD/HDD.

  3. The RAMDISK/SSD tiered storage is the winner in both throughput and latency (109% of the throughput and 67% of the latency of the pure SSD case). So, if cost is not an issue and maximum performance is needed, RAMDISK/SSD should be chosen.

Comparing 1T_RAM_HDD to 1T_HDD, throughput decreases by 11%. This is first of all because the RAMDISK in 1T_RAM_HDD consumes part of the RAM, leaving the OS buffer cache less memory to cache data.

Further, with 1T_RAM_HDD the YCSB client can push data at very high speed, so cells accumulate very quickly in the memstore while flushes and compactions on HDD remain slow. RegionTooBusyException occurs more often (the figure below shows a much larger memstore in 1T_RAM_HDD than in 1T_HDD), and we observe much longer GC pauses in 1T_RAM_HDD than in 1T_HDD, up to 20 seconds per minute.

Figure 15. Memstore size in 1T_RAM_HDD and 1T_HDD

Finally, when we try to increase the number of flushers and compactors, performance gets even worse, for the reasons mentioned above when explaining why we use fewer flushers and compactors in the HDD-related tests (see Long-time BLOCKED threads in DataNode).

The performance reduction of 1T_RAM_SSD_All_HDD compared to 1T_SSD_All_HDD (-2%) is due to the same reasons.

We suggest:

  1. Use reasonable configurations for flushers and compactors, especially in HDD-related cases.

  2. Don't directly mix storage types with large performance gaps, such as RAMDISK and HDD.

  3. In many cases we observe long GC pauses of around 10 seconds per minute; an off-heap memstore in HBase is needed to solve the long GC pause issue.

  4. Implement a finer grained lock mechanism in DataNode.


Go to part 6, Issues

HDFS HSM and HBase: Experiment (Part 4 of 7)

This is part 4 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel) 

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Experimentation

Performance for Single Storage Type

First, we will study the HBase write performance on each single storage type (i.e. no storage type mix). This test shows the maximum performance each storage type can achieve and provides a guide for the following tiered storage analysis.

We study the single-storage-type performance with two data sets: 50GB and 1TB data insertion. We believe 1TB is a reasonable data size in practice, and 50GB is used to evaluate the upper performance limit, as HBase performance is typically higher when the data size is small. The 50GB size was chosen so that we do not run out of space during the test: due to the limited capacity of RAMDISK, we have to use a small data size when storing all data in RAMDISK.

50GB Dataset in a Single Storage

The throughput and latency by YCSB for 50GB dataset are listed in Figure 2 and Figure 3. As expected, storing 50GB data in RAMDISK has the best performance whereas storing data in HDD has the worst.

Figure 2. YCSB throughput of a single storage with 50GB dataset

Note: Performance data for SSD and HDD may be better than its real capability as OS can buffer/cache data in memory.

Figure 3. YCSB latency of a single storage with 50GB dataset

Note: Performance data for SSD and HDD may be better than its real capability as OS can buffer/cache data in memory.

For throughput, RAMDISK and SSD are higher than HDD (163% and 128% of HDD throughput, respectively), and their average latencies are dramatically lower than HDD's (only 11% and 32% of HDD latency). This is expected, as HDD has long seek latency. The latency improvements of RAMDISK and SSD over HDD are bigger than the throughput improvements, which is caused by the huge access-latency gap between the storage types. The latencies we measured for accessing 4KB of raw data from RAM, SSD and HDD are 89ns, 80µs and 6.7ms respectively (RAM and SSD are about 75000x and 84x faster than HDD).

Now consider the hardware data (disk, network and CPU in each DataNode) collected during the tests.

Disk Throughput


Case      Theoretical throughput (MB/s)   Measured throughput (MB/s)   Utilization
50G_HDD   127 x 8 = 1016                  870                          86%
50G_SSD   447 x 8 = 3576                  1300                         36%
50G_RAM   >11194                          1800                         <16%

Table 10. Disk utilization of test cases

Note: The data listed in the table for 50G_HDD and 50G_SSD are the total throughput of 8 disks.

Disk throughput is determined by factors such as access pattern, block size and I/O depth. The theoretical disk throughput is measured under performance-friendly conditions; real workloads are not that friendly and limit throughput to lower values. In fact, we observe that disk utilization usually goes up to 100% for HDD.

Network Throughput

Each DataNode is connected by a 20Gbps (2560MB/s) full duplex network, which means both receive and transmit speed can reach 2560MB/s simultaneously. In our tests, the receive and transmit throughput are almost identical, so only the receive throughput data is listed in Table 11.


Case      Theoretical throughput (MB/s)   Measured throughput (MB/s)   Utilization
50G_HDD   2560                            760                          30%
50G_SSD   2560                            1100                         43%
50G_RAM   2560                            1400                         55%

Table 11. Network utilization of test cases

CPU Utilization


Case      Utilization
50G_HDD   36%
50G_SSD   55%
50G_RAM   60%

Table 12. CPU utilization of test cases

Clearly, the performance on HDD (50G_HDD) is bottlenecked on disk throughput. However, for SSD and RAMDISK, neither disk, network nor CPU is the bottleneck, so the bottleneck must be somewhere else: either in software (e.g. HBase compaction, memstore in regions, etc.) or in hardware other than disk, network and CPU.

To further understand why SSD utilization is so low and throughput is not as high as expected (only 132% of HDD, even though SSD has much higher theoretical throughput, 447 MB/s vs. 127 MB/s per disk), we ran an additional test writing 50GB of data to four SSDs per node instead of eight SSDs per node.

Figure 4. YCSB throughput of a single storage type with 50 GB dataset stored in different number of SSDs

Figure 5. YCSB latency of a single storage type with 50 GB dataset stored in different number of SSDs

We can see that although the number of SSDs doubled, the throughput and latency of the eight-SSDs-per-node case (50G_8SSD) improve only modestly (104% throughput and 79% latency) compared to the four-SSDs-per-node case (50G_4SSD). This means the SSDs' capability is far from fully used.

We dug deeper and found that this is caused by current mainstream server hardware design. A mainstream server has two SATA controllers supporting up to 12 Gbps of bandwidth, which limits the total disk bandwidth to around 1.5 GB/s, the bandwidth of approximately 3.4 SSDs. You can find the details in the section Disk Bandwidth Limitation. So the SATA controllers become the bottleneck for eight SSDs per node, which explains why there is almost no improvement in YCSB throughput with eight SSDs per node. In the 50G_8SSD test, the disk is the bottleneck.

Go to part 5, Experiment (continued)

HDFS HSM and HBase: Cluster setup (Part 2 of 7)

This is part 2 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Cluster Setup

In all, five nodes are used in the testing. Figure 1 shows the topology of these nodes; the services on each node are listed in Table 3.

Figure 1. Cluster topology

  1. For HDFS, one node serves as NameNode, three nodes as DataNodes.

  2. For HBase, one HMaster is collocated together with NameNode. Three RegionServers are collocated with DataNodes.

  3. All the nodes in the cluster are connected to the full duplex 10Gbps DELL N4032 Switch. Network bandwidth for client-NameNode, client-DataNode and NameNode-DataNode is 10Gbps. Network bandwidth between DataNodes is 20Gbps (20Gbps is achieved by network bonding).

Nodes 1 to 5 versus the services they run (NameNode, DataNode, HMaster, RegionServer, Zookeeper, YCSB client)






Table 3. Services run on each node

Hardware

Hardware configurations of the nodes in the cluster are listed in the following tables.

Item      Model/Comment
CPU       Intel® Xeon® CPU E5-2695 v3 @ 2.3GHz, dual sockets
Memory    Micron 16GB DDR3-2133MHz, 384GB in total
NIC       Intel 10-Gigabit X540-AT2
SSD       Intel S3500 800G
HDD       Seagate Constellation™ ES ST2000NM0011 2T 7200RPM
RAMDISK   300GB

Table 4. Hardware for DataNode/RegionServer

Note: The OS is stored on an independent SSD (Intel® SSD DC S3500 240GB) for both DataNodes and NameNode. The number of SSDs or HDDs (OS SSD not included) in each DataNode varies for different test cases. See the section ‘Methodology’ for details.

Item      Model/Comment
CPU       Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz, dual sockets
Memory    Micron 16GB DDR3-2133MHz, 260GB in total
NIC       Intel 10-Gigabit X540-AT2
SSD       Intel S3500 800G

Table 5. Hardware for NameNode/HBase MASTER


Item                           Value
Intel Hyper-Threading Tech     On
Intel Virtualization           Disabled
Intel Turbo Boost Technology   Enabled
Energy Efficient Turbo         Enabled

Table 6. Processor configuration

Software

Software    Details
OS          CentOS release 6.5
Kernel      2.6.32-431.el6.x86_64
Hadoop      2.6.0
HBase       1.0.0
Zookeeper   3.4.5
YCSB        0.3.0
JDK         jdk1.8.0_60
JVM Heap    NameNode: 32GB, DataNode: 4GB, HMaster: 4GB, RegionServer: 64GB
GC          G1GC

Table 7. Software stack version and configuration

NOTE: As mentioned in Methodology, HDFS and HBase have been enhanced to support this test.

Benchmarks

We use YCSB 0.3.0 as the benchmark and use one YCSB client in the tests.

This is the workload configuration for 1T dataset:

# cat ../workloads/1T_workload
fieldcount=5
fieldlength=200
recordcount=1000000000
operationcount=0
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0
updateproportion=0
scanproportion=0
insertproportion=0
requestdistribution=zipfian

And we use the following command to start the YCSB client:

./ycsb load hbase-10 -P ../workloads/1T_workload -threads 200 -p columnfamily=family -p clientbuffering=true -s > 1T_workload.dat

Go to part 3, Tuning

HDFS HSM and HBase: Introduction (Part 1 of 7)

This is part 1 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel)

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Introduction

As more and more fast storage types (SSD, NVMe SSD, etc.) emerge, a methodology is needed to achieve better throughput and latency for big data workloads. However, these fast storage types are still expensive and capacity-limited. This study provides a guide for cluster setup with different storage media.

In general, this guide considers the following questions:

  1. What is the maximum performance a user can achieve by using fast storage?

  2. Where are the bottlenecks?

  3. How to achieve the best balance between performance and cost?

  4. How to predict what kind of performance a cluster can have with different storage combinations?


In this study, we evaluate HBase write performance on different storage media. We leverage the hierarchical storage management support in HDFS to store different categories of HBase data on different media.

Three different types of storage (HDD, SSD and RAMDISK) are evaluated. HDD is currently the most widely used storage type; SATA SSD is a faster storage type that is becoming more and more popular. RAMDISK is used to emulate extremely high performance PCIe NVMe based SSDs and upcoming faster storage (e.g. Intel 3D XPoint® based SSDs). Since such hardware was not available to us, we use RAMDISK to perform this emulation, and we believe our results hold for PCIe SSDs and other fast storage types.

Note: RAMDISK is a logical device backed by memory; files stored in RAMDISK are only held in memory.

Methodology

We test HBase write performance with tiered storage in HDFS and compare the performance when storing different categories of HBase data on different storage types. YCSB (Yahoo! Cloud Serving Benchmark, a widely used open source framework for evaluating the performance of data-serving systems) is used as the test workload.

Eleven test cases are evaluated in this study. We pre-split the table into 210 regions in the 1 TB dataset cases to avoid region splits at runtime, and into 18 regions in the 50 GB dataset cases.

The format of case names is <dataset size>_<storage type>.

Case Name (Storage, Dataset): Comment

50G_RAM (RAMDISK, 50GB): Store all the files in RAMDISK. We have to limit data size to 50GB in this case due to the capacity limitation of RAMDISK.

50G_SSD (8 SSDs, 50GB): Store all the files in SATA SSD. Compare the performance by 50GB data with the 1st case.

50G_HDD (8 HDDs, 50GB): Store all the files in HDD.

1T_SSD (8 SSDs, 1TB): Store all files in SATA SSD. Compare the performance by 1TB data with cases in tiered storage.

1T_HDD (8 HDDs, 1TB): Store all files in HDD. Use 1TB in this case.

1T_RAM_SSD (RAMDISK + 8 SSDs, 1TB): Store files in a tiered storage (i.e. different storage mixed together in one test), WAL is stored in RAMDISK, and all the other files are stored in SSD.

1T_RAM_HDD (RAMDISK + 8 HDDs, 1TB): Store files in a tiered storage, WAL is stored in RAMDISK, and all the other files are stored in HDD.

1T_SSD_HDD (4 SSDs + 4 HDDs, 1TB): Store files in a tiered storage, WAL is stored in SSD, some smaller files (not larger than 1.5GB) are stored in SSD, and all the other files are stored in HDD including all archived files.

1T_SSD_All_HDD (4 SSDs + 4 HDDs, 1TB): Store files in a tiered storage, WAL and flushed HFiles are stored in SSD, and all the other files are stored in HDD including all archived files and compacted files.

1T_RAM_SSD_HDD (RAMDISK + 4 SSDs + 4 HDDs, 1TB): Store files in a tiered storage, WAL is stored in RAMDISK, some smaller files (not larger than 1.5GB) are stored in SSD, and all the other files are stored in HDD including all archived files.

1T_RAM_SSD_All_HDD (RAMDISK + 4 SSDs + 4 HDDs, 1TB): Store files in a tiered storage, WAL is stored in RAMDISK, flushed HFiles are stored in SSD, and all the other files are stored in HDD including all archived files and compacted files.

Table 1. Test Cases

NOTE: In all 1TB test cases, we pre-split the HBase table into 210 regions to avoid the region split at runtime.
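For reference, a pre-split table like this can be created through the standard HBase 1.0 Admin API. The sketch below is only an illustration; the table name, column family and split key range are illustrative, not the exact ones used in these tests.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("usertable"));
      desc.addFamily(new HColumnDescriptor("family"));
      // Create the table pre-split into 210 regions spread evenly across the key range,
      // so no region splits (and the extra I/O they cause) happen during the load.
      admin.createTable(desc, Bytes.toBytes("user0"), Bytes.toBytes("user9"), 210);
    }
  }
}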

The metrics in Table 2 are collected during the test for performance evaluation.

Level                   Metrics
Storage media level     IOPS, throughput (sequential/random R/W)
OS level                CPU usage, network IO, disk IO, memory usage
YCSB benchmark level    Throughput, latency

Table 2. Metrics collected during the tests

Go to Part 2, Cluster Setup

Friday April 22, 2016

HDFS HSM and HBase: Tuning (Part 3 of 7)

This is part 3 of a 7 part report by HBase Contributor, Jingcheng Du and HDFS contributor, Wei Zhou (Jingcheng and Wei are both Software Engineers at Intel) 

  1. Introduction
  2. Cluster Setup
  3. Tuning
  4. Experiment
  5. Experiment (continued)
  6. Issues
  7. Conclusions

Stack Enhancement and Parameter Tuning

Stack Enhancement

To perform the study, we made a set of enhancements in the software stack:

  • HDFS:

    • Support a new storage type, RAMDISK

    • Add file-level mover support, so a user can move the blocks of a single file without scanning all the metadata in the NameNode

  • HBase:

    • WAL, flushed HFiles, HFiles generated in compactions, and archived HFiles can each be stored on a different storage type

    • When an HFile is renamed across storage types, the blocks of that file are moved to the target storage asynchronously (the sketch after this list shows the HDFS-side primitive this builds on)
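The tiering itself builds on HDFS storage policies. Below is a minimal sketch of pinning HBase directories to storage types, assuming a Hadoop 2.6+ client; the paths and policy names are illustrative and are not the exact ones used in these tests.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // New blocks written under these paths follow the policy; existing blocks
      // are relocated by the (enhanced, per-file) mover described above.
      dfs.setStoragePolicy(new Path("/hbase/WALs"), "ALL_SSD");  // keep WALs on SSD
      dfs.setStoragePolicy(new Path("/hbase/archive"), "HOT");   // HOT = all replicas on DISK (HDD)
    }
  }
}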

HDFS/HBase Tuning

This step is to find the best configurations for HDFS and HBase.

Known Key Performance Factors in HBase

These are the key performance factors in HBase:

  1. WAL: the write-ahead log guarantees the durability and consistency of the data. Each record inserted into HBase must be written to the WAL, which can slow down user operations; it is latency-sensitive.

  2. Memstore and Flush: Records inserted into HBase are cached in the memstore, and when the memstore reaches a threshold it is flushed to a store file. Slow flushes can lead to long GC (Garbage Collection) pauses and push memory usage up to the region and region server thresholds, which blocks user operations.

  3. Compaction and Number of Store Files: Compaction merges small store files into a larger one, which reduces the number of store files and speeds up reads, but it generates heavy I/O and consumes disk bandwidth at runtime. Less compaction speeds up writes but leaves too many store files, which slows down reads. When there are too many store files, memstore flushes can be delayed, leading to a large memstore and further slowing user operations.

Based on this understanding, the following are the tuned parameters we finally used.

Property                      Value
dfs.datanode.handler.count    64
dfs.namenode.handler.count    100

Table 8. HDFS configuration

Property                                       Value
hbase.regionserver.thread.compaction.small     3 for non-SSD test cases; 8 for all SSD related test cases
hbase.hstore.flusher.count                     5 for non-SSD test cases; 15 for all SSD related test cases
hbase.wal.regiongrouping.numgroups             4
hbase.wal.provider                             multiwal
hbase.hstore.blockingStoreFiles                15
hbase.regionserver.handler.count               200
hbase.hregion.memstore.chunkpool.maxsize       1
hbase.hregion.memstore.chunkpool.initialsize   0.5

Table 9. HBase configuration

Go to part 4, Experiment

Wednesday April 13, 2016

Investing In Big Data: Apache HBase

This is the fourth in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Lars Hofhansl and Andrew Purtell are HBase Committers and PMC members. At Salesforce.com, Lars is a Vice President and Software Engineering Principal Architect, and Andrew is a Cloud Storage Architect.

An earlier version of the discussion in this post was published here on the Salesforce + Open Source = ❤ blog.

- Andrew Purtell


Investing In Big Data: Apache HBase

(Image: Earth at Night / CC BY 3.0)

The world is good at making data. You can see it in every corner of commerce and industry: everything we interact with is getting smarter, and producing a massive stream of readings, geolocations, images, and more. From medical devices to jet engines, it’s transforming every part of the modern world. And it’s accelerating!

Salesforce’s products are all about helping our customers connect with their customers. So obviously, a big part of that is equipping them to interact with all of the data a customer’s interactions might generate, in whatever quantities it shows up in. We do that in many ways, across all of our products. 

In this post, we’d like to zoom in on one particular Open Source data system we use and contribute heavily to: Apache HBase. If you’re not familiar with HBase, this post will start with a few high level concepts about the system, and then go into how it fits in at Salesforce. 

What IS HBase? 

HBase is an open source distributed database. It’s designed to store record-oriented data across a scalable cluster of machines. We sometimes refer to it as a “sparse, consistent, distributed, multi-dimensional, persistent, sorted map”. This usually makes people say “Wat??”. So, to break that down a little bit:

  • distributed :  rows are spread over many machines;
  • consistent :  it’s strongly consistent (at a per-row level);
  • persistent :  it stores data on disk with a log, so it sticks around;
  • sorted : rows are stored in sorted order, making seeks very fast;
  • sparse :  if a row has no value in a column, it doesn’t use any space;
  • multi-dimensional :  data is addressable in several dimensions: tables, rows, columns, versions, etc.

Think of it as a giant list of key / value pairs, spread over a large cluster, sorted for easy access.
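To make that map a little more concrete, here is a minimal sketch of addressing a cell with the standard Java client; the table, row key, family and qualifier names are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CellAddressing {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("metrics"))) {
      // A cell is addressed by (table, row, column family, column qualifier, version).
      Put put = new Put(Bytes.toBytes("device42#2015-10-08"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
      table.put(put);

      Get get = new Get(Bytes.toBytes("device42#2015-10-08"));
      Result r = table.get(get);
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));
    }
  }
}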

People often refer to HBase as a “NoSQL” store–a term coined back in 2009 to refer to a big cohort of similar systems that were doing data storage without SQL (Structured Query Language). Contributors to the HBase project will tell you they have always disliked that term, though; a toaster is also NoSQL. It’s better to say HBase is architected differently than a typical relational database engine, and can scale out better for some use cases.

Google was among the first companies to move in this direction, because they were operating at the scale of the entire web. So, they long ago built their infrastructure on top of a new kind of system (BigTable, which is the direct intellectual ancestor of HBase). It grew from there, with dozens of Open Source variants on the theme emerging in the space of a few years: Cassandra, MongoDB, Riak, Redis, CouchDB, etc. 

Why use a NoSQL store, if you already have a relational database?

Let’s be clear: relational databases are terrific. They’ve been the dominant form of data storage on Earth for nearly three decades, and that’s not by accident. In particular, they give you one really killer feature: the ability to decompose the physical storage of data into different conceptual buckets (entities, aka tables), with relationships to each other … and to modify the state of many related values atomically (transactions). This incurs a cost (for finding and re-assembling the decomposed data when you want to read it) but relational database query planners have gotten so shockingly good that this is actually a perfectly good trade-off in most cases.

Salesforce is deeply dependent on relational databases. A majority of our row-oriented data still lives there, and they are integral to the basic functionality of the system. We’ve been able to scale them to massive load, both via our unique data storage architecture, and also by sharding our entire infrastructure (more on that in The Crystal Shard, a post by Ian Varley on the Salesforce blog). They’re not going anywhere. On the contrary, we do constant and active research into new designs for scaling and running relational databases. 

But, it turns out there are a subset of use cases that have very different requirements from relational data. In particular, less emphasis on webs of relationships that require complex transactions for correctness, and more emphasis on large streams of data that accrue over time, and need linear access characteristics. You certainly can store these in an RDBMS. But, when you do, you’re essentially paying a penalty (performance and scale limitations) for features you don’t need; a simpler design can scale more powerfully.

It’s for those new use cases–and all the customer value they unlock–that we’ve added HBase to our toolkit.

Of all the NoSQL stores, why HBase?

This question comes up a lot: people say, “I heard XYZ-database is web scale, you should use that!”. The world of “polyglot persistence” has produced a boatload of choices, and they all have their merits. In fact, we use almost all of them somewhere in the Salesforce product suite: Cassandra, Redis, CouchDB, MongoDB, etc.

To choose HBase as a key area of investment for Salesforce Core, we went through a pretty intense “bake-off” process that included evaluations of several different systems, with experiments, spikes, POCs, etc. The decision ultimately came down to three big points for us:

  1. HBase is a strongly consistent store. In the CAP Theorem, that means it’s a (CP) store, not an (AP) store. Eventual consistency is great when used for the right purpose, but it can tend to push challenges up to the application developer. We didn’t think we’d be able to absorb that extra complexity, for general use in a product with such a large surface area. 

  2. It’s a high quality project. It did well in our benchmarks and tests, and is well respected in the community. Facebook built their entire Messaging infrastructure on HBase (as well as many other things), and the Open Source community is active and friendly.

  3. The Hadoop ecosystem already had an operational presence at Salesforce. We’ve been using Hadoop in the product for ages, and we already had a pretty good handle on how to deploy and operate it. HBase can use Hadoop’s distributed filesystem for persistence and offers first class integration with MapReduce (and, coming soon, Spark), so is a way to level up existing Hadoop deployments with modest incremental effort.

Our experience was (and still is) that HBase wasn’t the easiest to get started with, but it was the most dependable at scale; we sometimes refer to it as the “industrial strength” scalable store, ready for use in demanding enterprise situations, and taking issues like data durability and security extremely seriously. 

HBase (and its API) is also broadly used in the industry. HBase is an option on Amazon’s EMR, and is also available as part of Microsoft’s Azure offerings. Google Cloud includes a hosted BigTable service sporting the de-facto industry standard HBase client API. (It’s fair to say the HBase client API has widespread if not universal adoption for Hadoop and Cloud storage options, and will likely live on beyond the HBase engine proper.) 

When does Salesforce use HBase?

There are three indicators we look at when deciding to run some functionality on HBase. We want things that are big, record-oriented, and transactionally independent.

By “big”, we mean things in the ballpark of hundreds of millions of rows per tenant, or more. HBase clusters in the wild have been known to grow to thousands of compute nodes, and hundreds of Petabytes, which is substantially bigger than we’d ever grow a single one of our instance relational databases to. (None of our individual clusters are that big yet, mind you.) Data size is not only not a challenge for HBase, it’s a desired feature!

By “record-oriented”, we mean that it “looks like” a database, not a file store. If your data can be modeled as big BLOBs, and you don’t need to independently read or write small bits of data in the middle of those big blobs, then you should use a file store.

Why does it matter if things are record-oriented? After all, HBase is built on top of HDFS, which is … a file store. The difference is that the essential function that HBase plays is specifically to let you treat big immutable files as if they were a mutable database. To do this magic, it uses something called a Log Structured Merge Tree, which provides for both fast reads and fast writes. (If that seems impossible, read up on LSMs; they’re legitimately impressive.)

And, what do we mean by “transactionally independent”? Above, we described that a key feature of relational databases is their transactional consistency: you can modify records in many different tables and have those modifications either commit or rollback as a unit. HBase, as a separate data store, doesn’t participate in these transactions, so for any data that spans both stores, it requires application developers to reason about consistency on their own. This is doable, but it is tricky. So, we prefer to emphasize those cases where that reasoning is simple by design.

(For the hair-splitters out there, note that HBase does offer consistent “transactions”, but only on a single-row basis, not across different rows, objects, or (most importantly) different databases.)
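That single-row atomicity is exposed directly in the client API through check-and-mutate operations. A minimal sketch follows; the table, family, qualifier and values are illustrative, not anything from our actual schema.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowAtomicity {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("accounts"))) {
      byte[] row = Bytes.toBytes("acct-001");
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("status"), Bytes.toBytes("CLOSED"));
      // Atomically apply the Put only if the row's current status is still "OPEN".
      // The check and the mutation happen as one unit, but only within this single row.
      boolean applied = table.checkAndPut(row, Bytes.toBytes("f"), Bytes.toBytes("status"),
          Bytes.toBytes("OPEN"), put);
      System.out.println("applied = " + applied);
    }
  }
}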

One criterion for using HBase that you’ll notice we didn’t mention is that as a “NoSQL” store, you wouldn’t use it for something that called for using SQL. The reason we didn’t mention that is that it isn’t true! We built and open sourced a library called “Phoenix” (which later became Apache Phoenix) which brings a SQL access layer to HBase. So, as they say, “We put the SQL back in NoSQL”.

We treat HBase as a “System Of Record”, which means that we depend on it to be every bit as secure, durable and available as our other data stores. Getting it there required a lot of work: accounting for site switching, data movement, authenticated access between subsystems, and more. But we’re glad we did!

What features does Salesforce use HBase for?

To give you a concrete sense of some of what HBase is used for at Salesforce, we’ll mention a couple of use cases briefly. We’ll also talk about more in future posts.

The main place you may have seen HBase in action in Salesforce to date is what we call Salesforce Shield. Shield is a connected suite of product features that enable enterprise businesses to encrypt data and track user actions. This is a requirement in certain protected sectors of business, like Health and Government, where compliance laws dictate the level of retention and oversight of access to data.

One dimension of this feature is called Field Audit Trail (FAT). In Salesforce, there’s an option that allows you to track changes made to fields (either directly by users through a web UI or mobile device, or via other applications through the API). This historical data is composed of “before” and “after” values for every tracked field of an object. These stick around for a long time, as they’re a matter of historical record. That means that if you have data that changes very frequently, this data set can grow rapidly, and without any particular bound. So we use HBase as a destination for moving older sets of audit data over, so it’s still accessible via the API, but doesn’t have any cost to the relational database optimizer. The same principle applies to other data that behaves the same way; we can archive this data into HBase as a cold store.

Another part of Shield that uses HBase is the Event Monitoring feature. The initial deployment was based on Hadoop, and culls a subset of application log lines, making them available to customers directly as downloadable files. New work is in progress to capture some event data into HBase, like Login Forensics, in real time, so it can be queried interactively. This could be useful for security and audit, limit monitoring, and general usage tracking. We’re also introducing an asynchronous way to query this data so you can work with the potentially large resulting data sets in sensible ways.

More generally, the Platform teams at Salesforce have been cooking up a general way for customers to use HBase. It’s called BigObjects (more here). BigObjects behave in most (but not all) ways like standard relational-database-backed objects, and are ideally suited for use cases that require ingesting large amounts of read-only data from external systems: log files, POS data, event data, clickstreams, etc.

We’re adding more features that use HBase in every release, like improving mention relevancy in Chatter, tracking Apex unit test results, caching report data, and more. We’ll post follow-ups here and on the Salesforce Engineering Medium Channel about these as they evolve.

Contributing To HBase

So, we’ve got HBase clusters running in data centers all over the world, powering new features in the product, and running on commodity hardware that can scale linearly.

But perhaps more importantly, Salesforce has put a premium on not just being a user of HBase, but also being a contributor. The company employs multiple committers on the Apache project, including both of us (Lars is a committer and PMC member, and Andrew is an Apache VP and PMC Chair). We’ve written and committed hundreds of patches, including some major features. As committers, we’ve reviewed and committed thousands of patches, and served as release managers for the 0.94 release (Lars) and 0.98 release (Andrew).

We’ve also presented dozens of talks at conferences about HBase, including HBaseCon and OSCON.

We’re committed to doing our work in the community, not in a private fork except where temporarily needed for critical issues. That’s a key tenet of the team’s OSS philosophy — no forking! — that we’ll talk about more in an upcoming post on the Salesforce Engineering Medium Channel.

By the way, outside of the Core deployment of HBase that we’ve described here, there are actually a number of other places in the larger Salesforce universe where HBase is deployed as well, including Data.com, the Marketing Cloud (more here), and Argus, an internal time-series monitoring service. More on these uses soon, too!

Conclusion

We’re happy to have been a contributing part of the HBase community these last few years, and we’re looking forward to even more good stuff to come.

Thursday December 17, 2015

Offheaping the Read Path in Apache HBase: Part 1 of 2

Detail on the work involved in making the Apache HBase read path work against off-heap memory (without copy).

Offheaping the Read Path in Apache HBase: Part 2 of 2

by HBase Committers Anoop Sam John, Ramkrishna S Vasudevan, and Michael Stack

This is part two of a two part blog. Herein we compare before and after off heaping. See part one for preamble and detail on work done.

Performance Results

There were two parts to our performance measurement.  

  1. Using HBase’s built-in Performance Evaluation (PE) tool.  

  2. Using YCSB to measure the throughput.

The PE test was conducted on a single node machine. A table is created and loaded with 100 GB of data. The table has one CF and one column per row; each cell value size is 1K. Configuration of the node:

System configuration

CPU : Intel(R) Xeon(R) CPU with 8 cores.
RAM : 150 GB
JDK : 1.8

HBase configuration

HBASE_HEAPSIZE = 9 GB
HBASE_OFFHEAPSIZE = 105 GB
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.2</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value> 104448 </value>
</property>

20% of the heap is allocated to the L1 cache (LRU cache). When L2 is enabled, L1 holds no data, just index and bloom filter blocks. 102 GB off heap space is allocated to the L2 cache (BucketCache)

Before performing the read tests, we made sure that all data was loaded into the BucketCache so there is no disk I/O. The read workloads of the PE tool run with multiple threads; we consider the average completion time for each thread to do the required reads.

1. Each thread does 100-row multi get operations 1000000 times. We can see a 55 – 82% reduction in average run time. See the graph below for tests with 5 to 75 reading threads; the Y axis shows the average completion time, in seconds, for one thread.

Here each thread is doing 100000000 row gets in total; converting this to throughput numbers we see:


                              5 Threads      10 Threads     20 Threads     25 Threads     50 Threads     75 Threads
Throughput without patch      5594092.638    7152564.19     7001330.25     6920798.38     6113142.03     5463246.92
Throughput with HBASE-11425   11353315.17    19782393.7     28477858.5     28216704.3     30229746.1     30647270.3
Throughput gain               2x             2.7x           4x             4x             4.9x           5.6x


So in the without-patch case, at the 20 thread level the system reaches peak load and throughput starts to fall off. With HBASE-11425 this is not the case even at 75 threads: scaling is mostly linear with more load. The major factor that helps here is reduced GC activity.

2. Each thread does a range scan of 10000 rows with all of the data filtered out on the server side. The filtering is done to see the server side gain alone and avoid any impact of network and/or client-side bottlenecks. Each thread does the scan operation 10000 times. We can see a 55 – 74% reduction in the average run time of each thread. See the graph below for tests with 5 to 75 reading threads; the Y axis shows the average completion time, in seconds, for one thread.




3. Another range scan test is performed with part of the data returned to the client. The test returns 10% of the total rows to the client and filters out the remaining rows at the server. The graph below is for a test with 10, 20, 25 and 50 threads. Again the Y axis gives the average thread completion time, in seconds. The latency gain is 28 – 39%.


The YCSB test is done on the same cluster with 90 GB of data. We used a similar system configuration and HBase configuration as for the PE tests.

The YCSB setup involves creating a table with a single column family with around 90 GB of data. There are 10 columns, each with 100 bytes of data (each row having 1k of data). All the readings are taken after ensuring that the entire data set is in the BucketCache. The multi get test has each thread doing 100-row gets, with 5000000 such operations in total; a sketch of such a multi get is shown below. The range scan test does random range scans with 1000000 operations.
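For reference, a 100-row multi get like the one this workload issues can be expressed with the standard client API. The sketch below is illustrative only; the table name, family and key pattern are assumptions, not the exact YCSB internals.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("usertable"))) {
      // Batch 100 row keys into a single multi get, as each client thread does.
      List<Get> gets = new ArrayList<>();
      for (int i = 0; i < 100; i++) {
        gets.add(new Get(Bytes.toBytes("user" + (long) (Math.random() * 90_000_000L))));
      }
      Result[] results = table.get(gets);  // sent as batched RPCs; results align with the gets list
      System.out.println("fetched " + results.length + " rows");
    }
  }
}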


Multi get:

Threads   With HBASE-11425 (ops/sec)   Without patch (ops/sec)
10        28045.53                     23277.97
25        45767.99                     25922.18
50        58904.03                     24558.72
75        63280.86                     24316.74



Range scan:

Threads   With HBASE-11425 (ops/sec)   Without patch (ops/sec)
10        5332.451701                  4139.416
25        8685.456204                  5340.796
50        9180.235565                  5149
75        9019.192842                  4981.8

For multi get there is a 20 – 160% throughput gain, whereas for range scan it is 20 – 80%.

With HBASE-11425 we see a linear throughput increase with more threads, whereas the old code starts performing badly with more threads (see the 50 and 75 thread results).

GC graphs

With HBASE-11425, we serve data directly from the off heap cache rather than copy each of the blocks on heap for each of the reads.  So we should be doing much better with respect to GC on the  RegionServer side.  Below are the GC graph samples taken on one RegionServer during the PE test run.  We can clearly notice that with the changes in HBASE-11425, we are doing much less (better) GC.

MultiGets – without HBASE-11425 (25 threads)


Multigets – with HBASE-11425(25 threads)



ScanRange10000 – without HBASE-11425 (20 threads)


ScanRange10000 – with HBASE-11425 (20 threads)




Future Work

One implication of the findings above is that we should always run with the off heap cache on. We were reluctant to do this in the past, when reads from the off heap cache took longer; this is no longer the case. We will look into making this option on by default. Also, this posting has described our conversion of the read pipeline to make it run with offheap buffers. Next up, naturally, would be making the write pipeline offheap.

Conclusion

Some parts of this work made it into branch-1 but to run with a fully off heap read path, you will have to wait on the HBase 2.0 release which should be available early next year (2016). Enabling L2 (BucketCache) in off heap mode will automatically turn on the off heap mechanisms described above. We hope you enjoy these improvements made to the Apache HBase read path.

Thursday October 08, 2015

Imgur Notifications: From MySQL to HBase

This is the third in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Carlos J. Espinoza is an engineer at Imgur.

An earlier version of the discussion in this post was published here on the Imgur Engineering blog.

- Andrew Purtell


Imgur Notifications: From MySQL to HBase

Imgur is a heavy user of MySQL. It has been a part of our stack since our beginning. However, with our scale it has become increasingly difficult to throw more features at it. For our latest feature upgrade, we re-implemented our notifications system and migrated it over from MySQL to HBase. In this post, I will talk about how HBase solved our use case and the features we exploited.

To add some context, previously we supported two types of notifications: messages and comment replies, all stored in MySQL. For this upgrade, we decided to support several additional notification types. We also introduced rules around when a notification should be delivered. This change in spec made it challenging to continue with our previous model, so we started from scratch.

Early in the design phase, we envisioned a world where MySQL remained the primary store. We put together some schemas, some queries and, for the better, stopped ourselves from making a huge mistake. We had to create a couple columns for every type of notification. Creating a new notification type afterwards would mean a schema change. Our select queries would require us to join against other application tables. We designed an architecture that could work, but we would sacrifice decoupling, simplicity, scalability, and extensibility.

Some of our notifications require that they only be delivered to the user once a milestone is crossed. For instance, if a post reaches 100 points, Imgur will notify the user at the 100 point mark. We don’t want to bother users with 99 notifications in between. So, at scale, how do we know that the post has reached 100 points?

A notification could have multiple milestones. We only want to deliver once the milestone is hit.

Considering MySQL is our primary store for posts, one way to do it is to increment the points counter in the posts table, then execute another query fetching the row and check if the points reached the threshold of 100. This approach has a few issues. Increment and then fetch is a race condition. Two clients could think they reached 100 points, delivering two notifications for the same event. Another problem is the extra query. For every increment, we must now fetch the volatile column, adding more stress to MySQL.

Though it is technically possible to do this in MySQL using transactions and consistent read locks, lock contention would make it potentially very expensive for votes, as voting is one of the most frequent operations on our site. Seeing as we already use HBase in other parts of our stack, we switched gears and built our system on top of it. Here is how we use it to power notifications in real time and at scale.

Sparse Columns

At Imgur, each notification is composed of one or more events. The following image of our notifications dropdown illustrates how our notifications are modeled.

As illustrated, each notification is composed of one or more events. A notification maps to a row in HBase, and each event maps to multiple columns, one of which is a counter. This model makes our columns very sparse as different types of notifications have different types of events. In MySQL, this would mean a lot of NULL values. We are not tied to a strict table schema, so we can easily add other notification types using the same model.

Atomic Increments

HBase has an atomic increment operation that returns the new value in the same call. It’s a simple feature, but our implementation depends on this operation. This allows our notification delivery logic on the client to be lightweight: increment, and deliver the notification only if a milestone is crossed. No extra calls. In some cases, this means we now keep two counters. For instance, the points count in the MySQL table for posts, and the points count in HBase for notifications. Technically, they could get out of sync, but this is an edge case we optimize for.
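A minimal sketch of this increment-then-check pattern with the standard Java client; the table, family, qualifier and threshold are illustrative, not Imgur's actual schema.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MilestoneIncrement {
  static final long MILESTONE = 100;

  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("notifications"))) {
      // Atomically add 1 to the event counter and get the new value back in the same call.
      long points = table.incrementColumnValue(
          Bytes.toBytes("post#12345"), Bytes.toBytes("e"), Bytes.toBytes("points"), 1L);
      // Exactly one client observes the counter crossing the milestone, so the
      // notification is delivered once, with no extra read and no race condition.
      if (points == MILESTONE) {
        System.out.println("deliver milestone notification");
      }
    }
  }
}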

Another benefit of doing increments in HBase is that it allows us to decouple the notifications logic from the application logic. All we need to know to deliver a notification is whether its counter has crossed a pre-defined threshold. For instance, we no longer need to know how to fetch a post and get its point count. HBase has allowed us to solve our problem in a more generic fashion.

Fast Table Scans

We also maintain a secondary order table. It stores notification references ordered by when they were last delivered using reversed timestamps. When users open their notifications dropdown, we fetch their most recent notifications by performing a table scan limited by their user ID. We can also support scanning for different types of notifications by using scan filters.
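A hedged sketch of that kind of scan follows; the row key layout, table name and limits are illustrative assumptions, not Imgur's actual design.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RecentNotificationsScan {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("notification_order"))) {
      String userId = "user42";
      // Row key: userId + "#" + zero-padded (Long.MAX_VALUE - deliveryTime), so within a
      // user's prefix the most recently delivered notifications sort first.
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes(userId + "#"));
      scan.setStopRow(Bytes.toBytes(userId + "$"));  // '$' follows '#', closing this user's key range
      scan.setCaching(20);                           // only a dropdown's worth of rows is needed
      try (ResultScanner scanner = table.getScanner(scan)) {
        int shown = 0;
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
          if (++shown == 20) break;
        }
      }
    }
  }
}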

Other Benefits

With HBase we gain many other benefits, like linear scalability and replication. We can forward data to another cluster and run batch jobs on that data. We also get strong consistency. It is extremely important for notifications to be delivered exactly once when a milestone is crossed. We can make that guarantee knowing that HBase has strong row level consistency. It’s likely that we’ll use versioning in the future, but even without use of it, HBase is a great choice for our use case.

Imgur notifications is a young project, and we’ll continue to make improvements to it. As it matures and we learn more from it, I hope to share what we’ve built with the open source community.
