Entries tagged [offheap]

Thursday March 09, 2017

Offheap Read-Path in Production - The Alibaba story

By Yu Li (HBase Committer/Alibaba), Yu Sun (Alibaba), Anoop Sam John (HBase PMC/Intel), and Ramkrishna S Vasudevan (HBase PMC/Intel)

Introduction

HBase is the core storage system in Alibaba’s Search Infrastructure. Critical e-commerce data about products, sellers, promotions, etc. is all synced into HBase from various online databases. We query HBase to build and provide real time updates on the search index. In addition, user behavior data, such as impressions, clicks and transactions, is also streamed into HBase. It serves as feature data for our online machine learning system, which optimizes the personalized search result in real time. The whole system produces mixed workloads on HBase that include bulkload/snapshot for full index building, batch mutation for real time index updates, and streaming/continuous query for online machine learning. Our biggest HBase cluster has reached more than 1500 nodes and 200,000 regions. It routinely serves tens of millions of QPS.


Both latency and throughput are important for our HBase deploy. From the latency perspective, it directly affects how quickly users can search an item after it has been posted, as well as how ‘real-time’ we can run our inventory accounting. From the throughput perspective, it determines how fast the machine learning programs can process data, and thus the accuracy of the recommendations made. What’s more, since data is distributed across the cluster and accesses are balanced, applications are sensitive to latency spikes on any single node, which makes GC a critical factor in our system's serving capability.


By caching more data in memory, the read latency (and throughput) can be greatly improved. If we can get our data from local cache, we save having to make a trip to HDFS. Apache HBase has two layers of data caching. There is what we call “L1” caching, our first caching tier – which caches data in an on heap Least Recently Used (LRU) cache -- and then there is an optional, “L2” second cache tier (aka Bucket Cache).


Bucket Cache can be configured to keep its data in a file -- i.e. caching data in a local file on disk -- or in memory. File mode usually is able to cache more data, but there will be more attendant latency reading from a file vs reading from memory. Bucket Cache can also be configured to use memory outside of the Java heap space (‘offheap’), so users generally configure a large offheap L2 cache along with a smaller onheap L1 cache.


At Alibaba we use an offheap L2 cache, dedicating 12GB to Bucket Cache on each node. We also backported a patch currently in master branch only (to be shipped in the coming hbase-2.0.0) which makes the hbase read path run offheap end-to-end. This combination improved our average throughput significantly. In the sections below, we’ll first talk about why the off-heaping has to be end-to-end, then describe how we backported the feature from the master branch to our customized 1.1.2, and finally show the performance with the end-to-end read-path offheap in an A/B test and on Singles’ Day (11/11/2016).


Necessity of End-to-end Off-heaping

Before offheap, the QPS curve from our A/B test cluster looked like the one below:


[Figure: Throughput without offheap (A/B testing, 450 nodes)]


We could see that there were dips in average throughput. Concurrently, the average latency would be high during these times.


Checking RegionServer logs, we could see that there were long GC pauses happening. Further analysis indicated that when disk IO is fast enough, as on PCIe-SSD, blocks would be evicted from cache quite frequently even when there was a high cache hit ratio. The eviction rate was so high that the GC speed couldn’t keep up, bringing on frequent long GC pauses that impacted throughput.


Looking to improve throughput, we tried the existing Bucket Cache in 1.1.2 but found GC was still heavy. In other words, although Bucket Cache in branch-1 (the branch for current stable releases) already supports using offheap memory, it tends to generate lots of garbage. To understand why end-to-end off-heaping is necessary, let’s see how reads from Bucket Cache work in branch-1. But before we do this, let’s understand how the bucket cache itself is organized.


The allocated offheap memory is reserved as DirectByteBuffers, each of size 4 MB. So we can say that physically the entire memory area is split into many buffers, each of size 4 MB. Now on top of this physical layout, we impose a logical division. Each logical area is sized to accommodate HFile blocks of a particular size (remember, HFiles are read as blocks, and it is block by block that data gets cached in the L1 or L2 cache). The logical splits accommodate different HFile block sizes from 4 KB to 512 KB (this is the default; sizes are configurable). In each of the splits, there is more than one slot into which we can insert a block. When caching, we find an appropriately sized split and then an empty slot within it, and there we insert the block. Remember all slots are offheap. For more details on Bucket Cache, refer here [4]. Refer to the HBase Reference Guide [5] for how to set up Bucket Cache.
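To make the layout concrete, here is a minimal Java sketch of the idea (illustrative pseudocode under the assumptions above, not HBase’s actual BucketAllocator/BucketCache classes): the offheap area is carved into 4 MB DirectByteBuffers, and a block is routed to the smallest logical bucket size that can hold it.

import java.nio.ByteBuffer;

// Illustrative sketch only -- not the real HBase BucketCache implementation.
public class BucketLayoutSketch {
  // Default-style logical bucket sizes, 4 KB up to 512 KB.
  static final int[] BUCKET_SIZES = {
      4 * 1024, 8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024,
      128 * 1024, 256 * 1024, 512 * 1024};

  // Physically, the offheap area is a series of 4 MB DirectByteBuffers.
  static final int CHUNK_SIZE = 4 * 1024 * 1024;
  final ByteBuffer[] chunks;

  BucketLayoutSketch(long totalOffheapBytes) {
    chunks = new ByteBuffer[(int) (totalOffheapBytes / CHUNK_SIZE)];
    for (int i = 0; i < chunks.length; i++) {
      chunks[i] = ByteBuffer.allocateDirect(CHUNK_SIZE);
    }
  }

  // A block is cached in a slot of the smallest bucket size that fits it.
  static int bucketSizeFor(int blockLength) {
    for (int size : BUCKET_SIZES) {
      if (blockLength <= size) {
        return size;
      }
    }
    throw new IllegalArgumentException("block larger than largest bucket: " + blockLength);
  }
}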


In branch-1, when a read happens out of the L2 cache, we have to copy the entire block into a temporary onheap area. This is because the HBase read path assumes block data is backed by an onheap byte array. Also, given the physical and logical split described above, there is a chance that one HFile block's data is spread across two physical ByteBuffers.


When a random row read happens in our system, even if the data is available in the L2 cache, we end up reading the entire block -- usually ~64 KB in size -- into a temporary onheap allocation for every row read. This creates lots of garbage (and please note that without the HBASE-14463 fix, this copy from offheap to onheap reduced read performance a lot). Our read workload is so high that this copying produces lots of GC, so we had to find a way to avoid copying block data from the offheap cache into temporary onheap arrays.
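The pattern looks roughly like the following Java sketch (simplified pseudocode, not the actual HBase classes): for every row read, a fresh onheap byte[] the size of the whole block is allocated and filled from the offheap chunks, and it becomes garbage as soon as the row has been served.

import java.nio.ByteBuffer;

// Simplified view of the branch-1 read path: copy the whole cached block onheap per read.
public class OnheapCopyReadSketch {
  static final int CHUNK_SIZE = 4 * 1024 * 1024;

  // One short-lived byte[] (typically ~64 KB) per row read -- the GC pressure source.
  static byte[] readBlockOnheap(ByteBuffer[] chunks, long offset, int blockLength) {
    byte[] onheap = new byte[blockLength];            // becomes garbage after the row is served
    int chunkIdx = (int) (offset / CHUNK_SIZE);
    int pos = (int) (offset % CHUNK_SIZE);
    int copied = 0;
    while (copied < blockLength) {
      ByteBuffer src = chunks[chunkIdx].duplicate();  // duplicate: independent position/limit
      src.position(pos);
      int toCopy = Math.min(blockLength - copied, CHUNK_SIZE - pos);
      src.get(onheap, copied, toCopy);
      copied += toCopy;
      chunkIdx++;                                     // a block may straddle two 4 MB chunks
      pos = 0;
    }
    return onheap;                                    // the rest of the read path works on this array
  }
}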

How was it achieved? - Our Story

The HBASE-11425 Cell/DBB end-to-end read-path work in the master branch avoids the need to copy offheap block data back onheap when reading. The entire read path is changed to work directly on the offheap Bucket Cache area and serve data from there to clients (see the details of this work and its performance improvements in [1] and [2]). So we decided to try this work in our custom HBase version based on 1.1.2, backporting it from the master branch.
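Conceptually, instead of materializing a byte[], the read path hands around positioned views over the cached offheap memory. The Java sketch below illustrates only the principle (HBase actually uses its own ByteBuff abstraction and offheap-aware Cell implementations; the names here are illustrative, and the case of a block straddling two chunks is ignored for brevity):

import java.nio.ByteBuffer;

// Rough illustration of serving data straight from the offheap cache, with no onheap copy.
public class OffheapReadSketch {
  // Return a read-only window positioned on the cached block; nothing is copied onheap.
  static ByteBuffer blockView(ByteBuffer chunk, int offsetInChunk, int blockLength) {
    ByteBuffer view = chunk.duplicate().asReadOnlyBuffer();
    view.position(offsetInChunk);
    view.limit(offsetInChunk + blockLength);
    return view.slice();                      // a zero-copy window over offheap memory
  }

  // Fields are then decoded in place from the view, e.g. reading a length prefix,
  // instead of first turning the whole block into a byte[].
  static int readIntAt(ByteBuffer blockView, int offsetInBlock) {
    return blockView.getInt(offsetInBlock);   // absolute read; does not move the position
  }
}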


The backport cost us about two person-months, including getting familiar with and analyzing the JIRAs to port, fixing UT failures, fixing problems found in functional testing (HBASE-16609/16704), and resolving compatibility issues (HBASE-16626). We have listed the full list of back-ported JIRAs here [3]; please refer to it for more details if interested.


About configuration: since the tables of different applications use different block sizes -- from 4KB to 512KB -- the default bucket splits just worked for our use case. We also kept the default values for the other configurations, after careful testing and further tuning in production. Our configs are listed below:


Alibaba’s Bucket Cache related configuration

<property>
  <name>hbase.bucketcache.combinedcache.enabled</name>
  <value>true</value>
</property>

<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>

<property>
  <name>hbase.bucketcache.size</name>
  <value>12288</value>
</property>

<property>
  <name>hbase.bucketcache.writer.queuelength</name>
  <value>64</value>
</property>

<property>
  <name>hbase.bucketcache.writer.threads</name>
  <value>3</value>
</property>
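Note that hbase.bucketcache.size is interpreted as megabytes when the value is larger than 1, so 12288 MB / 1024 = 12 GB -- the 12 GB of Bucket Cache per node mentioned earlier.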


How does it work? - A/B Test and Singles’ Day

We tested the performance on our A/B test cluster (450 physical machines, each with 256 GB memory and 64 cores) after the backport, and got better throughput, as illustrated below:

[Figure: Throughput with offheap (A/B testing, 450 nodes)]


Note that the average throughput curve is now much more linear, with no more dips in throughput over time.


The version with the offheap read path feature was released on October 10th and has been online ever since (more than 4 months). Together with the NettyRpcServer patch (HBASE-15756), we successfully made it through our 2016 Singles’ Day, with peaks at 100K QPS on a single RS.





[1] https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in

[2] http://www.slideshare.net/HBaseCon/offheaping-the-apache-hbase-read-path

[3] https://issues.apache.org/jira/browse/HBASE-17138

[4] https://issues.apache.org/jira/secure/attachment/12562209/Introduction%20of%20Bucket%20Cache.pdf

[5] http://hbase.apache.org/book.html#offheap.blockcache

Thursday December 17, 2015

Offheaping the Read Path in Apache HBase: Part 1 of 2

Detail on the work involved in making the Apache HBase read path work against off heap memory (without copying).

Offheaping the Read Path in Apache HBase: Part 2 of 2

by HBase Committers Anoop Sam John, Ramkrishna S Vasudevan, and Michael Stack

This is part two of a two part blog. Herein we compare before and after off heaping. See part one for preamble and detail on work done.

Performance Results

There were two parts to our performance measurement.  

  1. Using HBase’s built-in Performance Evaluation (PE) tool.  

  2. Using YCSB to measure the throughput.

The PE test was conducted on a single-node machine. A table was created and loaded with 100 GB of data. The table has one CF and one column per row, and each cell value is 1 KB. Configuration of the node:

System configuration

CPU : Intel(R) Xeon(R) CPU with 8 cores.
RAM : 150 GB
JDK : 1.8

HBase configuration

HBASE_HEAPSIZE = 9 GB
HBASE_OFFHEAPSIZE = 105 GB
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.2</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>104448</value>
</property>

20% of the heap is allocated to the L1 cache (LRU cache). When L2 is enabled, L1 holds no data, just index and bloom filter blocks. 102 GB of off heap space is allocated to the L2 cache (BucketCache).
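To relate the settings to those numbers: hbase.bucketcache.size is in MB here, so 104448 MB / 1024 = 102 GB for the L2 BucketCache, and hfile.block.cache.size of 0.2 gives the L1 cache 0.2 x 9 GB ≈ 1.8 GB of the heap.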

Before performing the read tests, we made sure that all data was loaded into the BucketCache so there is no i/o. The read workloads of the PE tool run with multiple threads. We consider the average completion time for each of the threads to do the required reads.

  1. Each thread does multi gets of 100 rows each, 1,000,000 times. We can see that there is a 55 – 82% reduction in average run time. See the graph below for the test with 5 to 75 reading threads. The Y axis shows the average completion time, in seconds, for one thread.

So each thread is doing 100,000,000 row gets in total; converting this to throughput numbers we can see:


|                             | 5 Threads   | 10 Threads  | 20 Threads  | 25 Threads  | 50 Threads  | 75 Threads  |
|-----------------------------|-------------|-------------|-------------|-------------|-------------|-------------|
| Throughput Without Patch    | 5594092.638 | 7152564.19  | 7001330.25  | 6920798.38  | 6113142.03  | 5463246.92  |
| Throughput With HBASE-11425 | 11353315.17 | 19782393.7  | 28477858.5  | 28216704.3  | 30229746.1  | 30647270.3  |
| Throughput gain             | 2x          | 2.7x        | 4x          | 4x          | 4.9x        | 5.6x        |
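As a rough cross-check (assuming these throughput figures are total rows read per second summed across threads): at 75 threads the run does 75 x 100,000,000 = 7.5 billion gets, so ~5.46M rows/sec without the patch implies roughly 1,370 seconds of average completion time per thread, versus roughly 245 seconds with HBASE-11425 -- about an 82% reduction, consistent with the range quoted above.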


So without the patch, at the 20-thread level the system reaches peak load and throughput starts to fall off. With HBASE-11425 this is not the case: even with 75 threads the scaling is mostly linear with more load. The major factor which helps us here is reduced GC activity.

  2. Each thread does a range scan of 10,000 rows, with all of the data filtered out on the server side. The filtering is done to see the server-side gain alone and to avoid any impact of network and/or client app side bottlenecks. Each thread does the scan operation 10,000 times. We can see that there is a 55 – 74% reduction in the average run time of each thread. See below the graph for the test with 5 to 75 reading threads. The Y axis shows the average completion time, in seconds, for one thread.




  3. Another range scan test was performed with part of the data returned back to the client. The test returns 10% of the total rows back to the client and the remaining rows are filtered out at the server. The graph below is for a test with 10, 20, 25 and 50 threads. Again the Y axis gives the average thread completion time, in seconds. The latency gain is 28 – 39%.


The YCSB test was done on the same cluster with 90 GB of data. We had a similar system configuration and HBase configuration as for the PE tests.

The YCSB setup involves creating a table with a single column family with around 90 GB of data. There are 10 columns, each with 100 bytes of data (each row having ~1 KB of data). All the readings are taken after ensuring that the entire data set is in the BucketCache. The multi get test has each thread doing 100-row gets, 5,000,000 times. The range scan test does random range scans, with 1,000,000 operations.
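For scale: with 10 columns of 100 bytes each, a row carries roughly 1 KB of data, so the ~90 GB data set corresponds to on the order of 90 million rows (ignoring key and per-cell metadata overhead).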


Multi get throughput:

| Threads | With HBASE-11425 (Ops/sec) | Without patch (Ops/sec) |
|---------|----------------------------|-------------------------|
| 10      | 28045.53                   | 23277.97                |
| 25      | 45767.99                   | 25922.18                |
| 50      | 58904.03                   | 24558.72                |
| 75      | 63280.86                   | 24316.74                |



Range scan throughput:

| Threads | With HBASE-11425 (Ops/sec) | Without patch (Ops/sec) |
|---------|----------------------------|-------------------------|
| 10      | 5332.451701                | 4139.416                |
| 25      | 8685.456204                | 5340.796                |
| 50      | 9180.235565                | 5149                    |
| 75      | 9019.192842                | 4981.8                  |

For multi get there is a 20 – 160% throughput gain, whereas for range scan it is 20 – 80%.

With HBASE-11425 we can see there is a linear throughput increase with more threads, whereas the old code starts performing badly with more threads (see the 50- and 75-thread cases).

GC graphs

With HBASE-11425, we serve data directly from the off heap cache rather than copy each of the blocks on heap for each of the reads.  So we should be doing much better with respect to GC on the  RegionServer side.  Below are the GC graph samples taken on one RegionServer during the PE test run.  We can clearly notice that with the changes in HBASE-11425, we are doing much less (better) GC.

MultiGets – without HBASE-11425 (25 threads)


Multigets – with HBASE-11425(25 threads)



ScanRange10000 – without HBASE-11425 (20 threads)


ScanRange10000 – with HBASE-11425 (20 threads)




Future Work

One implication of the findings above is that we should always run with the off heap cache on. We were reluctant to do this in the past, when reads from the off heap cache took longer. This is no longer the case. We will look into making this option on by default. Also, this posting has described our conversion of the read pipeline to make it run with offheap buffers. Next up, naturally, would be making the write pipeline offheap.

Conclusion

Some parts of this work made it into branch-1 but to run with a fully off heap read path, you will have to wait on the HBase 2.0 release which should be available early next year (2016). Enabling L2 (BucketCache) in off heap mode will automatically turn on the off heap mechanisms described above. We hope you enjoy these improvements made to the Apache HBase read path.

Friday August 08, 2014

Comparing BlockCache Deploys


St.Ack on August 7th, 2014

A report comparing BlockCache deploys in Apache HBase, done for issue HBASE-11323 BucketCache all the time! It attempts to roughly equate five different deploys and compare how well they do under four different loading types that vary from no cache misses through to missing the cache most of the time, and makes recommendations on when to use which deploy.

Prerequisite

In Nick Dimiduk's BlockCache 101 blog post, he details the different options available in HBase. We test what remains after purging SlabCache and the caching policy implementation DoubleBlockCache, which have been deprecated in HBase 0.98 and removed in trunk because of Nick's findings and those of others.

Nick varies the JVM Heap+BlockCache sizes AND dataset size.  This report keeps JVM Heap+BlockCache size constant and varies the dataset size only. Nick looks at the 99th percentile only.  This article looks at that, as well as GC, throughput and loadings. Cell sizes in the following tests also vary between 1 byte and 256k in size.

Findings

If the dataset fits completely in cache, the default configuration, which uses the onheap LruBlockCache, performs best. GC is half that of the next most performant deploy type, CombinedBlockCache:Offheap, with at least 20% more throughput.

Otherwise, if your cache is experiencing churn running a steady stream of evictions, move your block cache offheap using CombinedBlockCache in the offheap mode. See the BlockCache section in the HBase Reference Guide for how to enable this deploy. Offheap mode requires only one third to one half of the GC of LruBlockCache when evictions are happening. There is also less variance as the cache misses climb when offheap is deployed. CombinedBlockCache:file mode has a better GC profile but less throughput than CombinedBlockCache:Offheap. Latency is slightly higher on CBC:Offheap -- heap objects to cache have to be serialized in and out of the offheap memory -- than with the default LruBlockCache, but the 95th/99th percentiles are slightly better. CBC:Offheap uses a bit more CPU than LruBlockCache. CombinedBlockCache:Offheap is limited only by the amount of RAM available. It is not subject to GC.

Test

The graphs to follow show results from five different deploys each run through four different loadings.

Five Deploy Variants

  1. LruBlockCache The default BlockCache in place when you start up an unconfigured HBase install. With LruBlockCache, all blocks are loaded into the java heap. See BlockCache 101 and LruBlockCache for detail on the caching policy of LruBlockCache.
  2. CombinedBlockCache:Offheap CombinedBlockCache deploys two tiers; an L1 which is an LruBlockCache instance to hold META blocks only (i.e. INDEX and BLOOM blocks), and an L2 tier which is an instance of BucketCache. In this offheap mode deploy, the BucketCache uses DirectByteBuffers to host a BlockCache outside of the JVM heap to host cached DATA blocks.
  3. CombinedBlockCache:Onheap In this onheap ('heap') mode, the L2 cache is hosted inside the JVM heap, and appears to the JVM to be a single large allocation. Internally it is managed by an instance of BucketCache. The L1 cache is an instance of LruBlockCache.
  4. CombinedBlockCache:file In this mode, an L2 BucketCache instance puts DATA blocks into a file (hence 'file' mode) on a mounted tmpfs in this case.
  5. CombinedBlockCache:METAonly No caching of DATA blocks (no L2 instance). DATA blocks are fetched every time. The INDEX blocks are loaded into an L1 LruBlockCache instance.

Memory is fixed for each deploy. The java heap is a small 8G to bring on distress earlier. For deploy type 1, the LruBlockCache is given 4G of the 8G JVM heap. For deploy types 2-5, the L1 LruBlockCache is 0.1 * 8G (~800MB), which is more than enough to host the dataset META blocks. This was confirmed by inspecting the Block Cache vitals displayed in the RegionServer UI. For deploy types 2, 3, and 4, the L2 bucket cache is 4G. Deploy type 5's L2 is 0G (we used HBASE-11581 Add option so CombinedBlockCache L2 can be null (fscache)).

Four Loading Types

  1. All cache hits all the time.
  2. A small percentage of cache misses, with all misses inside the fscache soon after the test starts.
  3. Lots of cache misses, all still inside the fscache soon after the test starts.
  4. Mostly cache misses where many misses by-pass the fscache.

For each loading, we ran 25 clients reading randomly over 10M cells of zipfian varied cell sizes from 1 byte to 256k bytes over 21 regions hosted by a single RegionServer for 20 minutes. Clients would stay inside the cache size for loading 1., miss the cache at just under ~50% for loading 2., and so on.

The dataset was hosted on a single RegionServer on a small HDFS cluster of 5 nodes. See below for detail on the setup.

Graphs

For each attribute -- GC, Throughput, i/o -- we have five charts across; one for each deploy. Each graph can be divided into four parts; the first quarter has the all-in-cache loading running for 20 minutes, followed by the loading that has some cache misses (for twenty minutes), through to the loading with mostly cache misses in the final quarter.

To find out what each color represents, read the legend.  The legend is small in the below. To see a larger version, browse to this non-apache-roller version of this document and click on the images over there.

Concentrate on the first four graph types.  The BlockCache Stats and I/O graphs add little other than confirming that the loadings ran the same across the different BlockCache deploys with fscache cutting in to soak up the seeks soon after start for all but the last mostly-cache-miss loading cases.

GC

Be sure to check the y-axis in the graphs below (you can click on chart to get a larger version). All profiles look basically the same but a check of the y-axis will show that for all but the no-cache-misses case, CombinedBlockCache:Offheap, the second deploy type, has the best GC profile (less GC is better!).

The GC for the CombinedBlockCache:Offheap deploy mode looks to be climbing as the test runs. See the Notes section at the end for further comment.


Throughput

CombinedBlockCache:Offheap is better unless there are few to no cache misses. In that case, LruBlockCache shines (I missed why there is the step in the LruBlockCache all-in-cache section of the graph).


Latency


Latency when in-cache

Same as above except that we do not show the last loading -- we show the first three only -- since the last loading of mostly cache misses skews the diagrams such that it is hard to compare latency when we are inside cache (BlockCache and fscache).

Load

Try aggregating the system and user CPUs when comparing.

BlockCache Stats

Curves are mostly the same except for the cache-no-DATA-blocks case. It has no L2 deployed.


I/O

Read profiles are about the same across all deploys and tests with a spike at first until the fscache comes around and covers. The exception is the final mostly-cache-misses case.

Setup

Master branch. 2.0.0-SNAPSHOT, r69039f8620f51444d9c62cfca9922baffe093610.  Hadoop 2.4.1-SNAPSHOT.

5 nodes all the same with 48G and six disks.  One master and one regionserver, each on distinct nodes, with 21 regions of 10M rows of zipf varied size -- 0 to 256k -- created as follows:

$ export HADOOP_CLASSPATH=/home/stack/conf_hbase:`./hbase-2.0.0-SNAPSHOT/bin/hbase classpath`
$ nohup  ./hadoop-2.4.1-SNAPSHOT/bin/hadoop --config /home/stack/conf_hadoop org.apache.hadoop.hbase.PerformanceEvaluation --valueSize=262144 --valueZipf --rows=100000 sequentialWrite 100 &
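(As a check on the numbers above: sequentialWrite 100 runs 100 clients and --rows=100000 loads 100,000 rows per client, i.e. the 10M rows mentioned earlier, with zipfian-varied value sizes capped by --valueSize=262144, i.e. 256k.)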

Here is my loading script.  It performs 4 loadings: All in cache, just out of cache, a good bit out of cache and then mostly out of cache:

[stack@c2020 ~]$ more bin/bc_in_mem.sh
#!/bin/sh
HOME=/home/stack
testtype=$1
date=`date -u +"%Y-%m-%dT%H:%M:%SZ"`
echo "testtype=$testtype $date" >> nohup.out
HBASE_HOME=$HOME/hbase-2.0.0-SNAPSHOT
runtime=1200
clients=25
cycles=1000000
#for i in 38 76 304 1000; do
for i in 32 72 144 1000; do
  echo "`date` run size=${i}, clients=$clients ; $testtype time=$runtime size=$i" >> nohup.out
  timeout $runtime nohup ${HBASE_HOME}/bin/hbase --config /home/stack/conf_hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --valueSize=110000 --size=$i --cycles=$cycles randomRead $clients
done
Java version:
[stack@c2020 ~]$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

I enabled caching options by uncommenting these configuration properties in hbase-site.xml. The hfile.block.cache.size was set to 0.5 to keep the math simple.
<!--LRU Cache-->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.5</value>
</property>

<!--Bucket cache-->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>heap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4196</value>
</property>
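(Sanity check on those values: with the 8G heap used in these tests, hfile.block.cache.size=0.5 gives the standalone LruBlockCache deploy its 4G, and hbase.bucketcache.size is in MB, so 4196 MB is roughly the 4G L2 bucket cache described above.)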

Notes

The CombinedBlockCache has some significant overhead (to be better quantified -- 10%?). There will be slop left over because buckets will not fill to the brim, especially when block sizes vary. This was observed while running the loadings.

I tried to bring on a FullGC by running for more than a day with LruBlockCache fielding mostly BlockCache misses and I failed. Thinking on it, the BlockCache only occupied 4G of an 8G heap. There was always elbow room available (free block count and maximum allocatable block size settled down to a nice number and held constant). TODO: Retry but with less elbow room available.

Longer running CBC offheap test

Some of the tests above show disturbing GC tendencies -- ever rising GC -- so I ran tests for a longer period. It turns out that BucketCache throws an OOME in long-running tests. You need HBASE-11678 BucketCache ramCache fills heap after running a few hours (fixed in hbase-0.98.6+). After the fix, the below test ran until the cluster was pulled out from under the loading:
[Figures: GC over three days; throughput over three days]
Here are some long running tests with bucket cache onheap (ioengine=heap). The throughput is less and GC is higher.
[Figures: GC over three days (onheap); throughput over three days (onheap)]


Future

Does LruBlockCache get more erratic as heap size grows?  Nick's post implies that it does.
Auto-sizing of L1, onheap cache.

HBASE-11331 [blockcache] lazy block decompression has a report attached that evaluates the current state of the attached patch to keep blocks compressed while in the BlockCache. It finds that with the patch enabled, there is "More GC. More CPU. Slower. Less throughput. But less i/o.", but there is an issue with compression block pooling that likely impinges on performance.

Review serialization in and out of L2 cache. No serialization?  Because we can take blocks from HDFS without copying onheap?
