Apache HBase

Tuesday May 12, 2015

The HBase Request Throttling Feature

Govind Kamat, HBase contributor and Cloudera Performance Engineer

(Edited on 5/14/2015 -- changed ops/sec/RS to ops/sec.)



Running multiple workloads on HBase has always been challenging, especially when trying to execute real-time workloads while concurrently running analytical jobs. One possible way to address this issue is to throttle analytical MR jobs so that real-time workloads are less affected.


A new QoS (quality of service) feature that Apache HBase 1.1 introduces is request throttling, which controls the rate at which requests get handled by an HBase cluster.  HBase typically treats all requests identically; however, the new throttling feature can be used to specify a maximum rate or bandwidth to override this behavior.  The limit may be applied to requests originating from a particular user, or alternatively, to requests directed to a given table or a specified namespace.


The objective of this post is to evaluate the effectiveness of this feature and the overhead it might impose on a running HBase workload.  The performance runs carried out showed that throttling works very well, by redirecting resources from a user whose workload is throttled to the workloads of other users, without incurring a significant overhead in the process.


Enabling Request Throttling

It is straightforward to enable the request-throttling feature -- all that is necessary is to set the HBase configuration parameter hbase.quota.enabled to true.  The related parameter hbase.quota.refresh.period specifies the time interval, in milliseconds, at which the regionserver re-checks for any new restrictions that have been added.
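
For reference, a minimal hbase-site.xml fragment enabling quotas might look like the sketch below; the 30-second refresh period is an illustrative value in line with the test setup described later, not a recommended default.

<property>
  <name>hbase.quota.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.quota.refresh.period</name>
  <!-- illustrative: re-read quota settings every 30 seconds -->
  <value>30000</value>
</property>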


The throttle can then be set from the HBase shell, like so:


hbase> set_quota TYPE => THROTTLE, USER => 'uname', LIMIT => '100req/sec'

hbase> set_quota TYPE => THROTTLE, TABLE => 'tbl', LIMIT => '10M/sec'

hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns', LIMIT => 'NONE'
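
The same throttles can also be set programmatically.  Below is a minimal sketch using the Java Admin API and the quota classes that ship with this feature in HBase 1.1+ (QuotaSettingsFactory, ThrottleType); the user, table, namespace and limit values mirror the hypothetical shell examples above.

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.quotas.QuotaSettingsFactory;
import org.apache.hadoop.hbase.quotas.ThrottleType;

public class SetThrottles {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Limit user 'uname' to 100 requests per second.
      admin.setQuota(QuotaSettingsFactory.throttleUser(
          "uname", ThrottleType.REQUEST_NUMBER, 100, TimeUnit.SECONDS));
      // Limit table 'tbl' to 10 MB of request data per second.
      admin.setQuota(QuotaSettingsFactory.throttleTable(
          TableName.valueOf("tbl"), ThrottleType.REQUEST_SIZE,
          10L * 1024 * 1024, TimeUnit.SECONDS));
      // Remove any throttle on namespace 'ns'.
      admin.setQuota(QuotaSettingsFactory.unthrottleNamespace("ns"));
    }
  }
}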


Test Setup

To evaluate how effectively HBase throttling worked, a YCSB workload was imposed on a 10-node cluster.  There were 6 regionservers and 2 master nodes.  YCSB clients were run on the 4 nodes that were not running regionserver processes.  The client processes were initiated by two separate users, and the workload issued by one of them was throttled.

More details on the test setup follow.


HBase version: HBase 0.98.6-cdh5.3.0-SNAPSHOT (HBASE-11598 was backported to this version)


Configuration:

CentOS release 6.4 (Final)                                                

CPU sockets: 2                                                            

Physical cores per socket: 6

Total number of logical cores: 24

Number of disks: 12

Memory: 64 GB

Number of RS: 6

Master nodes: 2  (for the Namenode, Zookeeper and HBase master)

Number of client nodes: 4

Number of rows: 1080M

Number of regions: 180

Row size: 1K

Threads per client: 40

Workload: read-only and scan

Key distribution: Zipfian

Run duration: 1 hour


Procedure

An initial data set was first generated by running YCSB in its data generation mode.  An HBase table was created with the table specifications above and pre-split.  After all the data was inserted, the table was flushed, compacted and saved as a snapshot.  This data set was used to prime the table for each run.  Read-only and scan workloads were used to evaluate performance; this eliminates effects such as memstore flushes and compactions.  One run with a long duration was carried out first to ensure the caches were warmed and that the runs yielded repeatable results.


For the purpose of these tests, the throttle was applied to the workload emanating from one user in a two-user scenario. There were four client machines used to impose identical read-only workloads.  The client processes on two machines were run by the user “jenkins”, while those on the other two were run as a different user.   The throttle was applied to the workload issued by this second user.  There were two sets of runs, one with both users running read workloads and the second where the throttled user ran a scan workload.  Typically, scans are long running and it can be desirable on occasion to de-prioritize them in favor of more real-time read or update workloads.  In this case, the scan was for sets of 100 rows per YCSB operation.


For each run, the following steps were carried out (sketched in HBase shell form after the list):

  • Any existing YCSB-related table was dropped.

  • The initial data set was cloned from the snapshot.

  • The desired throttle setting was applied.

  • The desired workloads were imposed from the client machines.

  • Throughput and latency data was collected and is presented in the table below.
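
In HBase shell terms, a run looked roughly like the sketch below; the table, snapshot and user names are hypothetical, and the throttle value varied per run.

hbase> disable 'usertable'
hbase> drop 'usertable'
hbase> clone_snapshot 'usertable-snap', 'usertable'
hbase> set_quota TYPE => THROTTLE, USER => 'user2', LIMIT => '2000req/sec'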


The throttle was applied at the start of the job (the command used was the first in the list shown in the “Enabling Request Throttling” section above).  The hbase.quota.refresh.period property was set to under a minute so that the throttle took effect by the time test setup was finished.

The throttle option specifically tested here was the one to limit the number of requests (rather than the one to limit bandwidth).


Observations and Results

The throttling feature appears to work quite well.  When applied interactively in the middle of a running workload, it goes into effect immediately after the quota refresh period and can be observed clearly in the throughput numbers put out by YCSB while the test is progressing.  The table below has performance data from test runs indicating the impact of the throttle.  For each row, the throughput and latency numbers are also shown in separate columns, one set for the “throttled” user (indicated by “T” for throttled) and the other for the “non-throttled” user (represented by “U” for un-throttled).


Read + Read Workload


Throttle (req/sec) | Avg Total Thruput (ops/sec) | Thruput_U (ops/sec) | Thruput_T (ops/sec) | Latency_U (ms) | Latency_T (ms)
none               | 7291                        | 3644                | 3644                | 21.9           | 21.9
2500               | 7098                        | 4700                | 2400                | 17.0           | 33.3
2000               | 7125                        | 4818                | 2204                | 16.6           | 34.7
1500               | 6990                        | 5126                | 1862                | 15.7           | 38.6
1000               | 7213                        | 5340                | 1772                | 14.9           | 42.7
500                | 6640                        | 5508                | 1136                | 14.0           | 70.2

[Charts: throughput and latency for the non-throttled (U) and throttled (T) users at each throttle setting]


As can be seen, when the throttle pressure is increased (by reducing the permitted throughput for user “T” from 2500 req/sec to 500 req/sec, as shown in column 1), the total throughput (column 2) stays around the same.  In other words, the cluster resources get redirected to benefit the non-throttled user, with the feature consuming no significant overhead.  One possible outlier is the case where the throttle parameter is at its most restrictive (500 req/sec), where the total throughput is about 10% less than the maximum cluster throughput.


Correspondingly, the latency for the non-throttled user improves while that for the throttled user degrades.  This is shown in the last two columns in the table.


The charts above show that the change in throughput is linear with the amount of throttling, for both the throttled and non-throttled user.  With regard to latency, the change is generally linear, until the throttle becomes very restrictive; in this case, latency for the throttled user degrades substantially.


One point that should be noted is that, while the throttle parameter in req/sec is indeed correlated with the actual restriction in throughput as reported by YCSB (ops/sec), as seen in the trend in column 4, the actual figures differ.  As user “T”’s throughput is restricted down from 2500 to 500 req/sec, the observed throughput goes down from 2400 ops/sec to 1136 ops/sec.  Therefore, users should calibrate the throttle to their workload to determine the appropriate figure to use (either req/sec or MB/sec) in their case.


Read + Scan Workload


Throttle (req/sec) | Thruput_U (ops/sec) | Thruput_T (ops/sec) | Latency_U (ms) | Latency_T (ms)
3000K              | 3810                | 690                 | 20.9           | 115
1000K              | 4158                | 630                 | 19.2           | 126
500K               | 4372                | 572                 | 18.3           | 139
250K               | 4556                | 510                 | 17.5           | 156
50K                | 5446                | 330                 | 14.7           | 242


[Chart: log-linear plot of throughput for the read and scan workloads as the throttle setting is adjusted]


With the read/scan workload, similar results are observed as in the read/read workload.  As the extent of throttling is increased for the long-running scan workload, the observed throughput decreases and latency increases.  Conversely, the read workload benefits, displaying better throughput and improved latency.  Again, the specific numeric value used to specify the throttle needs to be calibrated to the workload at hand.  Since scans break down into a large number of read requests, the throttle parameter needs to be much higher than in the case with the read workload.  Shown above is a log-linear chart of the impact on throughput of the two workloads when the extent of throttling is adjusted.

Conclusion

HBase request throttling is an effective and useful technique to handle multiple workloads, or even multi-tenant workloads on an HBase cluster.  A cluster administrator can choose to throttle long-running or lower-priority workloads, knowing that regionserver resources will get re-directed to the other workloads, without this feature imposing a significant overhead.  By calibrating the throttle to the cluster and the workload, the desired performance can be achieved on clusters running multiple concurrent workloads.

Friday May 01, 2015

Scan Improvements in HBase 1.1.0

Jonathan Lawlor, Apache HBase Contributor


Over the past few months there have been a variety of nice changes made to scanners in HBase. This post focuses on two such changes, namely RPC chunking (HBASE-11544) and scanner heartbeat messages (HBASE-13090). Both of these changes address long-standing issues in the client-server scan protocol. Specifically, RPC chunking deals with how a server handles the scanning of very large rows, and scanner heartbeat messages allow scan operations to make progress even when aggressive server-side filtering means results are returned infrequently.


Background

In order to discuss these issues, let's first gain a general understanding of how scans currently work in HBase.


From an application's point of view, a ResultScanner is the client-side source of all of the Results that an application asked for. When a client opens a scan, it is a ResultScanner that is returned, and it is against this object that the client calls next() to fetch more data. ResultScanners handle all communication with the RegionServers involved in a scan, and the ResultScanner decides which Results to make visible to the application layer. While there are various implementations of the ResultScanner interface, all implementations use basically the same protocol to communicate with the server.


In order to retrieve Results from the server, the ResultScanner issues ScanRequests via RPCs to the different RegionServers involved in a scan. A client configures a ScanRequest by passing an appropriately configured Scan instance when opening the scan, setting the start/stop rows, caching limits, the maximum result size limit, and the filters to apply.
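
As a concrete illustration of the client-side setup just described, here is a minimal sketch of opening a scan with start/stop rows, a caching limit, a max result size limit and a filter; the table name, row keys and limit values are made-up examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("tbl"))) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("row-0000"));   // inclusive start row
      scan.setStopRow(Bytes.toBytes("row-9999"));    // exclusive stop row
      scan.setCaching(100);                          // Results fetched per ScanRequest
      scan.setMaxResultSize(2L * 1024 * 1024);       // size limit per ScanResponse
      scan.setFilter(new PrefixFilter(Bytes.toBytes("row-")));
      // The ResultScanner issues the ScanRequest RPCs and caches the returned Results.
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
          // process one row's Result at a time
        }
      }
    }
  }
}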


On the server side, there are three main components involved in constructing the ScanResponse that will be sent back to the client in answer to a ScanRequest:


RSRpcServices

The RSRpcServices is a service that lives on RegionServers and responds to incoming RPC requests, such as ScanRequests. During a scan, the RSRpcServices is the server-side component responsible for constructing the ScanResponse that will be sent back to the client. The RSRpcServices continues to scan in order to accumulate Results until the region is exhausted, the table is exhausted, or a scan limit is reached (such as the caching limit or max result size limit). In order to retrieve these Results, the RSRpcServices must talk to a RegionScanner.


RegionScanner

The RegionScanner is the server-side component responsible for scanning the different rows in the region. In order to scan through the rows in the region, the RegionScanner will talk to one or more StoreScanner instances (one per column family in the row). If the row passes all filtering criteria, the RegionScanner will return the Cells for that row to the RSRpcServices so that they can be used to form a Result to include in the ScanResponse.


StoreScanner

The StoreScanner is the server side component responsible for scanning through the Cells in each column family.


When the client (i.e. ResultScanner) receives the ScanResponse back from the server, it can then decide whether or not scanning should continue. Since the client and the server communicate back and forth in chunks of Results, the client-side ResultScanner will cache all the Results it receives from the server. This allows the application to perform scans faster than the case where an RPC is required for each Result that the application sees.


RPC Chunking (HBASE-11544)

Why is it necessary?

Currently, the server sends back whole rows to the client (each Result contains all of the cells for that row). The max result size limit is only applied at row boundaries. After each full row is scanned, the size limit will be checked.


The problem with this approach is that it does not provide a granular enough restriction. Consider, for example, the scenario where each row being scanned is 100 MB. This means that 100 MB worth of data will be read between checks for the size limit. So, even in the case that the client has specified that the size limit is 1 MB, 100 MB worth of data will be read and then returned to the client.


This approach for respecting the size limit is problematic. First of all, it means that it is possible for the server to run out of memory and crash in the case that it must scan a large row. At the application level, you simply want to perform a scan without having to worry about crashing the region server.


This scenario is also problematic because it means that we do not have fine grained control over the network resources. The most efficient use of the network is to have the client and server talk back and forth in fixed sized chunks.


Finally, this approach for respecting the size limit is a problem because it can lead to large, erratic allocations server-side, playing havoc with GC.


Goal of the RPC Chunking solution

The goal of the RPC Chunking solution was to:

- Create a workflow that is more ‘regular’, less likely to cause Region Server distress

- Use the network more efficiently

- Avoid large garbage collections caused by allocating large blocks


Furthermore, we wanted the RPC chunking mechanism to be invisible to the application. The communication between the client and the server should not be a concern of the application. All the application wants is a Result for their scan. How that Result is retrieved is completely up to the protocol used between the client and the server.


Implementation Details

The first step in implementing this proposed RPC chunking method was to move the max result size limit to a place where it could be respected at a more granular level than on row boundaries. Thus, the max result size limit was moved down to the cell level within StoreScanner. This meant that now the max result size limit could be checked at the cell boundaries, and if in excess, the scan can return early.


The next step was to consider what would occur if the size limit was reached in the middle of a row. In this scenario, the Result sent back to the client will be marked as a "partial" Result. By marking the Result with this flag, the server communicates to the client that the Result is not a complete view of the row. Further Scan RPCs must be made in order to retrieve the outstanding parts of the row before it can be presented to the application. This allows the server to send back partial Results to the client and then the client can handle combining these partials into "whole" row Results. This relieves the server of the burden of having to read the entire row at a time.


Finally, these changes were performed in a backwards compatible manner. The client must indicate in its ScanRequest that partial Results are supported. If that flag is not seen server side, the server will not send back partial results. Note that the configuration hbase.client.scanner.max.result.size can be used to change the default chunk size. By default this value is 2 MB in HBase 1.1+.


An option (Scan#setAllowPartialResults) was also added so that an application can ask to see partial results as they are returned rather than wait on the aggregation of complete rows.
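
A minimal sketch of how an application might opt in, assuming HBase 1.1+ client classes (Result#isPartial() is the flag check described above); the Table instance is assumed to have been opened as in the earlier scan example.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class PartialResultScan {
  static void scanWithPartials(Table table) throws IOException {
    Scan scan = new Scan();
    scan.setAllowPartialResults(true);  // receive partial row Results as they arrive
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result result : scanner) {
        if (result.isPartial()) {
          // this Result holds only some of the row's Cells; more may follow
        }
        // process Cells here without waiting for the whole row to be assembled
      }
    }
  }
}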


A consistent view of the row is maintained even though it is assembled from multiple partial RPCs, because the server-side scanning context keeps track of the outstanding MVCC read point and will not include Cells written after it in the results.


Note that this feature will be available starting in HBase 1.1. All 1.1+ clients will chunk their responses.


Heartbeat messages for scans (HBASE-13090)

What is a heartbeat message?

A heartbeat message is a message used to keep the client-server connection alive. In the context of scans, a heartbeat message allows the server to communicate back to the client that scan progress has been made. On receipt, the client resets the connection timeout. Since the client and the server communicate back and forth in chunks of Results, it is possible that a single progress update will contain ‘enough’ Results to satisfy the requesting application’s purposes. So, beyond simply keeping the client-server connection alive and preventing scan timeouts, heartbeat messages also give the calling application an opportunity to decide whether or not more RPCs are even needed.


Why are heartbeat messages necessary in Scans?

Currently, scans execute server side until the table is exhausted, the region is exhausted, or a limit is reached. As such there is no way of influencing the actual execution time of a Scan RPC. The server continues to fetch Results with no regard for how long that process is taking. This is particularly problematic since each Scan RPC has a defined timeout. Thus, in the case that a particular scan is causing timeouts, the only solution is to increase the timeout so it spans the long-running requests (hbase.client.scanner.timeout.period). You may have encountered such problems if you have seen exceptions such as OutOfOrderScannerNextException.


Goal of heartbeat messages

The goals of the heartbeat message solution were to:

- Incorporate a time limit concept into the client-server Scan protocol

- The time limit must be enforceable at a granular level (between cells)

- Time limit enforcement should be tunable so that checks do not occur too often

Beyond simply meeting these goals, it is also important that, similar to RPC chunking, this mechanism is invisible to the application layer. The application simply wants to perform its scan and get a Result back on a call to ResultScanner.next(). It does not want to worry about whether or not the scan is going to take too long and whether it needs to adjust timeouts or scan sizes.


Implementation details

The first step in implementing heartbeat messages was to incorporate a time limit concept into the Scan RPC workflow. This time limit was based on the configured value of the client scanner timeout.


Once a time limit was defined, we had to decide where this time limit should be respected. It was decided that this time limit should be enforced within the StoreScanner. The reason for enforcing this limit inside the StoreScanner was that it allowed the time limit to be enforced at the cell boundaries. It is important that the time limit be checked at a fine-grained location because, in the case of restrictive filtering or time ranges, it is possible that large portions of time will be spent filtering out and skipping cells. If we wait to check the time limit at the row boundaries, it is possible, when the row is wide, that we may time out before a single check occurs.


A new configuration was introduced (hbase.cells.scanned.per.heartbeat.check) to control how often these time limit checks occur. The default value of this configuration is 10,000, meaning that we check the time limit every 10,000 cells.
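
For example, to check the time limit more often (a hypothetical tuning for scans over very wide rows with heavy filtering), the value could be lowered in hbase-site.xml:

<property>
  <name>hbase.cells.scanned.per.heartbeat.check</name>
  <!-- default is 10000; a lower value means more frequent time-limit checks -->
  <value>1000</value>
</property>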


Finally, if this time limit is reached, the ScanResponse sent back to the client is marked as a heartbeat message. It is important that the ScanResponse be marked in this way because it communicates to the client the reason the message was sent back from the server. Since a time limit may be reached before any Results are retrieved, it is possible that the heartbeat message's list of Results will be empty. We do not want the client to interpret this as meaning that the region was exhausted.


Note that similar to RPC chunking, this feature was implemented in a fully backwards compatible manner. In other words, heartbeat messages will only be sent back to the client in the case that the client has specified that it can recognize when a ScanResponse is a heartbeat message. Heartbeat messages will be available starting in HBase 1.1. All HBase 1.1+ clients will heartbeat.


Cleaning up the Scan API (HBASE-13441)

Following these recent improvements to the client-server Scan protocol, there is now an effort to try and clean up the Scan API. There are some APIs that are getting stale and don’t make much sense anymore, especially in light of the above changes. There are also some APIs that could use some love with regard to documentation.


For example, the caching and max result size APIs deal with how data is transferred between the client and the server. Both of these APIs now seem misplaced in the Scan API. These are details that should likely be controlled by the RegionServer rather than the application. It seems much more appropriate to give the RegionServer control over these parameters so that it can tune them based on the current state of the RPC pipeline and server load.


HBASE-13441 Scan API Improvements is the open umbrella issue covering ideas for Scan API improvements. If you have some time, check it out. Let us know if you have some ideas for improvements that you would like to see.


Other recent Scanner improvements

It’s important to note that these are not the only improvements that have been made to Scanners recently. Many other impressive improvements have come in, each deserving of its own blog post. For those that are interested, I have listed two:

  • HBASE-13109: Make better SEEK vs SKIP decisions during scanning

  • HBASE-13262: Fixed a data loss issue in Scans


Upshot

A spate of Scan improvements should make for a smoother Scan experience in HBase 1.1.
