Apache HBase

Wednesday Jun 10, 2015

Scalable Distributed Transactional Queues on Apache HBase

This is the first in a series of posts on "Why We Use Apache HBase", in which we let HBase users and developers borrow our blog so they can showcase their successful HBase use cases, talk about why they use HBase, and discuss what worked and what didn't.

Today's entry is a guest post by Terence Yim, a Software Engineer at Cask, responsible for designing and building realtime processing systems on Hadoop/HBase, originally published here on the Cask engineering blog.

- Andrew Purtell


A real-time stream processing framework usually involves two fundamental constructs: processors and queues. A processor reads events from a queue, executes user code to process them, and optionally writes events to another queue for additional downstream processors to consume. Queues are provided and managed by the framework. They transfer data and act as a buffer between processors, so that the processors can operate and scale independently. For example, a web server access log analytics application can look like this:

[Figure: queue_perf.png]

One key differentiator among various frameworks is the queue semantics, which commonly varies along these lines:

  • Delivery Guarantee: at-least-once, at-most-once, or exactly-once.
  • Fault Tolerance: failures are transparent to the user and recovery is automatic.
  • Durability: Data can survive failures and restarts.
  • Scalability: Characteristics and limitations when adding more producers/consumers.
  • Performance: Throughput and latency for queue operations.

In the open-source Cask Data Application Platform (CDAP), we wanted to provide a real-time stream processing framework that is dynamically scalable, strongly consistent, and has an exactly-once delivery guarantee. With such strong guarantees, developers are free to perform any form of data manipulation without worrying about inconsistency, potential reprocessing, or failure. This helps developers build big data applications even if they do not have a strong distributed systems background. Moreover, it is possible to relax these strong guarantees in exchange for higher performance if needed; that is always easier than going the other way around.

Queue Scalability

The basic operations that can be performed on a queue are enqueue and dequeue. Producers write data to the head of the queue (enqueue) and consumers read data from the tail of the queue (dequeue). We say a queue is scalable when adding more producers increases the overall enqueue rate and adding more consumers increases the overall dequeue rate. Ideally, the scaling is linear, meaning doubling the number of producers/consumers will double the rate of enqueue/dequeue, bounded only by the size of the cluster. In order to support linear scalability for producers, the queue needs to be backed by a storage system that scales linearly with the number of concurrent writers. For consumers to be linearly scalable, the queue can be partitioned such that each consumer only processes a subset of the queue data.

Another aspect of queue scalability is that it should scale horizontally. This means the upper bound of the queue performance can be increased by adding more nodes to the cluster. It is important because it makes sure that the queue can keep working regardless of cluster size and can keep up with the growth in data volume.

Partitioned HBase Queue

We chose Apache HBase as the storage layer for the queue. It is designed and optimized for strong row-level consistency and horizontal scalability. It provides very good concurrent write performance, and its support for ordered scans fits well with a partitioned consumer. We use HBase coprocessors for efficient scan filtering and queue cleanup. In order to have exactly-once semantics on the queue, we use Tephra’s transaction support for HBase.

Producers and consumers operate independently. Each producer enqueues by performing batch HBase Puts and each consumer dequeues by performing HBase Scans. There is no link between the number of producers and consumers and they can scale separately.

The queue has a notion of consumer groups. A consumer group is a collection of consumers partitioned by the same key such that each event published to the queue is consumed by exactly one consumer within the group. The use of consumer groups allows you to partition the same queue with different keys and to scale independently based on the operational characteristics of the data. Using the access log analytics example above, the producer and consumer groups might look like this:

[Figure: queue_groups.png]

There are two producers running for the Log Parser and they are writing to the queue concurrently. On the consuming side, there are two consumer groups. The Unique User Counter group has two consumers, using UserID as the partitioning key, and the Page View Counter group has three consumers, using PageID as the partitioning key.
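To make the partitioning concrete, here is a minimal sketch of how an event could be routed to exactly one consumer within a group by hashing its partitioning key. The class and method names are illustrative only, not the actual CDAP implementation.

// Sketch: route an event to exactly one consumer within a consumer group by
// hashing its partitioning key. Names are illustrative, not CDAP's real classes.
public final class PartitionRouter {

  /** Returns the consumer index (0..groupSize-1) that should process the event. */
  public static int consumerIndexFor(byte[] partitionKey, int groupSize) {
    int hash = java.util.Arrays.hashCode(partitionKey);
    // Mask the sign bit so the index is always non-negative.
    return (hash & Integer.MAX_VALUE) % groupSize;
  }

  public static void main(String[] args) {
    byte[] userId = "user-42".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    byte[] pageId = "page-7".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    // Unique User Counter group: two consumers partitioned by UserID.
    System.out.println("UserID routed to consumer " + consumerIndexFor(userId, 2));
    // Page View Counter group: three consumers partitioned by PageID.
    System.out.println("PageID routed to consumer " + consumerIndexFor(pageId, 3));
  }
}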

Queue rowkey format

Since an event emitted by a producer can be consumed by one or more consumer groups, we write the event to one or more rows in an HBase table, with one row designated for each consumer group. The event payload and metadata are stored in separate columns, while the row key follows this format:

[Figure: queue_row_key.png]

The two interesting parts of the row key are the Partition ID and the Entry ID. The Partition ID determines the row key prefix for a given consumer. This allows a consumer to read only the data it needs to process, using a prefix scan on the table during dequeue. The Partition ID consists of two parts: a Consumer Group ID and a Consumer ID. The producer computes one Partition ID per consumer group and writes to those rows on enqueue.

The Entry ID in the row key contains the transaction information. It consists of the producer transaction write pointer issued by Tephra and a monotonically increasing counter. The counter is generated locally by the producer and is needed to make the row key unique for the event, since a producer can enqueue more than one event within the same transaction.

On dequeue, the consumer uses the transaction write pointer to determine whether that queue entry has been committed and hence can be consumed. The row key is always unique because of the inclusion of the transaction write pointer and counter. This lets producers operate independently and never have write conflicts.
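To illustrate the layout described above, here is a rough sketch of how such a row key could be assembled using HBase's Bytes utility. The field widths and helper names are assumptions made for the example, not the exact CDAP encoding.

import org.apache.hadoop.hbase.util.Bytes;

// Sketch: assemble a queue row key of the form
//   [partition id = consumer group id + consumer id][entry id = tx write pointer + counter]
// Field types and widths are illustrative assumptions, not the exact CDAP encoding.
public final class QueueRowKey {

  /** Prefix a consumer uses for its prefix Scan on dequeue. */
  public static byte[] partitionPrefix(long groupId, int consumerId) {
    return Bytes.add(Bytes.toBytes(groupId), Bytes.toBytes(consumerId));
  }

  /** Full row key written by a producer on enqueue. */
  public static byte[] rowKey(long groupId, int consumerId, long txWritePointer, int counter) {
    byte[] entryId = Bytes.add(Bytes.toBytes(txWritePointer), Bytes.toBytes(counter));
    return Bytes.add(partitionPrefix(groupId, consumerId), entryId);
  }
}

A consumer's dequeue can then use partitionPrefix(...) as the scan prefix, and compare the write pointer encoded in the Entry ID against its transaction's visibility to decide whether an entry is consumable.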

In order to generate the Partition ID, a producer needs to know the size and the partitioning key of each consumer group. The consumer groups information is recorded transactionally when the application starts as well as when there are any changes in group size.

Changing producers and consumers

It is straightforward to increase or decrease the number of producers since each producer operates independently: adding or removing producer processes will do the job. However, when the size of a consumer group needs to change, coordination is needed to update the consumer group information correctly. The steps can be summarized by this diagram:

[Figure: queue_scale.png]

Pausing and resuming are fast operations as they are coordinated using Apache ZooKeeper and executed in parallel. For example, with the web access log analytics application we mentioned above, changing the consumer groups information may look something like this:

[Figure: queue_state.png]

With this queue design, the enqueue and dequeue performance is on par with batch HBase Puts and HBase Scans respectively, with some overhead for talking to the Tephra server. That overhead can be greatly reduced by batching multiple events in the same transaction.

Finally, to prevent “hotspotting”, we pre-split the HBase table based on the cluster size and apply salting on the row key to better distribute writes, which otherwise would have been sequential due to the monotonically increasing transaction write pointer.
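As a rough illustration of the salting idea (the actual salt width and bucket count used by CDAP may differ), a single salt byte derived from the rest of the key is enough to spread otherwise sequential keys across the pre-split regions:

import org.apache.hadoop.hbase.util.Bytes;

// Sketch: prepend a one-byte salt computed from the key so that row keys with a
// monotonically increasing write pointer are spread across SALT_BUCKETS regions.
public final class SaltedKey {
  static final int SALT_BUCKETS = 16; // illustrative; tune to the cluster/pre-split size

  public static byte[] salt(byte[] rowKey) {
    int bucket = (java.util.Arrays.hashCode(rowKey) & Integer.MAX_VALUE) % SALT_BUCKETS;
    return Bytes.add(new byte[] { (byte) bucket }, rowKey);
  }
}

The trade-off is that a prefix scan now has to be issued once per salt bucket, which is why the number of buckets is kept small.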

Performance Numbers

We’ve tested the performance on a small ten-node HBase cluster and the results are impressive. Using a 1 KB payload with a batch size of 500 events, we achieved a throughput of 100K events per second produced and consumed, running with three producers and ten consumers. We also observed that throughput increases linearly when we add more producers and consumers: for example, it increased to 200K events per second when we doubled the number of producers and consumers.

With the help of HBase and a combination of best practices, we successfully built a linearly scalable, distributed transactional queue system and used it to provide a real time stream processing framework in CDAP: dynamically scalable, strongly consistent, and with an exactly-once delivery guarantee.

Saturday Jun 06, 2015

Saving CPU! Using Native Hadoop Libraries for CRC computation in HBase

by Apekshit Sharma, HBase contributor and Cloudera Engineer


TL;DR Use the Hadoop native library for CRC computation and save CPU!

Checksums in HBase

Checksums are used to check data integrity. HDFS computes and stores checksums for all files on write. One checksum is written per chunk of data (the chunk size can be configured using bytes.per.checksum) in a separate, companion checksum file. When data is read back, the file with the corresponding checksums is read as well and is used to ensure data integrity. However, having two files results in two disk seeks when reading any chunk of data. For HBase, the extra seek while reading an HFileBlock results in extra latency. To work around the extra seek, HBase inlines checksums: it calculates checksums for the data in an HFileBlock and appends them to the end of the block itself on write to HDFS (HDFS then checksums the HBase data plus inline checksums). On read, HDFS checksum verification is turned off by default, and HBase itself verifies data integrity.



Can we then get rid of HDFS checksum altogether? Unfortunately no. While HBase can detect corruptions, it can’t fix them, whereas HDFS uses replication and a background process to detect and *fix* data corruptions if and when they happen. Since HDFS checksums generated at write-time are also available, we fall back to them when HBase verification fails for any reason. If the HDFS check fails too, the data is reported as corrupt.


The related HBase configurations are hbase.hstore.checksum.algorithm, hbase.hstore.bytes.per.checksum, and hbase.regionserver.checksum.verify. HBase inline checksums are enabled by default.


Calculating checksums is computationally expensive and consumes a lot of CPU. When HDFS switched over to JNI + C for computing checksums, it saw big reductions in CPU usage.


This post is about replicating those gains in HBase by using the Native Hadoop Library (NHL). See HBASE-11927.


Survey

We switched to the Hadoop DataChecksum library, which uses NHL under the hood if available and otherwise falls back to the Java CRC implementation. Another alternative considered was the ‘Circe’ library. The following comparison highlights the differences and makes the reasoning for our choice clear.


  • CRC support: NHL native code supports both crc32 and crc32c; Circe's native code supports only crc32c.
  • Dependencies: NHL adds a dependency on hadoop-common, which is reliable and actively developed; Circe adds a dependency on an external project.
  • Interface: NHL takes a stream of data, a stream of checksums, and a chunk size as parameters, and computes/verifies checksums considering the data in chunks; Circe only supports calculating a single checksum over all the input data.


Both libraries support use of the special x86 instruction for hardware calculation of CRC32C if available (defined in the SSE4.2 instruction set). In the case of NHL, hadoop-2.6.0 or a newer version is required for HBase to get the native checksum benefit.


However, based on the data layout of HFileBlock, which has ‘real data’ followed by checksums on the end, only NHL supported the interface we wanted. Implementing the same in Circe would have been a significant effort, so we chose to go with NHL.
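For reference, the kind of chunked interface being referred to looks roughly like the sketch below against org.apache.hadoop.util.DataChecksum (Hadoop 2.6+; method names should be checked against the version you run). It is not HBase's internal code, just an illustration of computing and verifying CRC32C over data in chunks.

import java.nio.ByteBuffer;
import org.apache.hadoop.util.DataChecksum;

// Sketch: compute and verify CRC32C checksums over a buffer in fixed-size chunks,
// the "data followed by checksums" style of interface HBase needed for HFileBlock.
public class ChunkedCrcExample {
  public static void main(String[] args) throws Exception {
    int bytesPerChecksum = 512;
    DataChecksum checksum =
        DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, bytesPerChecksum);

    ByteBuffer data = ByteBuffer.wrap(new byte[4096]); // the "real data"
    int numChunks = (data.remaining() + bytesPerChecksum - 1) / bytesPerChecksum;
    ByteBuffer sums = ByteBuffer.allocate(numChunks * checksum.getChecksumSize());

    checksum.calculateChunkedSums(data, sums);   // may use the native (NHL) code path
    data.rewind();
    sums.rewind();
    checksum.verifyChunkedSums(data, sums, "example-block", 0); // throws ChecksumException on mismatch
    System.out.println("checksums verified");
  }
}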

Setup

Since the metric to be evaluated was CPU usage, a simple configuration of two nodes was used. Node1 was configured to be the NameNode, ZooKeeper, and HBase Master. Node2 was configured to be the DataNode and RegionServer. All real computational work was done on Node2 while Node1 remained idle most of the time. This isolation of work on a single node made it easier to measure the impact on CPU usage.


Configuration

Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-24-generic x86_64)

CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz                                          

Socket(s) : 1

Core(s) per socket : 1

Thread(s) per core : 4

Logical CPU(s) : 4

Number of disks : 1

Memory : 8 GB

HBase Version/Distro *: 1.0.0 / CDH 5.4.0


*Since trunk resolves to hadoop-2.5.1 which does not have HDFS-6865, it was easier to use a CDH distro which already has HDFS-6865 backported.

Procedure

We chose to study the impact on major compactions, mainly because of the presence of CompactionTool, which a) can be used offline, and b) allowed us to profile only the relevant workload. PerformanceEvaluation (bin/hbase pe) was used to build a test table which was then copied to local disk for reuse.


% ./bin/hbase pe --nomapred --rows=150000 --table="t1" --valueSize=10240 --presplit=10 sequentialWrite 10


Table size: 14.4G

Number of rows: 1.5M

Number of regions: 10

Row size: 10K

Total store files across regions: 67


For profiling, Lightweight-java-profiler was used and FlameGraph was used to generate graphs.


For benchmarking, the Linux ‘time’ command was used. Profiling was disabled during these runs. A script repeatedly executed the following steps in order:

  1. delete hdfs:///hbase

  2. copy t1 from local disk to hdfs:///hbase/data/default

  3. run compaction tool on t1 and time it

Observations


Profiling


CPU profiling of HBase without NHL (figure 1) shows that about 22% of CPU is used for generating and validating checksums, whereas with NHL (figure 2) it takes only about 3%.



Figure 1: CPU profile - HBase not using NHL (svg)


Figure 2: CPU profile - HBase using NHL (svg)


Benchmarking

Benchmarking was done for three different cases: (a) neither HBase nor HDFS use NHL, (b) HDFS uses NHL but not HBase, and (c) both HDFS and HBase use NHL. For each case, we did 5 runs. Observations from table 1:

  1. Within a case, while real time fluctuates across runs, the user and sys times remain the same. This is expected, as compactions are IO bound.

  2. Using NHL only for HDFS reduces CPU usage by about 10% (A vs B)

  3. Further, using NHL for HBase checksums reduces CPU usage by about 23% (B vs C).


All times are in seconds. This Stack Overflow answer provides a good explanation of real, user and sys times.


run #   no native for HDFS and HBase (A)      no native for HBase (B)               native (C)
1       real 469.4  user 110.8  sys 30.5      real 422.9  user  95.4  sys 30.5      real 414.6  user 67.5  sys 30.6
2       real 384.3  user 111.4  sys 30.4      real 400.5  user  96.7  sys 30.5      real 393.8  user 67.6  sys 30.6
3       real 400.7  user 111.5  sys 30.6      real 398.6  user  95.8  sys 30.6      real 392.0  user 66.9  sys 30.5
4       real 396.8  user 111.1  sys 30.3      real 379.5  user  96.0  sys 30.4      real 390.8  user 67.2  sys 30.5
5       real 389.1  user 111.6  sys 30.3      real 377.4  user  96.5  sys 30.4      real 381.3  user 67.6  sys 30.5

Table 1

[Chart: times.png]

Conclusion

The Native Hadoop Library leverages the special processor instruction (when available) that does pipelining and other low-level optimizations when performing CRC calculations. Using NHL in HBase for heavy checksum computation allows HBase to make use of this facility, saving a significant amount of CPU time spent checksumming.

Tuesday May 12, 2015

The HBase Request Throttling Feature

Govind Kamat, HBase contributor and Cloudera Performance Engineer

(Edited on 5/14/2015 -- changed ops/sec/RS to ops/sec.)



Running multiple workloads on HBase has always been challenging, especially when trying to execute real-time workloads while concurrently running analytical jobs. One possible way to address this issue is to throttle analytical MR jobs so that real-time workloads are less affected.


A new QoS (quality of service) feature that Apache HBase 1.1 introduces is request throttling, which controls the rate at which requests get handled by an HBase cluster.  HBase typically treats all requests identically; however, the new throttling feature can be used to specify a maximum rate or bandwidth to override this behavior.  The limit may be applied to requests originating from a particular user, or alternatively, to requests directed to a given table or a specified namespace.


The objective of this post is to evaluate the effectiveness of this feature and the overhead it might impose on a running HBase workload.  The performance runs carried out showed that throttling works very well, by redirecting resources from a user whose workload is throttled to the workloads of other users, without incurring a significant overhead in the process.


Enabling Request Throttling

It is straightforward to enable the request-throttling feature -- all that is necessary is to set the HBase configuration parameter hbase.quota.enabled to true.  The related parameter hbase.quota.refresh.period specifies the time interval in milliseconds at which the regionserver should re-check for any new restrictions that have been added.


The throttle can then be set from the HBase shell, like so:


hbase> set_quota TYPE => THROTTLE, USER => 'uname', LIMIT => '100req/sec'

hbase> set_quota TYPE => THROTTLE, TABLE => 'tbl', LIMIT => '10M/sec'

hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns', LIMIT => 'NONE'
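The same throttles can also be set programmatically. A minimal sketch against the HBase 1.1 client quota API might look like the following (signatures should be double-checked against your release):

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.quotas.QuotaSettingsFactory;
import org.apache.hadoop.hbase.quotas.ThrottleType;

// Sketch: programmatic equivalent of the first shell command above (HBase 1.1 quotas API).
public class SetThrottle {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Limit user 'uname' to 100 requests per second.
      admin.setQuota(QuotaSettingsFactory.throttleUser(
          "uname", ThrottleType.REQUEST_NUMBER, 100, TimeUnit.SECONDS));
      // To remove the throttle later:
      // admin.setQuota(QuotaSettingsFactory.unthrottleUser("uname"));
    }
  }
}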


Test Setup

To evaluate how effectively HBase throttling worked, a YCSB workload was imposed on a 10 node cluster.  There were 6 regionservers and 2 master nodes.  YCSB clients were run on the 4 nodes that were not running regionserver processes.  The client processes were initiated by two separate users and the workload issued by one of them was throttled.

More details on the test setup follow.


HBase version: HBase 0.98.6-cdh5.3.0-SNAPSHOT (HBASE-11598 was backported to this version)


Configuration:

CentOS release 6.4 (Final)                                                

CPU sockets: 2                                                            

Physical cores per socket: 6

Total number of logical cores: 24

Number of disks: 12

Memory: 64 GB

Number of RS: 6

Master nodes: 2  (for the Namenode, Zookeeper and HBase master)

Number of client nodes: 4

Number of rows: 1080M

Number of regions: 180

Row size: 1K

Threads per client: 40

Workload: read-only and scan

Key distribution: Zipfian

Run duration: 1 hour


Procedure

An initial data set was first generated by running YCSB in its data generation mode.  An HBase table was created with the table specifications above and pre-split.  After all the data was inserted, the table was flushed, compacted and saved as a snapshot.  This data set was used to prime the table for each run.  Read-only and scan workloads were used to evaluate performance; this eliminates effects such as memstore flushes and compactions.  One run with a long duration was carried out first to ensure the caches were warmed and that the runs yielded repeatable results.


For the purpose of these tests, the throttle was applied to the workload emanating from one user in a two-user scenario. There were four client machines used to impose identical read-only workloads.  The client processes on two machines were run by the user “jenkins”, while those on the other two were run as a different user.   The throttle was applied to the workload issued by this second user.  There were two sets of runs, one with both users running read workloads and the second where the throttled user ran a scan workload.  Typically, scans are long running and it can be desirable on occasion to de-prioritize them in favor of more real-time read or update workloads.  In this case, the scan was for sets of 100 rows per YCSB operation.


For each run, the following steps were carried out:

  • Any existing YCSB-related table was dropped.

  • The initial data set was cloned from the snapshot.

  • The desired throttle setting was applied.

  • The desired workloads were imposed from the client machines.

  • Throughput and latency data was collected and is presented in the table below.


The throttle was applied at the start of the job (the command used was the first in the list shown in the “Enabling Request Throttling” section above).  The hbase.quota.refresh.period property was set to under a minute so that the throttle took effect by the time test setup was finished.

The throttle option specifically tested here was the one to limit the number of requests (rather than the one to limit bandwidth).


Observations and Results

The throttling feature appears to work quite well.  When applied interactively in the middle of a running workload, it goes into effect immediately after the quota refresh period and can be observed clearly in the throughput numbers put out by YCSB while the test is progressing.  The table below has performance data from test runs indicating the impact of the throttle.  For each row, the throughput and latency numbers are shown in separate columns, one set for the “throttled” user (indicated by “T” for throttled) and the other for the “non-throttled” user (represented by “U” for un-throttled).


Read + Read Workload


Throttle (req/sec)   Avg Total Thruput (ops/sec)   Thruput_U (ops/sec)   Thruput_T (ops/sec)   Latency_U (ms)   Latency_T (ms)
none                 7291                          3644                  3644                  21.9             21.9
2500 rps             7098                          4700                  2400                  17               33.3
2000 rps             7125                          4818                  2204                  16.6             34.7
1500 rps             6990                          5126                  1862                  15.7             38.6
1000 rps             7213                          5340                  1772                  14.9             42.7
500 rps              6640                          5508                  1136                  14               70.2

[Charts: throughput and latency vs. throttle setting]


As can be seen, when the throttle pressure is increased (by reducing the permitted throughput for user “T” from 2500 req/sec to 500 req/sec, as shown in column 1), the total throughput (column 2) stays around the same.  In other words, the cluster resources get redirected to benefit the non-throttled user, with the feature consuming no significant overhead.  One possible outlier is the case where the throttle parameter is at its most restrictive (500 req/sec), where the total throughput is about 10% less than the maximum cluster throughput.


Correspondingly, the latency for the non-throttled user improves while that for the throttled user degrades.  This is shown in the last two columns in the table.


The charts above show that the change in throughput is linear with the amount of throttling, for both the throttled and non-throttled user.  With regard to latency, the change is generally linear, until the throttle becomes very restrictive; in this case, latency for the throttled user degrades substantially.


One point that should be noted is that, while the throttle parameter in req/sec is indeed correlated to the actual restriction in throughput as reported by YCSB (ops/sec) as seen by the trend in column 4, the actual figures differ.  As user “T”’s throughput is restricted down from 2500 to 500 req/sec, the observed throughput goes down from 2500 ops/sec to 1136 ops/sec.  Therefore, users should calibrate the throttle to their workload to determine the appropriate figure to use (either req/sec or MB/sec) in their case.


Read + Scan Workload


Throttle (req/sec)   Thruput_U (ops/sec)   Thruput_T (ops/sec)   Latency_U (ms)   Latency_T (ms)
3000 Krps            3810                  690                   20.9             115
1000 Krps            4158                  630                   19.2             126
500 Krps             4372                  572                   18.3             139
250 Krps             4556                  510                   17.5             156
50 Krps              5446                  330                   14.7             242


[Chart: log-linear plot of throughput vs. throttle setting for the read and scan workloads]


With the read/scan workload, similar results are observed as in the read/read workload.  As the extent of throttling is increased for the long-running scan workload, the observed throughput decreases and latency increases.  Conversely, the read workload benefits, displaying better throughput and improved latency.  Again, the specific numeric value used to specify the throttle needs to be calibrated to the workload at hand.  Since scans break down into a large number of read requests, the throttle parameter needs to be much higher than in the case with the read workload.  Shown above is a log-linear chart of the impact on throughput of the two workloads as the extent of throttling is adjusted.

Conclusion

HBase request throttling is an effective and useful technique to handle multiple workloads, or even multi-tenant workloads on an HBase cluster.  A cluster administrator can choose to throttle long-running or lower-priority workloads, knowing that regionserver resources will get re-directed to the other workloads, without this feature imposing a significant overhead.  By calibrating the throttle to the cluster and the workload, the desired performance can be achieved on clusters running multiple concurrent workloads.

Friday May 01, 2015

Scan Improvements in HBase 1.1.0

Jonathan Lawlor, Apache HBase Contributor


Over the past few months there have been a variety of nice changes made to scanners in HBase. This post focuses on two such changes, namely RPC chunking (HBASE-11544) and scanner heartbeat messages (HBASE-13090). Both of these changes address long-standing issues in the client-server scan protocol. Specifically, RPC chunking deals with how a server handles the scanning of very large rows, and scanner heartbeat messages allow scan operations to make progress even when aggressive server-side filtering makes result returns infrequent.


Background

In order to discuss these issues, let's first gain a general understanding of how scans currently work in HBase.


From an application's point of view, a ResultScanner is the client side source of all of the Results that an application asked for. When a client opens a scan, it’s a ResultScanner that is returned and it is against this object that the client invokes next to fetch more data. ResultScanners handle all communication with the RegionServers involved in a scan and the ResultScanner decides which Results to make visible to the application layer. While there are various implementations of the ResultScanner interface, all implementations use basically the same protocol to communicate with the server.


In order to retrieve Results from the server, the ResultScanner issues ScanRequests via RPCs to the different RegionServers involved in a scan. A client configures a ScanRequest by passing an appropriately configured Scan instance when opening the scan, setting the start/stop rows, caching limits, the maximum result size limit, and the filters to apply.
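For context, a typical client-side scan that sets these knobs looks roughly like the sketch below (standard HBase 1.x client API; the table name and row boundaries are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: how a client typically configures the Scan that backs the ScanRequests
// described above. "t1" and the row boundaries are made up for the example.
public class ScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("row-000"));   // start/stop rows
      scan.setStopRow(Bytes.toBytes("row-999"));
      scan.setCaching(100);                         // rows fetched per RPC
      scan.setMaxResultSize(2 * 1024 * 1024);       // max bytes per ScanResponse
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {             // next() may be served from the client-side cache
          // process result ...
        }
      }
    }
  }
}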


On the server side, there are three main components involved in constructing the ScanResponse that will be sent back to the client in answer to a ScanRequest:


RSRpcServices

The RSRpcServices is a service that lives on RegionServers and responds to incoming RPC requests, such as ScanRequests. During a scan, RSRpcServices is the server-side component responsible for constructing the ScanResponse that will be sent back to the client. RSRpcServices continues to scan in order to accumulate Results until the region is exhausted, the table is exhausted, or a scan limit is reached (such as the caching limit or the max result size limit). In order to retrieve these Results, RSRpcServices must talk to a RegionScanner.


RegionScanner

The RegionScanner is the server-side component responsible for scanning the different rows in the region. In order to scan through the rows in the region, the RegionScanner talks to one or more StoreScanner instances (one per column family in the row). If the row passes all filtering criteria, the RegionScanner returns the Cells for that row to RSRpcServices so that they can be used to form a Result to include in the ScanResponse.


StoreScanner

The StoreScanner is the server side component responsible for scanning through the Cells in each column family.


When the client (i.e. ResultScanner) receives the ScanResponse back from the server, it can then decide whether or not scanning should continue. Since the client and the server communicate back and forth in chunks of Results, the client-side ResultScanner will cache all the Results it receives from the server. This allows the application to perform scans faster than the case where an RPC is required for each Result that the application sees.


RPC Chunking (HBASE-11544)

Why is it necessary?

Currently, the server sends back whole rows to the client (each Result contains all of the cells for that row). The max result size limit is only applied at row boundaries. After each full row is scanned, the size limit will be checked.


The problem with this approach is that it does not provide a granular enough restriction. Consider, for example, the scenario where each row being scanned is 100 MB. This means that 100 MB worth of data will be read between checks for the size limit. So, even in the case that the client has specified that the size limit is 1 MB, 100 MB worth of data will be read and then returned to the client.


This approach for respecting the size limit is problematic. First of all, it means that it is possible for the server to run out of memory and crash in the case that it must scan a large row. At the application level, you simply want to perform a scan without having to worry about crashing the region server.


This scenario is also problematic because it means that we do not have fine grained control over the network resources. The most efficient use of the network is to have the client and server talk back and forth in fixed sized chunks.


Finally, this approach for respecting the size limit is a problem because it can lead to large, erratic allocations server side playing havoc with GC.


Goal of the RPC Chunking solution

The goal of the RPC Chunking solution was to:

- Create a workflow that is more ‘regular’, less likely to cause Region Server distress

- Use the network more efficiently

- Avoid large garbage collections caused by allocating large blocks


Furthermore, we wanted the RPC chunking mechanism to be invisible to the application. The communication between the client and the server should not be a concern of the application. All the application wants is a Result for their scan. How that Result is retrieved is completely up to the protocol used between the client and the server.


Implementation Details

The first step in implementing this proposed RPC chunking method was to move the max result size limit to a place where it could be respected at a more granular level than on row boundaries. Thus, the max result size limit was moved down to the cell level within StoreScanner. This meant that now the max result size limit could be checked at the cell boundaries, and if in excess, the scan can return early.


The next step was to consider what would occur if the size limit was reached in the middle of a row. In this scenario, the Result sent back to the client will be marked as a "partial" Result. By marking the Result with this flag, the server communicates to the client that the Result is not a complete view of the row. Further Scan RPC's must be made in order to retrieve the outstanding parts of the row before it can be presented to the application. This allows the server to send back partial Results to the client and then the client can handle combining these partials into "whole" row Results. This relieves the server of the burden of having to read the entire row at a time.


Finally, these changes were performed in a backwards compatible manner. The client must indicate in its ScanRequest that partial Results are supported. If that flag is not seen server side, the server will not send back partial results. Note that the configuration hbase.client.scanner.max.result.size can be used to change the default chunk size. By default this value is 2 MB in HBase 1.1+.


An option (Scan#setAllowPartialResults) was also added so that an application can ask to see partial results as they are returned rather than wait on the aggregation of complete rows.
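A rough sketch of opting in from the application side (HBase 1.1+ client API; the table name is made up for the example):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

// Sketch: ask to see partial Results as they arrive instead of waiting for the
// client to reassemble whole rows -- useful when rows are very wide.
public class PartialResultsExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("wide_table"))) {
      Scan scan = new Scan();
      scan.setAllowPartialResults(true);  // Results may now contain only part of a row
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
          if (result.isPartial()) {
            // this Result holds only some of the row's cells; more pieces will follow
          }
          // process cells ...
        }
      }
    }
  }
}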


A consistent view of the row is maintained even though the row is assembled from multiple partial-result RPCs, because the server-side scanning context keeps track of the outstanding mvcc read point and will not include in the results any Cells written later.


Note that this feature will be available starting in HBase 1.1. All 1.1+ clients will chunk their responses.


Heartbeat messages for scans (HBASE-13090)

What is a heartbeat message?

A heartbeat message is a message used to keep the client-server connection alive. In the context of scans, a heartbeat message allows the server to communicate back to the client that scan progress has been made. On receipt, the client resets the connection timeout. Since the client and the server communicate back and forth in chunks of Results, it is possible that a single progress update will contain ‘enough’ Results to satisfy the requesting application’s purposes. So, beyond simply keeping the client-server connection alive and preventing scan timeouts, heartbeat messages also give the calling application an opportunity to decide whether or not more RPC's are even needed.


Why are heartbeat messages necessary in Scans?

Currently, scans execute server side until the table is exhausted, the region is exhausted, or a limit is reached. As such there is no way of influencing the actual execution time of a Scan RPC. The server continues to fetch Results with no regard for how long that process is taking. This is particularly problematic since each Scan RPC has a defined timeout. Thus, in the case that a particular scan is causing timeouts, the only solution is to increase the timeout so it spans the long-running requests (hbase.client.scanner.timeout.period). You may have encountered such problems if you have seen exceptions such as OutOfOrderScannerNextException.


Goal of heartbeat messages

The goal of the heartbeat message solution was to:

- Incorporate a time limit concept into the client-server Scan protocol

- The time limit must be enforceable at a granular level (between cells)

- Time limit enforcement should be tunable so that checks do not occur too often

Beyond simply meeting these goals, it is also important that, similar to RPC chunking, this mechanism is invisible to the application layer. The application simply wants to perform its scan and get a Result back on a call to ResultScanner.next(). It does not want to worry about whether or not the scan is going to take too long and if it needs to adjust timeouts or scan sizings.


Implementation details

The first step in implementing heartbeat messages was to incorporate a time limit concept into the Scan RPC workflow. This time limit was based on the configured value of the client scanner timeout.


Once a time limit was defined, we had to decide where this time limit should be respected. It was decided that this time limit should be enforced within the StoreScanner. The reason for enforcing this limit inside the StoreScanner was that it allowed the time limit to be enforced at cell boundaries. It is important that the time limit be checked at a fine-grained location because, in the case of restrictive filtering or time ranges, it is possible that large portions of time will be spent filtering out and skipping cells. If we waited to check the time limit at row boundaries, it is possible that with a wide row we would time out before a single check occurs.


A new configuration was introduced (hbase.cells.scanned.per.heartbeat.check) to control how often these time limit checks occur. The default value of this configuration is 10,000 meaning that we check the time limit every 10,000 cells.


Finally, if this time limit is reached, the ScanResponse sent back to the client is marked as a heartbeat message. It is important that the ScanResponse be marked in this way because it communicates to the client the reason the message was sent back from the server. Since a time limit may be reached before any Results are retrieved, it is possible that the heartbeat message's list of Results will be empty. We do not want the client to interpret this as meaning that the region was exhausted.


Note that similar to RPC chunking, this feature was implemented in a fully backwards compatible manner. In other words, heartbeat messages will only be sent back to the client in the case that the client has specified that it can recognize when a ScanResponse is a heartbeat message. Heartbeat messages will be available starting in HBase 1.1. All HBase 1.1+ clients will heartbeat.


Cleaning up the Scan API (HBASE-13441)

Following these recent improvements to the client-server Scan protocol, there is now an effort to try and clean up the Scan API. There are some API’s that are getting stale and don’t make much sense anymore, especially in light of the above changes. There are also some API’s that could use some love with regards to documentation.


For example the caching and max result size API’s deal with how data is transferred between the client and the server. Both of these API’s now seem misplaced in the Scan API. These are details that should likely be controlled by the RegionServer rather than the application. It seems much more appropriate to give the RegionServer control over these parameters so that it can tune them based on the current state of the RPC pipeline and server loadings.


HBASE-13441 Scan API Improvements is the open umbrella issue covering ideas for Scan API improvements. If you have some time, check it out. Let us know if you have some ideas for improvements that you would like to see.


Other recent Scanner improvements

It’s important to note that these are not the only improvements that have been made to scanners recently. Many other impressive improvements have come through, each deserving of its own blog post. For those who are interested, I have listed two:

  • HBASE-13109: Make better SEEK vs SKIP decisions during scanning

  • HBASE-13262: Fixed a data loss issue in Scans


Upshot

A spate of Scan improvements should make for a smoother Scan experience in HBase 1.1.

Thursday Apr 16, 2015

Come to HBaseCon2015!

Come learn about (hbasecon.com):

  • A highly-trafficked HBase cluster with an uptime of sixteen months
  • An HBase deploy that spans three datacenters doing master-master replication between thousands of HBase nodes in each
  • Some nuggets on how Bigtable does it (and HBase could too)
  • How HBase is being used to record patient telemetry 24/7 on behalf of the Michael J. Fox Foundation to enable breakthroughs in Parkinson's disease research
  • How Pinterest and Microsoft are doing it in the cloud, how FINRA and Bloomberg are doing it out east, and how Rocketfuel, Yahoo! and Flipboard, etc., are representing the west

HBaseCon 2015 is the fourth annual edition of the community event for Apache HBase contributors, developers, admins, and users of all skill levels. The event is hosted and organized by Cloudera, with the Program Committee including leaders from across the HBase community. The conference will be one full day comprising general sessions in the morning, breakout sessions in the morning and afternoon, and networking opportunities throughout the day.

The agenda has been posted here, hbasecon2015 agenda. You can register here.

Yours,

The HBaseCon Program Committee 

Thursday Mar 05, 2015

HBase ZK-less Region Assignment

ZK-less region assignment allows us to achieve greater scale as well as faster startups and assignment. It is simpler, has less code, and improves the speed at which assignments run, so we can do faster rolling restarts. This feature will be on by default in HBase 2.0.0.

Tuesday Feb 24, 2015

Start of a new era: Apache HBase 1.0

Past, present and future state of the community

Author: Enis Söztutar, Apache HBase PMC member and HBase-1.0.0 release manager

The Apache HBase community has released Apache HBase 1.0.0. Seven years in the making, it marks a major milestone in the Apache HBase project’s development, offers some exciting features and new API’s without sacrificing stability, and is both on-wire and on-disk compatible with HBase 0.98.x.

In this blog, we look at the past, present and future of Apache HBase project. 

Versions, versions, versions 

Before enumerating feature details of this release let’s take a journey into the past and how release numbers emerged. HBase started its life as a contrib project in a subdirectory of Apache Hadoop, circa 2007, and released with Hadoop. Three years later, HBase became a standalone top-level Apache project. Because HBase depends on HDFS, the community ensured that HBase major versions were identical and compatible with Hadoop’s major version numbers. For example, HBase 0.19.x worked with Hadoop 0.19.x, and so on.


However, the HBase community wanted to ensure that an HBase version can work with multiple Hadoop versions, not only with its matching major release number. Thus, a new naming scheme was invented where the releases would start at the close-to-1.0 major version of 0.90, as shown in the timeline above. We also took on an even-odd release number convention where releases with odd version numbers were “developer previews” and even-numbered releases were “stable” and ready for production. The stable release series included 0.90, 0.92, 0.94, 0.96 and 0.98 (see HBase Versioning for an overview).

After 0.98, we named the trunk version 0.99-SNAPSHOT, but we officially ran out of numbers! Levity aside, last year, the HBase community agreed that the project had matured and stabilized enough such that a 1.0.0 release was due. After three releases in the 0.99.x series of “developer previews” and six Apache HBase 1.0.0 release candidates, HBase 1.0.0 has now shipped! See the above diagram, courtesy of Lars George, for a timeline of releases. It shows each release line together with the support lifecycle, and any previous developer preview releases if any (0.99->1.0.0 for example).

HBase-1.0.0, start of a new era

The 1.0.0 release has three goals:

1) to lay a stable foundation for future 1.x releases;

2) to stabilize running HBase clusters and their clients; and

3) to make versioning and compatibility dimensions explicit.

Including previous 0.99.x releases, 1.0.0 contains over 1500 jiras resolved. Some of the major changes are: 

API reorganization and changes

HBase’s client-level API has evolved over the years. To simplify the semantics and to make the API extensible and easier to use in the future, we revisited it before 1.0. To that end, 1.0.0 introduces new APIs and deprecates some of the commonly-used client-side APIs (HTableInterface, HTable and HBaseAdmin).

We advise you to update your application to use the new style of APIs, since deprecated APIs will be removed in the future 2.x series of releases. For further guidance, please visit these two decks: http://www.slideshare.net/xefyr/apache-hbase-10-release and http://s.apache.org/hbase-1.0-api.
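In practice, the new-style client code looks roughly like the sketch below (Connection, Table and Admin obtained from a ConnectionFactory, instead of constructing HTable or HBaseAdmin directly). The table and column names are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: the 1.0-style client API. The Connection is heavyweight and meant to be
// shared; Table and Admin are lightweight, per-use objects.
public class NewApiExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("my_table"); // assumes this table (family "cf") exists
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin();
         Table table = connection.getTable(name)) {
      if (!admin.tableExists(name)) {
        throw new IllegalStateException("create my_table with family cf first");
      }
      table.put(new Put(Bytes.toBytes("row1"))
          .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value")));
      Result r = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))));
    }
  }
}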

All client-side APIs are marked with the InterfaceAudience.Public annotation, indicating whether a class/method is an official "client API" for HBase (see “11.1.1. HBase API Surface” in the HBase Refguide for more details on the audience annotations). Going forward, all 1.x releases are planned to be API compatible for classes annotated as client public.

Read availability using timeline consistent region replicas

As part of phase 1, this release contains an experimental "Read availability using timeline consistent region replicas" feature. That is, a region can be hosted in multiple region servers in read-only mode. One of the replicas for the region will be primary, accepting writes, and other replicas will share the same data files. Read requests can be done against any replica for the region with backup RPCs for high availability with timeline consistency guarantees. See JIRA HBASE-10070 for more details.

Online config change and other forward ports from 0.89-fb branch

The 0.89-fb branch in Apache HBase was where Facebook used to post their changes. HBASE-12147 JIRA forward ported the patches which enabled reloading a subset of the server configuration without having to restart the region servers.

Apart from the above, there are hundreds of improvements, performance (improved WAL pipeline, using disruptor, multi-WAL, more off-heap, etc) and bug fixes and other goodies that are too long to list here. Check out the official release notes for a detailed overview. The release notes and the book also cover binary, source and wire compatibility requirements, supported Hadoop and Java versions, upgrading from 0.94, 0.96 and 0.98 versions and other important details.

HBase-1.0.0 is also the start of using “semantic versioning” for HBase releases. In short, future HBase releases will have MAJOR.MINOR.PATCH version with the explicit semantics for compatibility. The HBase book contains all the dimensions for compatibility and what can be expected between different versions.

What’s next

We have marked HBase-1.0.0 as the next stable version of HBase, meaning that all new users should start using this version. However, as a database, we understand that switching to a newer version might take some time. We will continue to maintain and make 0.98.x releases until the user community is ready for its end of life. 1.0.x releases as well as 1.1.0, 1.2.0, etc line of releases are expected to be released from their corresponding branches, while 2.0.0 and other major releases will follow when their time arrives.

Read replicas phase 2, per column family flush, procedure v2, SSD for WAL or column family data, etc are some of the upcoming features in the pipeline. 

Conclusion

Finally, the HBase 1.0.0 release has come a long way, with contributions from a very large group of awesome people and hard work from committers and contributors. We would like to extend our thanks to our users and all who have contributed to HBase over the years.

Keep HBase’ing!

Friday Aug 08, 2014

Comparing BlockCache Deploys

Comparing BlockCache Deploys

St.Ack on August 7th, 2014

A report comparing BlockCache deploys in Apache HBase for issue HBASE-11323 BucketCache all the time! It attempts to roughly equate five different deploys and compare how well they do under four different loading types that vary from no cache misses through to missing the cache most of the time, and it makes a recommendation on when to use which deploy.

Prerequisite

In Nick Dimiduk's BlockCache 101 blog post, he details the different options available in HBase. We test what remains after purging SlabCache and the caching policy implementation DoubleBlockCache, which have been deprecated in HBase 0.98 and removed in trunk because of Nick's findings and those of others.

Nick varies the JVM Heap+BlockCache sizes AND dataset size.  This report keeps JVM Heap+BlockCache size constant and varies the dataset size only. Nick looks at the 99th percentile only.  This article looks at that, as well as GC, throughput and loadings. Cell sizes in the following tests also vary between 1 byte and 256k in size.

Findings

If the dataset fits completely in cache, the default configuration, which uses the onheap LruBlockCache, performs best.  GC is half that of the next most performant deploy type, CombinedBlockCache:Offheap with at least 20% more throughput.

Otherwise, if your cache is experiencing churn running a steady stream of evictions, move your block cache offheap using CombinedBlockCache in the offheap mode. See the BlockCache section in the HBase Reference Guide for how to enable this deploy. Offheap mode requires only one third to one half of the GC of LruBlockCache when evictions are happening. There is also less variance as the cache misses climb when offheap is deployed. CombinedBlockCache:file mode has better GC profile but less throughput than CombinedBlockCache:Offheap. Latency is slightly higher on CBC:Offheap -- heap objects to cache have to be serialized in and out of the offheap memory -- than with the default LruBlockCache but the 95/99th percentiles are slightly better.  CBC:Offheap uses a bit more CPU than LruBlockCache. CombinedBlockCache:Offheap is limited only by the amount of RAM available.  It is not subject to GC.

Test

The graphs to follow show results from five different deploys each run through four different loadings.

Five Deploy Variants

  1. LruBlockCache The default BlockCache in place when you start up an unconfigured HBase install. With LruBlockCache, all blocks are loaded into the java heap. See BlockCache 101 and LruBlockCache for detail on the caching policy of LruBlockCache.
  2. CombinedBlockCache:Offheap CombinedBlockCache deploys two tiers; an L1 which is an LruBlockCache instance to hold META blocks only (i.e. INDEX and BLOOM blocks), and an L2 tier which is an instance of BucketCache. In this offheap mode deploy, the BucketCache uses DirectByteBuffers to host a BlockCache outside of the JVM heap to host cached DATA blocks.
  3. CombinedBlockCache:Onheap In this onheap ('heap' mode) deploy, the L2 cache is hosted inside the JVM heap and appears to the JVM to be a single large allocation. Internally it is managed by an instance of BucketCache. The L1 cache is an instance of LruBlockCache.
  4. CombinedBlockCache:file In this mode, an L2 BucketCache instance puts DATA blocks into a file (hence 'file' mode) on a mounted tmpfs in this case.
  5. CombinedBlockCache:METAonly No caching of DATA blocks (no L2 instance). DATA blocks are fetched every time. The INDEX blocks are loaded into an L1 LruBlockCache instance.

Memory is fixed for each deploy. The java heap is a small 8G to bring on distress earlier. For deploy type 1., the LruBlockCache is given 4G of the 8G JVM heap. For deploy types 2.-5., the L1 LruBlockCache is 0.1 * 8G (~800MB), which is more than enough to host the dataset META blocks. This was confirmed by inspecting the Block Cache vitals displayed in the RegionServer UI. For deploy types 2., 3., and 4., the L2 bucket cache is 4G. Deploy type 5.'s L2 is 0G (used HBASE-11581 Add option so CombinedBlockCache L2 can be null (fscache)).

Four Loading Types

  1. All cache hits all the time.
  2. A small percentage of cache misses, with all misses inside the fscache soon after the test starts.
  3. Lots of cache misses, all still inside the fscache soon after the test starts.
  4. Mostly cache misses where many misses by-pass the fscache.

For each loading, we ran 25 clients reading randomly over 10M cells of zipfian varied cell sizes from 1 byte to 256k bytes over 21 regions hosted by a single RegionServer for 20 minutes. Clients would stay inside the cache size for loading 1., miss the cache at just under ~50% for loading 2., and so on.

The dataset was hosted on a single RegionServer on a small HDFS cluster of 5 nodes.  See below for detail on the setup.

Graphs

For each attribute -- GC, throughput, i/o -- we have five charts across, one for each deploy.  Each graph can be divided into four parts: the first quarter has the all-in-cache loading running for 20 minutes, followed by the loading that has some cache misses (for twenty minutes), through to the loading with mostly cache misses in the final quarter.

To find out what each color represents, read the legend.  The legend is small in the graphs below. To see a larger version, browse to the non-apache-roller version of this document and click on the images over there.

Concentrate on the first four graph types.  The BlockCache Stats and I/O graphs add little other than confirming that the loadings ran the same across the different BlockCache deploys with fscache cutting in to soak up the seeks soon after start for all but the last mostly-cache-miss loading cases.

GC

Be sure to check the y-axis in the graphs below (you can click on chart to get a larger version). All profiles look basically the same but a check of the y-axis will show that for all but the no-cache-misses case, CombinedBlockCache:Offheap, the second deploy type, has the best GC profile (less GC is better!).

The GC for the CombinedBlockCache:Offheap deploy mode looks to be climbing as the test runs.  See the Notes section at the end for further comment.

[GC graphs]

Throughput

CombinedBlockCache:Offheap is better unless there are few to no cache misses.  In that case, LruBlockCache shines (I missed why there is the step in the LruBlockCache all-in-cache section of the graph).

[Throughput graphs]

Latency

[Latency graphs]

Latency when in-cache

Same as above except that we do not show the last loading -- we show the first three only -- since the last loading of mostly cache misses skews the diagrams such that it is hard to compare latency when we are inside cache (BlockCache and fscache).
[Latency-when-in-cache graphs]

Load

Try aggregating the system and user CPUs when comparing.
[Load graphs]

BlockCache Stats

Curves are mostly the same except for the cache-no-DATA-blocks case. It has no L2 deployed.

[BlockCache stats graphs]

I/O

Read profiles are about the same across all deploys and tests with a spike at first until the fscache comes around and covers. The exception is the final mostly-cache-misses case.
[I/O graphs]

Setup

Master branch. 2.0.0-SNAPSHOT, r69039f8620f51444d9c62cfca9922baffe093610.  Hadoop 2.4.1-SNAPSHOT.

5 nodes all the same with 48G and six disks.  One master and one regionserver, each on distinct nodes, with 21 regions of 10M rows of zipf varied size -- 0 to 256k -- created as follows:

$ export HADOOP_CLASSPATH=/home/stack/conf_hbase:`./hbase-2.0.0-SNAPSHOT/bin/hbase classpath`
$ nohup  ./hadoop-2.4.1-SNAPSHOT/bin/hadoop --config /home/stack/conf_hadoop org.apache.hadoop.hbase.PerformanceEvaluation --valueSize=262144 --valueZipf --rows=100000 sequentialWrite 100 &

Here is my loading script.  It performs 4 loadings: All in cache, just out of cache, a good bit out of cache and then mostly out of cache:

[stack@c2020 ~]$ more bin/bc_in_mem.sh
#!/bin/sh
HOME=/home/stack
testtype=$1
date=`date -u +"%Y-%m-%dT%H:%M:%SZ"`
echo "testtype=$testtype $date" >> nohup.out
HBASE_HOME=$HOME/hbase-2.0.0-SNAPSHOT
runtime=1200
clients=25
cycles=1000000
#for i in 38 76 304 1000; do
for i in 32 72 144 1000; do
  echo "`date` run size=${i}, clients=$clients ; $testtype time=$runtime size=$i" >> nohup.out
  timeout $runtime nohup ${HBASE_HOME}/bin/hbase --config /home/stack/conf_hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --valueSize=110000 --size=$i --cycles=$cycles randomRead $clients
done
Java version:
[stack@c2020 ~]$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

I enabled caching options by uncommenting these configuration properties in hbase-site.xml. The hfile.block.cache.size was set to 0.5 to keep the math simple.
<!--LRU Cache-->
<property>
<name>hfile.block.cache.size</name>
<value>0.5</value>
</property>

<!--Bucket cache-->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>heap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4196</value>
</property>

Notes

The CombinedBlockCache has some significant overhead (to be better quantified -- 10%?). There will be slop left over because buckets will not fill to the brim especially when block sizes vary.  Observed running loadings.

I tried to bring on a FullGC by running for more than a day with LruBlockCache fielding mostly BlockCache misses and I failed. Thinking on it, the BlockCache only occupied 4G of an 8G heap. There was always elbow room available (free block count and maximum block size allocatable settled down to a nice number and held constant). TODO: Retry but with less elbow room available.

Longer running CBC offheap test

Some of the tests above show disturbing GC tendencies -- ever-rising GC -- so I ran tests for a longer period. It turns out that BucketCache throws an OOME in long-running tests; you need HBASE-11678 BucketCache ramCache fills heap after running a few hours (fixed in hbase-0.98.6+). After the fix, the below test ran until the cluster was pulled out from under the loading:
[Charts: GC over three days; throughput over three days]
Here are some long running tests with bucket cache onheap (ioengine=heap). The throughput is less and GC is higher.
[Charts: GC over three days, onheap; throughput over three days, onheap]


Future

Does LruBlockCache get more erratic as heap size grows?  Nick's post implies that it does.
Auto-sizing of L1, onheap cache.

HBASE-11331 [blockcache] lazy block decompression has a report attached that evaluates the current state of the attached patch to keep blocks compressed while in the BlockCache.  It finds that with the patch enabled, "More GC. More CPU. Slower. Less throughput. But less i/o.", but there is an issue with compression block pooling that likely impinges on performance.

Review serialization in and out of L2 cache. No serialization?  Because we can take blocks from HDFS without copying onheap?

Friday Apr 11, 2014

The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size

By Doug Meil, HBase Committer and Thomas Murphy

Intro

One of the most common questions in the HBase user community is how to estimate the disk footprint of tables, which translates into the size of HFiles, the internal file format of HBase.

We designed an experiment at Explorys where we ran combinations of design time options (rowkey length, column name length, row storage approach) and runtime options (HBase ColumnFamily compression, HBase data block encoding options) to determine these factors’ effects on the resultant HFile size in HDFS.

HBase Environment

CDH4.3.0 (HBase 0.94.6.1)

Design Time Choices

  1. Rowkey

    1. Thin: a 16-byte MD5 hash of an integer.

    2. Fat: a 64-byte SHA-256 hash of an integer.

    Note: neither of these is a realistic rowkey for a real application, but they were chosen because they are easy to generate and one is a lot bigger than the other.

  2. Column Names

    1. Thin: 2-3 character column names (c1, c2).

    2. Fat: 10 characters, randomly chosen but consistent for all rows.

    Note: it is advisable to have small column names, but most people don’t start that way, so we have this as an option.

  3. Row Storage Approach

    1. KeyValue per column: the traditional way of storing data in HBase.

    2. One KeyValue per row (actually, two): one KV holds an Avro-serialized byte array containing all the data from the row; the other KV holds an MD5 hash of the version of the Avro schema. (A sketch of this approach follows below.)
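To make the one-KeyValue-per-row approach concrete, here is a minimal sketch of how such a row might be written with the Java client and Avro's GenericRecord API. The schema, table, family, and qualifier names are made up for illustration and are not the exact code used in the experiment; imports and exception handling are omitted.

// Hypothetical sketch: pack a whole logical row into one Avro-serialized
// byte array stored as a single KeyValue, plus a second KeyValue carrying
// an MD5 hash that identifies the Avro schema version.
Schema schema = new Schema.Parser().parse(rowSchemaJson); // Avro record schema for the 30 logical columns

GenericRecord record = new GenericData.Record(schema);
record.put("s1", "twenty random chars");
record.put("i1", 12345);
// ... set the remaining string/int/long fields ...

// Serialize the record to a byte array.
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(record, encoder);
encoder.flush();

byte[] rowkey = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(i)); // "thin" 16-byte rowkey
byte[] schemaHash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(schema.toString()));

Put put = new Put(rowkey);
put.add(Bytes.toBytes("d"), Bytes.toBytes("avro"), out.toByteArray());
put.add(Bytes.toBytes("d"), Bytes.toBytes("schemaHash"), schemaHash);
table.put(put); // table: an HTable opened against the test table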

Run Time

  1. Column Family Compression

    1. None
    2. GZ
    3. LZ4
    4. LZO
    5. Snappy

    Note: it is generally advisable to use compression, but what if you didn’t? So we tested that too.

  2. HBase Block Encoding

    1. None
    2. Prefix
    3. Diff
    4. Fast Diff

    Note: most people aren’t familiar with HBase Data Block Encoding. Primarily intended for squeezing more data into the block cache, it has effects on HFile size too. See HBASE-4218 for more detail.

1000 rows were generated for each combination of table parameters. That is not a ton of data, but we don’t necessarily need a ton of data to see the varying size of the table. There were 30 columns per row, comprising 10 strings (each filled with 20 bytes of random characters), 10 integers (random numbers), and 10 longs (also random numbers).

HBase blocksize was 128k.
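For reference, here is a minimal sketch of how the runtime options above are applied when creating a table from the Java client. The table and family names are made up for illustration, the calls assume the 0.94-era admin API, and imports and exception handling are omitted.

// Hypothetical sketch: create a table whose column family uses Snappy
// compression and FAST_DIFF data block encoding, with a 128k block size.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

HColumnDescriptor family = new HColumnDescriptor("d");
family.setCompressionType(Compression.Algorithm.SNAPPY);
family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
family.setBlocksize(128 * 1024);

HTableDescriptor desc = new HTableDescriptor("psudorandom-table");
desc.addFamily(family);
admin.createTable(desc);
admin.close();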


Results

The easiest way to navigate the results is to compare specific cases, progressing from an initial implementation of a table to options for production.

Case #1: Fat Rowkey and Fat Column Names, Now What?

This is where most people start with HBase. Rowkeys are not as optimal as they should be (i.e., the Fat rowkey case) and column names are also inflated (Fat column-names).

Without CF Compression or Data Block Encoding, the baseline is:

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY-FATCOL

HFile Size (bytes)    Rows    Compression    Data Block Encoding
6,293,670             1000    NONE           NONE

What if we just changed CF compression?

This drastically changes the HFile footprint. Snappy compression reduces the HFile size from 6.2 Mb to 1.8 Mb, for example.

HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,362,033             1000    GZ             NONE
1,803,240             1000    SNAPPY         NONE
1,919,265             1000    LZ4            NONE
1,950,306             1000    LZO            NONE

However, we shouldn’t be too quick to celebrate. Remember that this is just the disk footprint. Over the wire the data is uncompressed, so 6.2 Mb is still being transferred from RegionServer to Client when doing a Scan over the entire table.

What if we just changed data block encoding?

Compression isn’t the only option though. Even without compression, we can change the data block encoding and also achieve HFile reduction. All options are an improvement over the 6.2 Mb baseline.


HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,491,000             1000    NONE           DIFF
1,492,155             1000    NONE           FAST_DIFF
2,244,963             1000    NONE           PREFIX

Combination

The following table shows the results of all remaining CF compression / data block encoding combinations.

HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,146,675             1000    GZ             DIFF
1,200,471             1000    GZ             FAST_DIFF
1,274,265             1000    GZ             PREFIX
1,350,483             1000    SNAPPY         DIFF
1,358,190             1000    LZ4            DIFF
1,391,016             1000    SNAPPY         FAST_DIFF
1,402,614             1000    LZ4            FAST_DIFF
1,406,334             1000    LZO            FAST_DIFF
1,541,151             1000    SNAPPY         PREFIX
1,597,440             1000    LZO            PREFIX
1,622,313             1000    LZ4            PREFIX

Case #2: What if we re-designed the column names (and left the rowkey alone)?

Let’s assume that we re-designed our column names but left the rowkey alone. Using the “thin” column names without CF compression or data block encoding results in an HFile 5.8 Mb in size, an improvement on the original 6.2 Mb baseline. It doesn’t seem like much, but it’s still a 6.5% reduction in the eventual wire-transfer footprint.

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY

HFile Size (bytes)    Rows    Compression    Data Block Encoding
5,778,888             1000    NONE           NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,349,451             1000    SNAPPY         DIFF
1,390,422             1000    SNAPPY         FAST_DIFF
1,536,540             1000    SNAPPY         PREFIX
1,785,480             1000    SNAPPY         NONE

Case #3: What if we re-designed the rowkey (and left the column names alone)?

In this example, what if we only redesigned the rowkey? Using the “thin” rowkey results in an HFile of 4.9 Mb, down from the 6.2 Mb baseline -- a 21% reduction. Not a small savings!

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATCOL

HFile Size (bytes)    Rows    Compression    Data Block Encoding
4,920,984             1000    NONE           NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,295,895             1000    SNAPPY         DIFF
1,337,112             1000    SNAPPY         FAST_DIFF
1,489,446             1000    SNAPPY         PREFIX
1,739,871             1000    SNAPPY         NONE

However, note that the resulting HFile size with Snappy and no data block encoding (1.7 Mb) is very similar in size to the baseline approach (i.e., fat rowkeys, fat column-names) with Snappy and no data block encoding (1.8 Mb). Why? The CF compression can compensate on disk for a lot of bloat in rowkeys and column names.

Case #4: What if we re-designed both the rowkey and the column names?

By this time we’ve learned enough HBase to know that we need to have efficient rowkeys and column-names. This produces an HFile that is 4.4 Mb, a 29% savings over the baseline of 6.2 Mb.

HFile Size (bytes)    Rows    Compression    Data Block Encoding
4,406,418             1000    NONE           NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,296,402             1000    SNAPPY         DIFF
1,338,135             1000    SNAPPY         FAST_DIFF
1,485,192             1000    SNAPPY         PREFIX
1,732,746             1000    SNAPPY         NONE

Again, the on-disk footprint with compression isn’t radically different from the others, as compression can compensate to a large degree for rowkey and column name bloat.

Case #5: KeyValue Storage Approach (e.g., 1 KV vs. KV-per-Column)

What if we did something radical and changed how we stored the data in HBase? With this approach, we are using a single KeyValue per row holding all of the columns of data for the row instead of a KeyValue per column (the traditional way).

The resulting HFile, even uncompressed and without Data Block Encoding, is radically smaller at 1.4 Mb compared to 6.2 Mb.

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-AVRO

HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,374,465             1000    NONE           NONE

Adding Snappy compression and Data Block Encoding makes the resulting HFile size even smaller.

HFile Size (bytes)    Rows    Compression    Data Block Encoding
1,119,330             1000    SNAPPY         DIFF
1,129,209             1000    SNAPPY         FAST_DIFF
1,133,613             1000    SNAPPY         PREFIX
1,150,779             1000    SNAPPY         NONE

Compare the 1.1 Mb HFile (Snappy, no encoding) produced by this single-KeyValue approach to the 1.7 Mb HFile (Snappy, no encoding) produced by the thin rowkey/thin column-name, KeyValue-per-column approach.

Summary

Although Compression and Data Block Encoding can wallpaper over bad rowkey and column-name decisions in terms of HFile size, you will pay the price for this in terms of data transfer from RegionServer to Client. Also, concealing the size penalty brings with it a performance penalty each time the data is accessed or manipulated. So, the old advice about correctly designing rowkeys and column names still holds.

In terms of KeyValue approach, having a single KeyValue per row presents significant savings both in terms of data transfer (RegionServer to Client) and HFile size. However, there are consequences to this approach: each row must be updated in its entirety, and old versions of rows are also stored in their entirety (i.e., as opposed to column-by-column changes). Furthermore, it is impossible to scan on select columns; the whole row must be retrieved and deserialized to access any information stored in the row. The importance of understanding this tradeoff cannot be over-stated, and it must be evaluated on an application-by-application basis.

Software engineering is an art of managing tradeoffs, so there isn’t necessarily one “best” answer. Importantly, this experiment only measures the file size and not the time or processor load penalties imposed by the use of compression, encoding, or Avro. The results generated in this test are still based on certain assumptions and your mileage may vary.

Here is the data if interested: http://people.apache.org/~dmeil/HBase_HFile_Size_2014_04.csv

Monday Mar 24, 2014

HBaseCon2014, the HBase Community Event, May 5th in San Francisco


By the HBaseCon Program Committee and Justin Kestelyn


HBaseCon 2014 (May 5, 2014 in San Francisco) is coming up fast.  It is shaping up to be another grand day out for the HBase community, with a dense agenda spread over four tracks of case studies, dev, ops, and ecosystem.  Check out our posted agenda grid. As in previous years, the program committee favored presentations on open source, shipping technologies and descriptions of in-production systems. The schedule is chock-a-block with a wide diversity of deploy types (industries, applications, scale!) and interesting dev and operations stories.  Come to the conference and learn from your peer HBase users, operators and developers.


The general session will lead off with a welcome by Mike Olson of Cloudera (Cloudera is the host for this HBase community conference).

Next up will be a talk by the leads of the BigTable team at Google, Avtandil Garakanidze and Carter Page.  They will present on BigTable, the technology that HBase is based on.  Bigtable is “...the world’s largest multi-purpose database, supporting 90 percent of Google’s applications around the world.”  They will give a brief overview of “...Bigtable evolution since it was originally described in an OSDI ’06 paper, its current use cases at Google, and future directions.”

Then come our own Liyin Tang of Facebook and Lars Hofhansl of Salesforce.com.  Liyin will speak on “HydraBase: Facebook’s Highly Available and Strongly Consistent Storage Service Based on Replicated HBase Instances”.  HBase powers multiple mission-critical online applications at Facebook; the talk will cover the design of HydraBase (including the replication protocol), an analysis of a failure scenario, and a plan for contributions to the HBase trunk.  Lars will explain how Salesforce.com’s scalability requirements led it to HBase and the multiple use cases for HBase there today. You’ll also learn how Salesforce.com works with the HBase community, and get a detailed look into its operational environment.

We then break up into sessions where you will be able to learn about operations best practices, new features in depth, and what is being built on top of HBase. You can also hear how Pinterest builds large scale HBase apps up in AWS, HBase at Bloomberg, OPower, RocketFuel, Xiaomi, Yahoo! and others, as well as a nice little vignette on how HBase is an enabling technology at OCLC replacing all instances of Oracle powering critical services for the libraries of the world.

There will be an optional pre-conference prep for attendees who are new to HBase by Jesse Anderson (Cloudera University).  Jesse’s talk will offer a brief Cliff’s Notes-level HBase overview before the General Session that covers architecture, API, and schema design.

Register now for HBaseCon 2014 at: https://hbasecon2014.secure.mfactormeetings.com/registration/

A big shout out goes out to our sponsors – Continuuity, Hortonworks, Intel, LSI, MapR, Salesforce.com, Splice Machine, WibiData (Gold); Facebook, Pepperdata (Silver); ASF (Community); O’Reilly Media, The Hive, NoSQL Weekly (Media) — without which HBaseCon would be impossible!

Wednesday Feb 19, 2014

Apache HBase 0.98.0 is released

hbase-0.98.0 is now available for download from the Apache mirrors, and its artifacts are available in the Apache Maven repository.

Friday Jan 10, 2014

HBase Cell Security

By Andrew Purtell, HBase Committer and member of the Intel HBase Team

Introduction

Apache HBase is “the Apache Hadoop database”, a horizontally scalable nonrelational datastore built on top of components offered by the Apache Hadoop ecosystem, notably Apache ZooKeeper and Apache Hadoop HDFS. Although HBase therefore offers first class Hadoop integration, and is often chosen for that reason, it has come into its own as a good choice for high scale data storage of record. HBase is often the kernel of big data infrastructure.

There are many advantages offered by HBase: it is free open source software, it offers linear and modular scalability, it has strictly consistent reads and writes, automatic and configurable sharding and automatic failover are core concerns, and it has a deliberate tolerance for operation on “commodity” hardware, which is a very significant benefit at high scale. As more organizations are faced with Big Data challenges, HBase is increasingly found in roles formerly occupied by traditional data management solutions, gaining new users with new perspectives and the requirements and challenges of new use cases. Some of those users have a strong interest in security. They may be a healthcare provider operating within a strict regulatory regime regarding the access and sharing of patient information. They might be a consumer web property in a jurisdiction with strong data privacy laws. They could be a military or government agency that must manage a strict separation between multiple levels of information classification.

Access Control Lists and Security Labels

For some time Apache HBase has offered a strong security model based on Kerberos authentication and access control lists (ACLs). When Yahoo first made a version of Hadoop capable of strong authentication available in 2009, my team and I at a former employer, a commercial computer security vendor, were very interested. We needed a scale out platform for distributed computation, but we also needed one that could provide assurance against access of information by unauthorized users or applications. Secure Hadoop was a good fit. We also found in HBase a datastore that could scale along with computation, and offered excellent Hadoop integration, except first we needed to add feature parity and integration with Secure Hadoop. Our work was contributed back to Apache as HBASE-2742 and ZOOKEEPER-938. We also contributed an ACL-based authorization engine as HBASE-3025, and the coprocessor framework upon which it was built as HBASE-2000 and others. As a result, in the Apache family of Big Data storage options, HBase was the first to offer strong authentication and access control. These features have been improved and evolved by the HBase community many times over since then.

An access control list, or ACL, is a list of permissions associated with an object. The ACL specifies which subjects (users or system processes) shall be granted access to those objects, as well as which operations are allowed. Each entry in an ACL describes a subject and the operation(s) the subject is entitled to perform. When a subject requests an operation, HBase first uses Hadoop’s strong authentication support, based on Kerberos, to verify the subject’s identity. HBase then finds the relevant ACLs, and checks the entries of those ACLs to decide whether or not the request is authorized. This is an access control model that provides a lot of assurance, and is a flexible way for describing security policy, but not one that addresses all possible needs. We can do more.

All values written to HBase are stored in what is known as a cell. (“Cell” is used interchangeably with “key-value” or “KeyValue”, mainly for legacy reasons.) Cells are identified by a multidimensional key: { row, column, qualifier, timestamp }. (“Column” is used interchangeably with “family” or “column family”.) The table is implicit in every key, even if not actually present, because all cells are written into columns, and every column belongs to a table. This forms a hierarchical relationship: 

table -> column family -> cell

Users can today grant individual access permissions to subjects on tables and column families. Table permissions apply to all column families and all cells within those families. Column family permissions apply to all cells within the given family. The permissions will be familiar to any DBA: R (read), W (write), C (create), X (execute), A (admin). However for various reasons, until today, cell level permissions were not supported.

Other high scale data storage options in the Apache family, notably Apache Accumulo, take a different approach. Accumulo has a data model almost identical to that of HBase, but implements a security mechanism called cell-level security. Every cell in an Accumulo store can have a label, stored effectively as part of the key, which is used to determine whether a value is visible to a given subject or not. The label is not an ACL, it is a different way of expressing security policy. An ACL says explicitly what subjects are authorized to do what. A label instead turns this on its head and describes the sensitivity of the information to a decision engine that then figures out if the subject is authorized to view data of that sensitivity based on (potentially, many) factors. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while enforcing strict separation between multiple levels of information classification. HBase users might approximate this model using ACLs, but it would be labor intensive and error prone.

New HBase Cell Security Features

Happily our team here at Intel has been busy extending HBase with cell level security features. First, contributed as HBASE-8496, HBase can now store arbitrary metadata for a cell, called tags, along with the cell. Then, as of HBASE-7662, HBase can store into and apply ACLs from cell tags, extending the current HBase ACL model down to the cell. Then, as of HBASE-7663, HBase can store visibility expressions into tags, providing cell-level security capabilities similar to Apache Accumulo, with API and shell support that will be familiar to Accumulo users. Finally, we have also contributed transparent server side encryption, as HBASE-7544, for additional assurance against accidental leakage of data at rest. We are working with the HBase community to make these features available in the next major release of HBase, 0.98.

Let’s now talk a bit more about how HBase visibility labels and per-cell ACLs work.

HFile version 3

In order to take advantage of any cell level access control features, it will be necessary to store data in the new HFile version, 3. HFile version 3 is very similar to HFile version 2 except it also has support for persisting and retrieving cell tags, optional dictionary based compression of tag contents, and the HBASE-7544 encryption feature. Enabling HFile version 3 is as easy as adding a single configuration setting to all HBase site XML files, followed by a rolling restart. All existing HFile version 2 files will be read normally. New files will be written in the version 3 format. Although HFile version 3 will be marked as experimental throughout the HBase 0.98 release cycle, we have found it to be very stable under high stress conditions on our test clusters.

HBase Visibility Labels

We have introduced a new coprocessor, the VisibilityController, which can be used on its own or in conjunction with HBase’s AccessController (responsible for ACL handling). The VisibilityController determines, based on label metadata stored in the cell tag and associated with a given subject, if the user is authorized to view the cell. The maximal set of labels granted to a user is managed by new shell commands getauths, setauths, and clearauths, and stored in a new HBase system table. Accumulo users will find the new HBase shell commands familiar.

When storing or mutating a cell, the HBase user can now add visibility expressions, using a backwards compatible extension to the HBase API. (By backwards compatible, we mean older servers will simply ignore the new cell metadata, as opposed to throw an exception or fail.)

Mutation#setCellVisibility(new CellVisibility(String labelExpression));

The visibility expression can contain labels joined with logical expressions ‘&’, ‘|’ and ‘!’. Also using ‘(‘, ‘)’ one can specify the precedence order. For example, consider the label set { confidential, secret, topsecret, probationary }, where the first three are sensitivity classifications and the last describes if an employee is probationary or not. If a cell is stored with this visibility expression:

( secret | topsecret ) & !probationary

Then any user associated with the secret or topsecret label will be able to view the cell, as long as the user is not also associated with the probationary label. Furthermore, any user only associated with the confidential label, whether probationary or not, will not see the cell or even know of its existence. Accumulo users will also find HBase visibility expressions familiar, though HBase provides a superset of the boolean operators.
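For example, here is a minimal sketch of writing a cell carrying that visibility expression (the table, family, and qualifier names are made up for illustration; exception handling is omitted):

// Hypothetical sketch: this cell will only be visible to non-probationary
// holders of the secret or topsecret labels.
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("classified value"));
put.setCellVisibility(new CellVisibility("( secret | topsecret ) & !probationary"));
table.put(put); // table: an HTable against a table stored in HFile version 3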

We build the user’s label set in the RPC context when a request is first received by the HBase RegionServer. How users are associated with labels is pluggable. The default plugin passes through labels specified in Authorizations added to the Get or Scan. This will also be familiar to Accumulo users. 

Get#setAuthorizations(new Authorizations(String,…));

Scan#setAuthorizations(new Authorizations(String,…));

Authorizations not in the maximal set of labels granted to the user are dropped. From this point, visibility expression processing is very fast, using set operations.
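Continuing the sketch above, a scan that supplies authorizations might look like this (again, names are illustrative and exception handling is omitted):

// Hypothetical sketch: a user granted the topsecret label (and not the
// probationary label) sees the cell written above; a user supplying only
// confidential does not. Labels outside the user's granted set are dropped.
Scan scan = new Scan();
scan.setAuthorizations(new Authorizations("topsecret"));
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
  System.out.println(result);
}
scanner.close();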

In the future we envision additional plugins which may interrogate an external source when building the effective label set for a user, for example LDAP or Active Directory. Consider our earlier example. Perhaps the sensitivity classifications are attached when cells are stored into HBase, but the probationary label, determined by the user’s employment status, is provided by an external authorization service.

HBase Cell ACLs

We have extended the existing HBase ACL model to the cell level.

When storing or mutating a cell, the HBase user can now add ACLs, using a backwards compatible extension to the HBase API.

Mutation#setACL(String user, Permission perms);

Like at the table or column family level, a subject is granted permissions to the cell. Any number of permissions for any number of users (or groups using @group notation) can be added.

From then on, access checks for operations on the cell include the permissions recorded in the cell’s ACL tag, using union-of-permission semantics.
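A minimal sketch (the user, table, family, and qualifier names are made up for illustration; exception handling is omitted):

// Hypothetical sketch: grant the user "alice" read access to this one cell,
// in addition to whatever table or column family grants may already exist.
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
put.setACL("alice", new Permission(Permission.Action.READ));
table.put(put);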

First we check table or column family level permissions*. If they grant access, we can early out before going to the blockcache or disk to check the cell for ACLs. Table designers and security architects can therefore optimize for the common case by granting users permissions at the table or column family level. However, if some cells do require more fine-grained control, and neither the table nor column family checks succeed, we will enumerate the cells covered by the operation. By “covered”, we mean that we ensure every location which would be visibly modified by the operation has a cell ACL in place that grants access. We can stop at the first cell ACL that does not grant access.

For a Put, Increment, or Append we check the permissions for the most recent visible version. “Visible” means not covered by a delete tombstone. We treat the ACLs in each Put as timestamped like any other HBase value. A new ACL in a new Put applies to that Put. It doesn't change the ACL of any previous Put. This allows simple evolution of security policy over time without requiring expensive updates. To change the ACL at a specific { row, column, qualifier, timestamp } coordinate, a new value with a new ACL must be stored to that location exactly.

For Increments and Appends, we do the same thing as with Puts, except we will propagate any ACLs on the previous value unless the operation carries a new ACL explicitly.

Finally, for a Delete, we require write permissions on all cells covered by the delete. Unlike in the case of other mutations we need to check all visible prior versions, because a major compaction could remove them. If the user doesn't have permission to overwrite any of the visible versions ("visible", again, is defined as not covered by a tombstone already) then we have to disallow the operation.

* - For the sake of coherent explanation, this overlooks an additional feature. Optionally, on a per-operation basis, how cell ACLs are factored into the authorization decision can be flipped around. Instead of first checking table or column family level permissions, we enumerate the set of ACLs covered by the operation first, and only if there are no grants there do we check for table or column family permissions. This is useful for use cases where a user is not granted table or column family permissions on a table and the cell level ACLs instead provide exceptional access. The default is useful for use cases where the user is granted table or column family permissions and cell level ACLs might withhold authorization. The default is likely to perform better. Again, which strategy is used can be specified on a per-operation basis.

Andrew 

This blog post was republished from https://communities.intel.com/community/datastack/blog/2013/10/29/hbase-cell-security

Friday Oct 25, 2013

Phoenix Guest Blog


James Taylor 

I'm thrilled to be guest blogging today about Phoenix, a BSD-licensed open source SQL skin for HBase that powers the HBase use cases at Salesforce.com. As opposed to relying on map-reduce to execute SQL queries as other similar projects do, Phoenix compiles your SQL query into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. On top of that, you can add secondary indexes to your tables to transform what would normally be a full table scan into a series of point gets (and we all know how well HBase performs with those).
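Because Phoenix exposes everything through JDBC, querying it from Java looks like using any other JDBC driver. Here is a minimal sketch; it assumes the stock_price table created in the getting-started steps below and a ZooKeeper quorum on localhost, exception handling is omitted, and depending on the Phoenix version you may need to load the driver class explicitly before connecting:

// Hypothetical sketch: run a Phoenix query over JDBC.
// The connection URL form is jdbc:phoenix:<zookeeper quorum>.
Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
PreparedStatement stmt = conn.prepareStatement(
    "select ticker, avg(price) from stock_price where ticker = ? group by ticker");
stmt.setString(1, "CRM");
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
  System.out.println(rs.getString(1) + " " + rs.getDouble(2));
}
rs.close();
conn.close();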

Getting Started

To get started using Phoenix, follow these directions:

  • Download and expand the latest phoenix-2.1.0-install.tar from download page
  • Add the phoenix-2.1.0.jar to the classpath of every HBase region server. An easy way to do this is to copy it into the HBase lib directory.
  • Restart all region servers
  • Start our terminal interface that is bundled with our distribution against your cluster:
          $ bin/sqlline.sh localhost
    
  • Create a Phoenix table:
          create table stock_price(ticker varchar(6), price decimal, date date);
    
  • Load some data either through one of our CSV loading tools, or by mapping an existing HBase table to Phoenix table.
  • Run some queries:
          select * from stock_price;
          select * from stock_price where ticker='CRM';
          select ticker, avg(price) from stock_price 
              where date between current_date()-30 and current_date() 
              group by ticker;
    

Questions/Comments?

Take a look at our recent announcement for our release which includes support for secondary indexing, peruse our FAQs and join our mailing list.


Tuesday Oct 22, 2013

HBase 0.96.0 Released

by Stack, the 0.96.0 Release Manager

Here are some notes on our recent hbase-0.96.0 release (For the complete list of over 2k issues addressed in 0.96.0, see Apache JIRA Release Notes).


hbase-0.96.0 was more than a year in the making.   It was heralded by three developer releases -- 0.95.0, 0.95.1 and 0.95.2 -- and it went through six release candidates before we arrived at our final assembly released on Friday, October 18th, 2013.


The big themes that drove this release, gleaned from a rough survey of users and from our experience with HBase deploys, were:


  • Improved stability: A new suite of integration cluster tests (HBASE-6241, HBASE-6201), configurable by node count, data sizing, duration, and “chaos” quotient, turned up loads of bugs around assignment and data views when scanning and fetching.  These we fixed in hbase-0.96.0.  Table locks added for cross-cluster alterations, and cross-row transaction support enabled on our system tables by default, now keep us well clear of whole classes of problematic states.

  • Scaling: HBase is being deployed on larger clusters.  The way we kept schema in the filesystem, or archived WAL files one at a time once done replicating, worked fine on clusters of hundreds of nodes but made for significant friction when we moved up to the next scaling level.

  • Mean Time To Recovery (MTTR): A sustained effort in HBase and in our substrate, HDFS, narrowed the amount of time data is offline after node outage.

  • Operability: Many new tools were added to help operators of hbase clusters: from a radical redo of the metrics emissions, through a new UI  and exposed hooks for health scripts.  It is now possible to trace lagging calls down through the HBase stack (HBASE-9121 Update HTrace to 2.00 and add new example usage) to figure where time is spent and soon, through HDFS itself, with support for pretty visualizations in Twitter Zipkin (See HBase Tracing from a recent meetup).

  • Freedom to Evolve: We redid how we persisted everywhere, whether in the filesystem or up into zookeeper, but also how we carry queries and data back and forth over RPC.  Where serialization was hand-crafted when we were on Hadoop Writables, we now use generated Google protobufs.  Standardizing serialization on protobufs, with well-defined schemas, will make it easier to evolve versions of the client and servers independently of each other in a compatible manner, without having to take a cluster restart going forward.

  • Support for hadoop1 and hadoop2: hbase-0.96.0 will run on either.  We do not ship a universal binary.  Rather you must pick your poison; hbase-0.96.0-hadoop1 or hbase-0.96.0-hadoop2 (Differences in APIs between the two versions of Hadoop forced this delivery format).  hadoop2 is far superior to hadoop1 so we encourage you move to it.  hadoop2 has improvements that make HBase operation run smoother, facilitates better performance -- e.g. secure short-circuit reads -- as well as fixes that help our MTTR story.

  • Minimal disturbance to the API: Downstream projects should just work.  The API has been cleaned up and divided into user vs developer APIs and all has been annotated using Hadoop’s system for denoting APIs stable, evolving, or private.  That said, a load of work was invested making it so APIs were retained. Radical changes in API that were present in the last developer release were undone in late release candidates because of downstreamer feedback.


Below we dig in on a few of the themes and features shipped in 0.96.0.


Mean Time To Recovery

HBase guarantees a consistent view by having a single server at a time solely responsible for data. If this server crashes, data is ‘offline’ until another server assumes responsibility.  When we talk of improving Mean Time To Recovery in HBase, we mean narrowing the time during which data is offline after a node crash.  This offline period is made up of phases: a detection phase, a repair phase, reassignment, and finally, clients noticing the data available in its new location.  A fleet of fixes to shrink all of these distinct phases have gone into hbase-0.96.0.


In the detection phase, the default zookeeper session period has been shrunk and a sample watcher script will intercede on server outage and delete the regionservers ephemeral node so the master notices the crashed server missing sooner (HBASE-5844 Delete the region servers znode after a regions server crash). The same goes for the master (HBASE-5926  Delete the master znode after a master crash). At repair time, a running tally makes it so we replay fewer edits cutting replay time (HBASE-6659 Port HBASE-6508 Filter out edits at log split time).  A new replay mechanism has also been added (HBASE-7006 Distributed log replay -- disabled by default) that speeds recovery by skipping having to persist intermediate files in HDFS.  The HBase system table now has its own dedicated WAL, so this critical table can come back before all others (See HBASE-7213 / HBASE-8631).  Assignment all around has been speeded up by bulking up operations, removing synchronizations, and multi-threading so operations can run in parallel.


In HDFS, a new notion of ‘staleness’ was introduced (HDFS-3703, HDFS-3712).  On recovery, the namenode will avoid including stale datanodes saving on our having to first timeout against dead nodes before we can make progress (Related, HBase avoids writing a local replica when writing the WAL instead writing all replicas remote out on the cluster; the replica that was on the dead datanode is of no use come recovery time. See HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes).  Other fixes, such as HDFS-4721 Speed up lease/block recovery when DN fails and a block goes into recovery shorten the time involved assuming ownership of the last, unclosed WAL on server crash.


And there is more to come inside the 0.96.x timeline: e.g. bringing regions online immediately for writes, retained locality when regions come up on the new server because we write replicas using the ‘Favored Nodes’ feature, etc.


Be sure to visit the Reference Guide for configurations that enable and tighten MTTR; for instance, ‘staleness’ detection in HDFS needs to be enabled on the HDFS-side.  See the Reference Guide for how.


HBASE-5305 Improve cross-version compatibility & upgradeability

Freedom to Evolve

Rather than continue to hand-write serializations as is required when using Hadoop Writables -- our serialization mechanism up through hbase-0.94.x -- in hbase-0.96.0 we moved the whole shebang over to protobufs.  Everywhere HBase persists we now use protobuf serializations, whether writing zookeeper znodes, files in HDFS, or whenever we send data over the wire when RPC’ing.


Protobufs support evolving types, if careful, making it so we can amend Interfaces in a compatible way going forward, a freedom we were sorely missing -- or to be more precise, was painful to do -- when all serialization was by hand in Hadoop Writables.  This change breaks compatibility with previous versions.


Our RPC is also now described using protobuf Service definitions.  Generated stubs are hooked up to a derivative, stripped down version of the Hadoop RPC transport.  Our RPC now has a specification.  See the Appendix in the Reference Guide.


HBASE-8015 Support for Namespaces

Our brothers and sisters over at Yahoo! contributed table namespaces, a means of grouping tables similar to mysql’s notion of database, so they can better manage their multi-tenant deploys.  To follow in short order will be quota, resource allocation, and security all by namespace.


HBASE-4050 Rationalize metrics, metric2 framework implementation


New metrics have been added and the whole plethora given a radical edit, better categorization, naming and typing; patterns were enforced so the myriad metrics are navigable and look pretty up in JMX.  Metrics have been moved up on to the Hadoop 2 Metrics 2 Interfaces.  See Migration to the New Metrics Hotness – Metrics2 for detail.


New Region Balancer


A new balancer, using an algorithm similar to simulated annealing or greedy hill climbing, factors in not only region count (the only attribute considered by the old balancer) but also region read/write load, locality, and other attributes when coming up with a balance decision.

Cell


In hbase-0.96.0, we began work on a long-time effort to move off of our base KeyValue type and move instead to use a Cell Interface throughout the system.  The intent is to open up the way to try different implementations of the base type; different encodings, compressions, and layouts of content to better align with how the machine works. The move, though not yet complete, has already yielded performance gains.  The Cell Interface shows through in our hbase-0.96.0 API with the KeyValue references deprecated in 0.96.0.  All further work should be internal-only and transparent to the user.

Incompatible changes



Miscellaneous




You will need to restart your cluster to come up on hbase-0.96.0.  After deploying the binaries, run a checker script that will look for the existence of old format HFiles no longer supported in hbase-0.96.0.  The script will warn you of their presence and will ask you to compact them away.  This can be done without disturbing current serving.  Once all have been purged, stop your cluster, and run a small migration script. The HBase migration script will upgrade the content of zookeeper and rearrange the content of the filesystem to support the new table namespaces feature.  The migration should take a few minutes at most.  Restart.  See Upgrading from 0.94.x to 0.96.x for details.


From here on out, 0.96.x point releases with bug fixes only will start showing up on a roughly monthly basis after the model established in our hbase-0.94 line.  hbase-0.98.0, our next major version, is scheduled to follow in short order (months).  You will be able to do a rolling restart off 0.96.x and up onto 0.98.0.  Guaranteed.


A big thanks goes out to all who helped make hbase-0.96.0 possible.


This release is dedicated to Shaneal Manek, HBase contributor.

Download your hbase-0.96.0 here.

Thursday Aug 08, 2013

Taking the Bait

Lars Hofhansl, Andrew Purtell, and Michael Stack

HBase Committers


Information Week recently published an article titled “Will HBase Dominate NoSQL?”. Michael Hausenblas of MapR argues the ‘For’ HBase case and Jonathan Ellis of Apache Cassandra and vendor DataStax argues ‘Against’.

It is easy to dismiss this 'debate' as vendor sales talk and just get back to using and improving Apache HBase, but this article is a particularly troubling example. Here in both the ‘For’ and ‘Against’ arguments slight is being cast on the work of the HBase community. Here are some notes by way of redress:

First, Michael argues Hadoop is growing fast and because HBase came out of Hadoop and is tightly associated, ergo, HBase is on the up and up.  It is easy to make this assumption if you are not an active participant in the HBase community (We have also come across the inverse where HBase is driving Hadoop adoption). Michael then switches to a tired HBase versus Cassandra bake off, rehashing the old consistency versus eventual-consistency wars, ultimately concluding that HBase is ‘better’ simply because Facebook, the Cassandra wellspring, dropped it to use HBase instead as the datastore for some of their large apps. We would not make that kind of argument. Whether or not one should utilize Apache HBase or Apache Cassandra depends on a number of factors and is, like with most matters of scale, too involved a discussion for sound bites.


Then Michael does a bait-and-switch where he says “..we’ve created a ‘next version’ of enterprise HBase.... We brought it into GA under the label M7 in May 2013”.  The ‘We’ in the quote does not refer to the Apache HBase community but to Michael’s employer, MapR Technologies, and the ‘enterprise HBase’ he is talking of is not Apache HBase.  M7 is a proprietary product that, to the best of our knowledge, is fundamentally different architecturally from Apache HBase. We cannot say more because of the closed source nature of the product. This strikes us as an attempt to attach the credit and good will that the Apache HBase community has built up over the years, through hard work and contributions, to a commercial closed source product that is NOT Apache HBase.


Now let us address Jonathan’s “Against” argument.  Some of Jonathan’s claims are no longer true -- “RegionServer failover takes 10 to 15 minutes” (see HDFS-3703 and HDFS-3912) -- or are highly subjective: “Developing against HBase is painful.”  (In our opinion, our client API is simpler and easier to use than the commonly used Cassandra client libraries.) Otherwise, we find nothing here that has not been hashed and rehashed over the years in forums ranging from mailing lists to Quora. Jonathan is mostly listing out provenance, what comes of our being tightly coupled to Apache Hadoop’s HDFS filesystem and our tracing the outline of the Google BigTable architecture. HBase derives many benefits from HDFS and inspiration from BigTable. As a result, we excel at some use cases that would be problematic for Cassandra. The reverse is also true. Neither we nor HDFS are standing still as the hardware used by our users evolves, or as new users bring experience from new use cases.


He highlights a difference in approach.


We see our adoption of Apache HDFS, our integration with Apache Hadoop MapReduce, and our use of Apache ZooKeeper for coordination as a sound separation of concerns. HBase is built on battle tested components. This is a feature, not a bug.


Where Jonathan might see a built in query language and secondary indexing facility as necessary complications to core Cassandra, we encourage and support projects like Salesforce’s Phoenix as part of a larger ecosystem centered around Apache HBase. The Phoenix guys are able to bring domain expertise to that part of the problem while we (HBase) can focus on providing a stable and performant storage engine core. Part of what has made Apache Hadoop so successful is its ecosystem of supporting and enriching projects - an ecosystem that includes HBase. An ecosystem like that is developing around HBase.


When Jonathan veers off to talk of the HBase community being “fragmented” with divided “[l]eadership”, we think perhaps what is being referred to is the fact that the Apache HBase project is not an “owned” project, a project led by a single vendor.  Rather it is a project where multiple teams from many organizations - Cloudera, Hortonworks, Salesforce, Facebook, Yahoo, Intel, Twitter, and Taobao, to name a few - are all pushing the project forward in collaboration. Most of the Apache HBase community participates in shared effort on two branches - what is about to be our new 0.96 release, and on another branch for our current stable release, 0.94. Facebook also maintains their own branch of Apache HBase in our shared source control repository. This branch has been a source of inspiration and occasional direct contribution to other branches. We make no apologies for finding a convenient and effective collaborative model according to the needs and wishes of our community. To other projects driven by a single vendor this may seem suboptimal or even chaotic (“fragmented”).


We’ll leave it at that.


As always we welcome your participation in and contributions to our community and the Apache HBase project. We have a great product and a no-ego low-friction decision making process not beholden to any particular commercial concern. If you have an itch that needs scratching and wonder if Apache HBase is a solution, or, if you are using Apache HBase but feel it could work better for you, come talk to us.
