Entries tagged [hbase]

Friday Apr 11, 2014

The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size

By Doug Meil, HBase Committer and Thomas Murphy

Intro

One of the most common questions in the HBase user community is how to estimate the disk footprint of tables, which translates into the size of HFiles, HBase's internal file format.

We designed an experiment at Explorys where we ran combinations of design time options (rowkey length, column name length, row storage approach) and runtime options (HBase ColumnFamily compression, HBase data block encoding options) to determine these factors’ effects on the resultant HFile size in HDFS.

HBase Environment

CDH4.3.0 (HBase 0.94.6.1)

Design Time Choices

  1. Rowkey

    1. Thin

      1. 16-byte MD5 hash of an integer.

    2. Fat

      1. 64-byte SHA-256 hash of an integer.

    Note: neither of these is a realistic rowkey for a real application, but they were chosen because they are easy to generate and one is a lot bigger than the other.

  2. Column Names

    1. Thin

      1. 2-3 character column names (c1, c2).

    2. Fat

      1. 10 characters, randomly chosen but consistent for all rows.

    Note: it is advisable to have small column names, but most people don't start that way, so we have this as an option.

  3. Row Storage Approach

    1. Key Value per Column

      1. This is the traditional way of storing data in HBase.

    2. One Key Value per Row (a sketch of this approach follows this list)

      1. Actually, two.

      2. One KV holds an Avro-serialized byte array containing all the data from the row.

      3. Another KV holds an MD5 hash of the version of the Avro schema.
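To make the one-KeyValue-per-row option concrete, here is a minimal sketch of what building such a row might look like. It is illustrative only: the Avro schema, family, and qualifier names are assumptions, and only three fields are shown rather than the 30 used in the experiment.

    import java.io.ByteArrayOutputStream;
    import java.security.MessageDigest;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvroRowWriter {
      // Illustrative schema; the experiment used 10 strings, 10 ints, and 10 longs per row.
      private static final Schema SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
          + "{\"name\":\"s1\",\"type\":\"string\"},"
          + "{\"name\":\"i1\",\"type\":\"int\"},"
          + "{\"name\":\"l1\",\"type\":\"long\"}]}");

      public static Put buildPut(byte[] rowkey, String s1, int i1, long l1) throws Exception {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("s1", s1);
        record.put("i1", i1);
        record.put("l1", l1);

        // Serialize the whole row into a single Avro byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();

        // One KV for the serialized row, one KV for a hash identifying the schema version.
        byte[] schemaHash = MessageDigest.getInstance("MD5")
            .digest(Bytes.toBytes(SCHEMA.toString()));
        Put put = new Put(rowkey);                                   // 0.94-era API; later releases use addColumn
        put.add(Bytes.toBytes("d"), Bytes.toBytes("avro"), out.toByteArray());
        put.add(Bytes.toBytes("d"), Bytes.toBytes("schema"), schemaHash);
        return put;
      }
    }

The read path has to deserialize the whole byte array to get at any single field, which is the tradeoff discussed in Case #5 and the summary below.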

Run Time

  1. Column Family Compression

    1. None

    2. GZ

    3. LZ4

    4. LZO

    5. Snappy

    Note: it is generally advisable to use compression, but what if you didn't? So we tested that too.

  2. HBase Block Encoding

    1. None

    2. Prefix

    3. Diff

    4. Fast Diff

    Note: most people aren't familiar with HBase Data Block Encoding. Primarily intended for squeezing more data into the block cache, it has effects on HFile size too. See HBASE-4218 for more detail.

1000 rows were generated for each combination of table parameters. That is not a ton of data, but we don't need a ton of data to see how the table size varies. There were 30 columns per row: 10 strings (each filled with 20 bytes of random characters), 10 integers (random numbers), and 10 longs (also random numbers).

HBase blocksize was 128k.
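For reference, all of the runtime options above are column-family settings. Below is a minimal sketch of creating a comparable test table from the 0.94-era Java client API; the table and family names are illustrative, not the ones used in the experiment.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateTestTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HColumnDescriptor family = new HColumnDescriptor(Bytes.toBytes("d"));
        family.setCompressionType(Compression.Algorithm.SNAPPY);  // CF compression
        family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // data block encoding
        family.setBlocksize(128 * 1024);                          // 128k block size, as in the test

        HTableDescriptor table = new HTableDescriptor("test-table"); // illustrative name
        table.addFamily(family);
        admin.createTable(table);
        admin.close();
      }
    }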


Results

The easiest way to navigate the results is to compare specific cases, progressing from an initial implementation of a table to options for production.

Case #1: Fat Rowkey and Fat Column Names, Now What?

This is where most people start with HBase. Rowkeys are not as optimal as they should be (i.e., the Fat rowkey case) and column names are also inflated (Fat column-names).

Without CF Compression or Data Block Encoding, the baseline is:

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY-FATCOL

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
6,293,670            1000   NONE             NONE

What if we just changed CF compression?

This drastically changes the HFile footprint. Snappy compression reduces the HFile size from 6.2 Mb to 1.8 Mb, for example.

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,362,033            1000   GZ               NONE
1,803,240            1000   SNAPPY           NONE
1,919,265            1000   LZ4              NONE
1,950,306            1000   LZO              NONE

However, we shouldn’t be too quick to celebrate. Remember that this is just the disk footprint. Over the wire the data is uncompressed, so 6.2 Mb is still being transferred from RegionServer to Client when doing a Scan over the entire table.

What if we just changed data block encoding?

Compression isn’t the only option though. Even without compression, we can change the data block encoding and also achieve HFile reduction. All options are an improvement over the 6.2 Mb baseline.


HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,491,000            1000   NONE             DIFF
1,492,155            1000   NONE             FAST_DIFF
2,244,963            1000   NONE             PREFIX

Combination

The following table shows the results of all remaining CF compression / data block encoding combinations.

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,146,675            1000   GZ               DIFF
1,200,471            1000   GZ               FAST_DIFF
1,274,265            1000   GZ               PREFIX
1,350,483            1000   SNAPPY           DIFF
1,358,190            1000   LZ4              DIFF
1,391,016            1000   SNAPPY           FAST_DIFF
1,402,614            1000   LZ4              FAST_DIFF
1,406,334            1000   LZO              FAST_DIFF
1,541,151            1000   SNAPPY           PREFIX
1,597,440            1000   LZO              PREFIX
1,622,313            1000   LZ4              PREFIX

Case #2: What if we re-designed the column names (and left the rowkey alone)?

Let's assume that we re-designed our column names but left the rowkey alone. Using the "thin" column names without CF compression or data block encoding results in an HFile 5.8 Mb in size, an improvement over the original 6.2 Mb baseline. It doesn't seem like much, but it's still roughly an 8% reduction in the eventual wire-transfer footprint.

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
5,778,888            1000   NONE             NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,349,451            1000   SNAPPY           DIFF
1,390,422            1000   SNAPPY           FAST_DIFF
1,536,540            1000   SNAPPY           PREFIX
1,785,480            1000   SNAPPY           NONE

Case #3: What if we re-designed the rowkey (and left the column names alone)?

In this example, what if we only redesigned the rowkey? Using the "thin" rowkey results in a 4.9 Mb HFile, down from the 6.2 Mb baseline, a 21% reduction. Not a small savings!

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATCOL

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
4,920,984            1000   NONE             NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,295,895            1000   SNAPPY           DIFF
1,337,112            1000   SNAPPY           FAST_DIFF
1,489,446            1000   SNAPPY           PREFIX
1,739,871            1000   SNAPPY           NONE

However, note that the resulting HFile size with Snappy and no data block encoding (1.7 Mb) is very similar in size to the baseline approach (i.e., fat rowkeys, fat column-names) with Snappy and no data block encoding (1.8 Mb). Why? The CF compression can compensate on disk for a lot of bloat in rowkeys and column names.

Case #4: What if we re-designed both the rowkey and the column names?

By this time we’ve learned enough HBase to know that we need to have efficient rowkeys and column-names. This produces an HFile that is 4.4 Mb, a 29% savings over the baseline of 6.2 Mb.

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
4,406,418            1000   NONE             NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,296,402            1000   SNAPPY           DIFF
1,338,135            1000   SNAPPY           FAST_DIFF
1,485,192            1000   SNAPPY           PREFIX
1,732,746            1000   SNAPPY           NONE

Again, the on-disk footprint with compression isn't radically different from the others, as compression can compensate to a large degree for rowkey and column name bloat.

Case #5: KeyValue Storage Approach (e.g., 1 KV vs. KV-per-Column)

What if we did something radical and changed how we stored the data in HBase? With this approach, we are using a single KeyValue per row holding all of the columns of data for the row instead of a KeyValue per column (the traditional way).

The resulting HFile, even uncompressed and without Data Block Encoding, is radically smaller at 1.4 Mb compared to 6.2 Mb.

Table: psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-AVRO

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,374,465            1000   NONE             NONE

Adding Snappy compression and Data Block Encoding makes the resulting HFile size even smaller.

HFile Size (bytes)   Rows   CF Compression   Data Block Encoding
1,119,330            1000   SNAPPY           DIFF
1,129,209            1000   SNAPPY           FAST_DIFF
1,133,613            1000   SNAPPY           PREFIX
1,150,779            1000   SNAPPY           NONE

Compare the 1.1 Mb result here (Snappy, no encoding) with the 1.7 Mb result for the thin-rowkey/thin-column-name table in Case #4 (also Snappy, no encoding).

Summary

Although Compression and Data Block Encoding can wallpaper over bad rowkey and column-name decisions in terms of HFile size, you will pay the price for this in terms of data transfer from RegionServer to Client. Also, concealing the size penalty brings with it a performance penalty each time the data is accessed or manipulated. So, the old advice about correctly designing rowkeys and column names still holds.

In terms of the KeyValue approach, having a single KeyValue per row presents significant savings both in data transfer (RegionServer to Client) and in HFile size. However, this approach has consequences: each row must be updated in its entirety, and old versions of rows are also stored in their entirety (as opposed to column-by-column changes). Furthermore, it is impossible to scan on select columns; the whole row must be retrieved and deserialized to access any information stored in the row. The importance of understanding this tradeoff cannot be over-stated, and it must be evaluated on an application-by-application basis.

Software engineering is an art of managing tradeoffs, so there isn’t necessarily one “best” answer. Importantly, this experiment only measures the file size and not the time or processor load penalties imposed by the use of compression, encoding, or Avro. The results generated in this test are still based on certain assumptions and your mileage may vary.

Here is the data if interested: http://people.apache.org/~dmeil/HBase_HFile_Size_2014_04.csv

Monday Mar 24, 2014

HBaseCon2014, the HBase Community Event, May 5th in San Francisco


By the HBaseCon Program Committee and Justin Kestelyn


HBaseCon 2014 (May 5, 2014 in San Francisco) is coming up fast.  It is shaping up to be another grand day out for the HBase community with a dense agenda spread over four tracks of case studies, dev, ops, and ecosystem.  Check out our posted agenda grid. As in previous years, the program committee favored presentations on open source, shipping technologies and descriptions of in-production systems. The schedule is chock-a-block with a wide diversity of deploy types (industries, applications, scALE!) and interesting dev and operations stories.  Come to the conference and learn from your peer HBase users, operators and developers.


The general session will lead off with a welcome by Mike Olson of Cloudera (Cloudera is the host for this HBase community conference).

Next up will be a talk by the leads of the BigTable team at Google, Avtandil Garakanidze and Carter Page.  They will present on BigTable, the technology that HBase is based on.  Bigtable is “...the world’s largest multi-purpose database, supporting 90 percent of Google’s applications around the world.”  They will give a brief overview of “...Bigtable evolution since it was originally described in an OSDI ’06 paper, its current use cases at Google, and future directions.”

Then comes our own Liyin Tang of Facebook and Lars Hofhansl of Salesforce.com.  Liyin will speak on “HydraBase: Facebook’s Highly Available and Strongly Consistent Storage Service Based on Replicated HBase Instances”.  HBase powers multiple mission-critical online applications at Facebook; his talk will cover the design of HydraBase (including the replication protocol), an analysis of a failure scenario, and a plan for contributions to the HBase trunk.  Lars will explain how Salesforce.com’s scalability requirements led it to HBase and the multiple use cases for HBase there today. You’ll also learn how Salesforce.com works with the HBase community, and get a detailed look into its operational environment.

We then break up into sessions where you will be able to learn about operations best practices, new features in depth, and what is being built on top of HBase. You can also hear how Pinterest builds large scale HBase apps up in AWS, HBase at Bloomberg, OPower, RocketFuel, Xiaomi, Yahoo! and others, as well as a nice little vignette on how HBase is an enabling technology at OCLC replacing all instances of Oracle powering critical services for the libraries of the world.

Jesse Anderson (Cloudera University) will give an optional pre-conference session for attendees who are new to HBase: a brief, Cliff’s Notes-level HBase overview before the General Session, covering architecture, API, and schema design.

Register now for HBaseCon 2014 at: https://hbasecon2014.secure.mfactormeetings.com/registration/

A big shout out goes out to our sponsors – Continuuity, Hortonworks, Intel, LSI, MapR, Salesforce.com, Splice Machine, WibiData (Gold); Facebook, Pepperdata (Silver); ASF (Community); O’Reilly Media, The Hive, NoSQL Weekly (Media) — without which HBaseCon would be impossible!

Friday Jan 10, 2014

HBase Cell Security

By Andrew Purtell, HBase Committer and member of the Intel HBase Team

Introduction

Apache HBase is “the Apache Hadoop database”, a horizontally scalable nonrelational datastore built on top of components offered by the Apache Hadoop ecosystem, notably Apache ZooKeeper and Apache Hadoop HDFS. Although HBase therefore offers first class Hadoop integration, and is often chosen for that reason, it has come into its own as a good choice for high scale data storage of record. HBase is often the kernel of big data infrastructure.

There are many advantages offered by HBase: it is free open source software, it offers linear and modular scalability, it has strictly consistent reads and writes, automatic and configurable sharding and automatic failover are core concerns, and it has a deliberate tolerance for operation on “commodity” hardware, which is a very significant benefit at high scale. As more organizations are faced with Big Data challenges, HBase is increasingly found in roles formerly occupied by traditional data management solutions, gaining new users with new perspectives and the requirements and challenges of new use cases. Some of those users have a strong interest in security. They may be a healthcare provider operating within a strict regulatory regime regarding the access and sharing of patient information. They might be a consumer web property in a jurisdiction with strong data privacy laws. They could be a military or government agency that must manage a strict separation between multiple levels of information classification.

Access Control Lists and Security Labels

For some time Apache HBase has offered a strong security model based on Kerberos authentication and access control lists (ACLs). When Yahoo first made a version of Hadoop capable of strong authentication available in 2009, my team and I at a former employer, a commercial computer security vendor, were very interested. We needed a scale out platform for distributed computation, but we also needed one that could provide assurance against access of information by unauthorized users or applications. Secure Hadoop was a good fit. We also found in HBase a datastore that could scale along with computation, and offered excellent Hadoop integration, except first we needed to add feature parity and integration with Secure Hadoop. Our work was contributed back to Apache as HBASE-2742 and ZOOKEEPER-938. We also contributed an ACL-based authorization engine as HBASE-3025, and the coprocessor framework upon which it was built as HBASE-2000 and others. As a result, in the Apache family of Big Data storage options, HBase was the first to offer strong authentication and access control. These features have been improved and evolved by the HBase community many times over since then.

An access control list, or ACL, is a list of permissions associated with an object. The ACL specifies which subjects (users or system processes) shall be granted access to those objects, as well as which operations are allowed. Each entry in an ACL describes a subject and the operation(s) the subject is entitled to perform. When a subject requests an operation, HBase first uses Hadoop’s strong authentication support, based on Kerberos, to verify the subject’s identity. HBase then finds the relevant ACLs, and checks the entries of those ACLs to decide whether or not the request is authorized. This is an access control model that provides a lot of assurance, and is a flexible way for describing security policy, but not one that addresses all possible needs. We can do more.

All values written to HBase are stored in what is known as a cell. (“Cell” is used interchangeably with “key-value” or “KeyValue”, mainly for legacy reasons.) Cells are identified by a multidimensional key: { row, column, qualifier, timestamp }. (“Column” is used interchangeably with “family” or “column family”.) The table is implicit in every key, even if not actually present, because all cells are written into columns, and every column belongs to a table. This forms a hierarchical relationship: 

table -> column family -> cell

Users can today grant individual access permissions to subjects on tables and column families. Table permissions apply to all column families and all cells within those families. Column family permissions apply to all cells within the given family. The permissions will be familiar to any DBA: R (read), W (write), C (create), X (execute), A (admin). However for various reasons, until today, cell level permissions were not supported.

Other high scale data storage options in the Apache family, notably Apache Accumulo, take a different approach. Accumulo has a data model almost identical to that of HBase, but implements a security mechanism called cell-level security. Every cell in an Accumulo store can have a label, stored effectively as part of the key, which is used to determine whether a value is visible to a given subject or not. The label is not an ACL, it is a different way of expressing security policy. An ACL says explicitly what subjects are authorized to do what. A label instead turns this on its head and describes the sensitivity of the information to a decision engine that then figures out if the subject is authorized to view data of that sensitivity based on (potentially, many) factors. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while enforcing strict separation between multiple levels of information classification. HBase users might approximate this model using ACLs, but it would be labor intensive and error prone.

New HBase Cell Security Features

Happily our team here at Intel has been busy extending HBase with cell level security features. First, contributed as HBASE-8496, HBase can now store arbitrary metadata for a cell, called tags, along with the cell. Then, as of HBASE-7662, HBase can store into and apply ACLs from cell tags, extending the current HBase ACL model down to the cell. Then, as of HBASE-7663, HBase can store visibility expressions into tags, providing cell-level security capabilities similar to Apache Accumulo, with API and shell support that will be familiar to Accumulo users. Finally, we have also contributed transparent server side encryption, as HBASE-7544, for additional assurance against accidental leakage of data at rest. We are working with the HBase community to make these features available in the next major release of HBase, 0.98.

Let’s talk a bit more now about how the HBase visibility labels and per-cell ACLs work.

HFile version 3

In order to take advantage of any cell level access control features, it will be necessary to store data in the new HFile version, 3. HFile version 3 is very similar to HFile version 2 except it also has support for persisting and retrieving cell tags, optional dictionary based compression of tag contents, and the HBASE-7544 encryption feature. Enabling HFile version 3 is as easy as adding a single configuration setting to all HBase site XML files, followed by a rolling restart. All existing HFile version 2 files will be read normally. New files will be written in the version 3 format. Although HFile version 3 will be marked as experimental throughout the HBase 0.98 release cycle, we have found it to be very stable under high stress conditions on our test clusters.
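As a sketch, the setting in question is the hfile.format.version property in hbase-site.xml (verify the exact key against the release you are running):

    <property>
      <name>hfile.format.version</name>
      <value>3</value>
    </property>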

HBase Visibility Labels

We have introduced a new coprocessor, the VisibilityController, which can be used on its own or in conjunction with HBase’s AccessController (responsible for ACL handling). The VisibilityController determines, based on label metadata stored in the cell tag and associated with a given subject, if the user is authorized to view the cell. The maximal set of labels granted to a user is managed by new shell commands getauths, setauths, and clearauths, and stored in a new HBase system table. Accumulo users will find the new HBase shell commands familiar.

When storing or mutating a cell, the HBase user can now add visibility expressions, using a backwards compatible extension to the HBase API. (By backwards compatible, we mean older servers will simply ignore the new cell metadata, as opposed to throwing an exception or failing.)

Mutation#setCellVisibility(new CellVisibility(String labelExpression));

The visibility expression can contain labels joined with the logical operators ‘&’, ‘|’ and ‘!’. Parentheses, ‘(’ and ‘)’, can be used to specify precedence. For example, consider the label set { confidential, secret, topsecret, probationary }, where the first three are sensitivity classifications and the last describes whether an employee is probationary or not. If a cell is stored with this visibility expression:

( secret | topsecret ) & !probationary

Then any user associated with the secret or topsecret label will be able to view the cell, as long as the user is not also associated with the probationary label. Furthermore, any user only associated with the confidential label, whether probationary or not, will not see the cell or even know of its existence. Accumulo users will find HBase visibility expressions familiar as well, though HBase provides a superset of boolean operators.

We build the user’s label set in the RPC context when a request is first received by the HBase RegionServer. How users are associated with labels is pluggable. The default plugin passes through labels specified in Authorizations added to the Get or Scan. This will also be familiar to Accumulo users. 

Get#setAuthorizations(new Authorizations(String,…));

Scan#setAuthorizations(new Authorizations(String,…));

Authorizations not in the maximal set of labels granted to the user are dropped. From this point, visibility expression processing is very fast, using set operations.
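Putting the pieces together, a write-and-read round trip might look like the following sketch. The table, family, qualifier, and value are illustrative, and the snippet assumes the 0.98 visibility classes in org.apache.hadoop.hbase.security.visibility.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.security.visibility.Authorizations;
    import org.apache.hadoop.hbase.security.visibility.CellVisibility;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VisibilityExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "employee_data"); // illustrative table name

        // Store a cell visible only to non-probationary holders of secret or topsecret.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("classified value"));
        put.setCellVisibility(new CellVisibility("( secret | topsecret ) & !probationary"));
        table.put(put);

        // Scan with an authorization set; labels outside the user's granted set are dropped.
        Scan scan = new Scan();
        scan.setAuthorizations(new Authorizations("secret"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
          System.out.println(result);
        }
        scanner.close();
        table.close();
      }
    }

The scan above would return the cell for a user whose granted label set includes secret and does not include probationary; a user holding only confidential would see nothing.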

In the future we envision additional plugins which may interrogate an external source when building the effective label set for a user, for example LDAP or Active Directory. Consider our earlier example. Perhaps the sensitivity classifications are attached when cells are stored into HBase, but the probationary label, determined by the user’s employment status, is provided by an external authorization service.

HBase Cell ACLs

We have extended the existing HBase ACL model to the cell level.

When storing or mutating a cell, the HBase user can now add ACLs, using a backwards compatible extension to the HBase API.

Mutation#setACL(String user, Permission perms);

Like at the table or column family level, a subject is granted permissions to the cell. Any number of permissions for any number of users (or groups using @group notation) can be added.
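As a sketch of the API (the row, family, qualifier, user, and group names are illustrative):

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.security.access.Permission;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellAclExample {
      // Grant user "bob" and group "@auditors" read access to this one cell only.
      public static Put classifiedPut() {
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        put.setACL("bob", new Permission(Permission.Action.READ));
        put.setACL("@auditors", new Permission(Permission.Action.READ));
        return put;
      }
    }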

From then on, access checks for operations on the cell include the permissions recorded in the cell’s ACL tag, using union-of-permission semantics.

First we check table or column family level permissions*. If they grant access, we can early out before going to blockcache or disk to check the cell for ACLs. Table designers and security architects can therefore optimize for the common case by granting users permissions at the table or column family level. However, if some cells do require more fine grained control and neither table nor column family checks succeed, we will enumerate the cells covered by the operation. By “covered”, we mean that we ensure every location which would be visibly modified by the operation has a cell ACL in place that grants access. We can stop at the first cell ACL that does not grant access.

For a Put, Increment, or Append we check the permissions for the most recent visible version. “Visible” means not covered by a delete tombstone. We treat the ACLs in each Put as timestamped like any other HBase value. A new ACL in a new Put applies to that Put. It doesn't change the ACL of any previous Put. This allows simple evolution of security policy over time without requiring expensive updates. To change the ACL at a specific { row, column, qualifier, timestamp } coordinate, a new value with a new ACL must be stored to that location exactly.

For Increments and Appends, we do the same thing as with Puts, except we will propagate any ACLs on the previous value unless the operation carries a new ACL explicitly.

Finally, for a Delete, we require write permissions on all cells covered by the delete. Unlike in the case of other mutations we need to check all visible prior versions, because a major compaction could remove them. If the user doesn't have permission to overwrite any of the visible versions ("visible", again, is defined as not covered by a tombstone already) then we have to disallow the operation.

* - For the sake of coherent explanation, this overlooks an additional feature. Optionally, on a per-operation basis, how cell ACLs are factored into the authorization decision can be flipped around. Instead of first checking table or column family level permissions, we enumerate the set of ACLs covered by the operation first, and only if there are no grants there do we check for table or column family permissions. This is useful for use cases where a user is not granted table or column family permissions on a table and instead the cell level ACLs provide exceptional access. The default is useful for use cases where the user is granted table or column family permissions and cell level ACLs might withhold authorization. The default is likely to perform better. Again, which strategy is used can be specified on a per-operation basis.

Andrew 

This blog post was republished from https://communities.intel.com/community/datastack/blog/2013/10/29/hbase-cell-security

Friday Oct 25, 2013

Phoenix Guest Blog


James Taylor 

I'm thrilled to be guest blogging today about Phoenix, a BSD-licensed open source SQL skin for HBase that powers the HBase use cases at Salesforce.com. As opposed to relying on map-reduce to execute SQL queries as other similar projects do, Phoenix compiles your SQL query into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. On top of that, you can add secondary indexes to your tables to transform what would normally be a full table scan into a series of point gets (and we all know how well HBase performs with those).

Getting Started

To get started using Phoenix, follow these directions:

  • Download and expand the latest phoenix-2.1.0-install.tar from the download page
  • Add the phoenix-2.1.0.jar to the classpath of every HBase region server. An easy way to do this is to copy it into the HBase lib directory.
  • Restart all region servers
  • Start our terminal interface that is bundled with our distribution against your cluster:
          $ bin/sqlline.sh localhost
    
  • Create a Phoenix table:
          create table stock_price(ticker varchar(6), price decimal, date date);
    
  • Load some data either through one of our CSV loading tools, or by mapping an existing HBase table to a Phoenix table.
  • Run some queries:
          select * from stock_price;
          select * from stock_price where ticker='CRM';
          select ticker, avg(price) from stock_price 
              where date between current_date()-30 and current_date() 
              group by ticker;
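
The same queries can also be issued from Java through Phoenix's JDBC driver, which returns regular JDBC result sets. A minimal sketch, assuming the Phoenix client jar is on the classpath and a ZooKeeper quorum on localhost (the URL format is jdbc:phoenix:<zookeeper quorum>):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixQuery {
      public static void main(String[] args) throws Exception {
        // jdbc:phoenix:<zookeeper quorum>
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT ticker, AVG(price) FROM stock_price GROUP BY ticker");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
        }
        rs.close();
        stmt.close();
        conn.close();
      }
    }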
    

Questions/Comments?

Take a look at our recent announcement for our release which includes support for secondary indexing, peruse our FAQs and join our mailing list.


Tuesday Oct 22, 2013

HBase 0.96.0 Released

by Stack, the 0.96.0 Release Manager

Here are some notes on our recent hbase-0.96.0 release (For the complete list of over 2k issues addressed in 0.96.0, see Apache JIRA Release Notes).


hbase-0.96.0 was more than a year in the making.   It was heralded by three developer releases -- 0.95.0, 0.95.1 and 0.95.2 -- and it went through six release candidates before we arrived at our final assembly released on Friday, October 18th, 2013.


The big themes that drove this release, gleaned from a rough survey of users and from our experience with HBase deploys, were:


  • Improved stability: A new suite of integration cluster tests (HBASE-6241 HBASE-6201), configurable by node count, data sizing, duration, and “chaos” quotient, turned up loads of bugs around assignment and data views when scanning and fetching.  These we fixed in hbase-0.96.0.  Table locks added for cross-cluster alterations and cross-row transaction support, enabled on our system tables by default, now give us a wide berth around whole classes of problematic states.

  • Scaling: HBase is being deployed on larger clusters.  The way we kept schema in the filesystem, or archived WAL files one at a time when done replicating, worked fine on clusters of hundreds of nodes but made for significant friction when we moved to the next scaling level up.

  • Mean Time To Recovery (MTTR): A sustained effort in HBase and in our substrate, HDFS, narrowed the amount of time data is offline after node outage.

  • Operability: Many new tools were added to help operators of hbase clusters: from a radical redo of the metrics emissions, through a new UI  and exposed hooks for health scripts.  It is now possible to trace lagging calls down through the HBase stack (HBASE-9121 Update HTrace to 2.00 and add new example usage) to figure where time is spent and soon, through HDFS itself, with support for pretty visualizations in Twitter Zipkin (See HBase Tracing from a recent meetup).

  • Freedom to Evolve: We redid how we persisted everywhere, whether in the filesystem or up into zookeeper, but also how we carry queries and data back and forth over RPC.  Where serialization was hand-crafted when we were on Hadoop Writables, we now use generated Google protobufs.  Standardizing serialization on protobufs, with well-defined schemas, will make it easier to evolve versions of the client and servers independently of each other, in a compatible manner, without having to take a cluster restart going forward.

  • Support for hadoop1 and hadoop2: hbase-0.96.0 will run on either.  We do not ship a universal binary.  Rather you must pick your poison; hbase-0.96.0-hadoop1 or hbase-0.96.0-hadoop2 (Differences in APIs between the two versions of Hadoop forced this delivery format).  hadoop2 is far superior to hadoop1 so we encourage you move to it.  hadoop2 has improvements that make HBase operation run smoother, facilitates better performance -- e.g. secure short-circuit reads -- as well as fixes that help our MTTR story.

  • Minimal disturbance to the API: Downstream projects should just work.  The API has been cleaned up and divided into user vs developer APIs and all has been annotated using Hadoop’s system for denoting APIs stable, evolving, or private.  That said, a load of work was invested making it so APIs were retained. Radical changes in API that were present in the last developer release were undone in late release candidates because of downstreamer feedback.


Below we dig in on a few of the themes and features shipped in 0.96.0.


Mean Time To Recovery

HBase guarantees a consistent view by having a single server at a time solely responsible for data. If this server crashes, data is ‘offline’ until another server assumes responsibility.  When we talk of improving Mean Time To Recovery in HBase, we mean narrowing the time during which data is offline after a node crash.  This offline period is made up of phases: a detection phase, a repair phase, reassignment, and finally, clients noticing the data available in its new location.  A fleet of fixes to shrink all of these distinct phases have gone into hbase-0.96.0.


In the detection phase, the default zookeeper session period has been shrunk and a sample watcher script will intercede on server outage and delete the regionservers ephemeral node so the master notices the crashed server missing sooner (HBASE-5844 Delete the region servers znode after a regions server crash). The same goes for the master (HBASE-5926  Delete the master znode after a master crash). At repair time, a running tally makes it so we replay fewer edits cutting replay time (HBASE-6659 Port HBASE-6508 Filter out edits at log split time).  A new replay mechanism has also been added (HBASE-7006 Distributed log replay -- disabled by default) that speeds recovery by skipping having to persist intermediate files in HDFS.  The HBase system table now has its own dedicated WAL, so this critical table can come back before all others (See HBASE-7213 / HBASE-8631).  Assignment all around has been speeded up by bulking up operations, removing synchronizations, and multi-threading so operations can run in parallel.


In HDFS, a new notion of ‘staleness’ was introduced (HDFS-3703, HDFS-3712).  On recovery, the namenode will avoid including stale datanodes saving on our having to first timeout against dead nodes before we can make progress (Related, HBase avoids writing a local replica when writing the WAL instead writing all replicas remote out on the cluster; the replica that was on the dead datanode is of no use come recovery time. See HBASE-6435 Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes).  Other fixes, such as HDFS-4721 Speed up lease/block recovery when DN fails and a block goes into recovery shorten the time involved assuming ownership of the last, unclosed WAL on server crash.


And there is more to come inside the 0.96.x timeline: e.g. bringing regions online immediately for writes, retained locality when regions come up on the new server because we write replicas using the ‘Favored Nodes’ feature, etc.


Be sure to visit the Reference Guide for configurations that enable and tighten MTTR; for instance, ‘staleness’ detection in HDFS needs to be enabled on the HDFS-side.  See the Reference Guide for how.


HBASE-5305 Improve cross-version compatibility & upgradeability

Freedom to Evolve

Rather than continue to hand-write serializations as is required when using Hadoop Writables, our serialization mechanism up through hbase-0.94.x, in hbase-0.96.0 we moved the whole shebang over to protobufs.  Everywhere HBase persists data we now use protobuf serializations, whether writing zookeeper znodes or files in HDFS, and whenever we send data over the wire when RPC’ing.


Protobufs support evolving types, if careful, making it so we can amend Interfaces in a compatible way going forward, a freedom we were sorely missing -- or to be more precise, was painful to do -- when all serialization was by hand in Hadoop Writables.  This change breaks compatibility with previous versions.


Our RPC is also now described using protobuf Service definitions.  Generated stubs are hooked up to a derivative, stripped down version of the Hadoop RPC transport.  Our RPC now has a specification.  See the Appendix in the Reference Guide.


HBASE-8015 Support for Namespaces

Our brothers and sisters over at Yahoo! contributed table namespaces, a means of grouping tables similar to mysql’s notion of database, so they can better manage their multi-tenant deploys.  To follow in short order will be quota, resource allocation, and security all by namespace.
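A sketch of what this looks like from the Java client (the namespace and table names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.NamespaceDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class NamespaceExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Create a namespace, then a table inside it ("tenant_a:events").
        admin.createNamespace(NamespaceDescriptor.create("tenant_a").build());
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("tenant_a:events"));
        table.addFamily(new HColumnDescriptor("d"));
        admin.createTable(table);
        admin.close();
      }
    }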


HBASE-4050 Rationalize metrics, metric2 framework implementation


New metrics have been added and the whole plethora given a radical edit, better categorization, naming and typing; patterns were enforced so the myriad metrics are navigable and look pretty up in JMX.  Metrics have been moved up on to the Hadoop 2 Metrics 2 Interfaces.  See Migration to the New Metrics Hotness – Metrics2 for detail.


New Region Balancer


A new balancer, using an algorithm similar to simulated annealing or greedy hill-climbing, factors in not only region count (the only attribute considered by the old balancer) but also region read/write load and locality, among other attributes, when coming up with a balance decision.

Cell


In hbase-0.96.0, we began work on a long-time effort to move off of our base KeyValue type and move instead to use a Cell Interface throughout the system.  The intent is to open up the way to try different implementations of the base type; different encodings, compressions, and layouts of content to better align with how the machine works. The move, though not yet complete, has already yielded performance gains.  The Cell Interface shows through in our hbase-0.96.0 API with the KeyValue references deprecated in 0.96.0.  All further work should be internal-only and transparent to the user.

Incompatible changes



Miscellaneous




You will need to restart your cluster to come up on hbase-0.96.0.  After deploying the binaries, run a checker script that will look for the existence of old format HFiles no longer supported in hbase-0.96.0.  The script will warn you of their presence and will ask you to compact them away.  This can be done without disturbing current serving.  Once all have been purged, stop your cluster, and run a small migration script. The HBase migration script will upgrade the content of zookeeper and rearrange the content of the filesystem to support the new table namespaces feature.  The migration should take a few minutes at most.  Restart.  See Upgrading from 0.94.x to 0.96.x for details.


From here on out, 0.96.x point releases with bug fixes only will start showing up on a roughly monthly basis after the model established in our hbase-0.94 line.  hbase-0.98.0, our next major version, is scheduled to follow in short order (months).  You will be able to do a rolling restart off 0.96.x and up onto 0.98.0.  Guaranteed.


A big thanks goes out to all who helped make hbase-0.96.0 possible.


This release is dedicated to Shaneal Manek, HBase contributor.

Download your hbase-0.96.0 here.

Thursday Aug 08, 2013

Taking the Bait

Lars Hofhansl, Andrew Purtell, and Michael Stack

HBase Committers


Information Week recently published an article titled “Will HBase Dominate NoSQL?”. Michael Hausenblas of MapR argues the ‘For’ HBase case and Jonathan Ellis of Apache Cassandra and vendor DataStax argues ‘Against’.

It is easy to dismiss this 'debate' as vendor sales talk and just get back to using and improving Apache HBase, but this article is a particularly troubling example. In both the ‘For’ and ‘Against’ arguments, slights are cast on the work of the HBase community. Here are some notes by way of redress:

First, Michael argues Hadoop is growing fast and because HBase came out of Hadoop and is tightly associated, ergo, HBase is on the up and up.  It is easy to make this assumption if you are not an active participant in the HBase community (We have also come across the inverse where HBase is driving Hadoop adoption). Michael then switches to a tired HBase versus Cassandra bake off, rehashing the old consistency versus eventual-consistency wars, ultimately concluding that HBase is ‘better’ simply because Facebook, the Cassandra wellspring, dropped it to use HBase instead as the datastore for some of their large apps. We would not make that kind of argument. Whether or not one should utilize Apache HBase or Apache Cassandra depends on a number of factors and is, like with most matters of scale, too involved a discussion for sound bites.


Then Michael does a bait-and-switch where he says “..we’ve created a ‘next version’ of enterprise HBase.... We brought it into GA under the label M7 in May 2013”.  The ‘We’ in the quote does not refer to the Apache HBase community but to Michael’s employer, MapR Technologies and the ‘enterprise HBase’ he is talking of is not Apache HBase.  M7 is a proprietary product that, to the best of our knowledge, is fundamentally different architecturally from Apache HBase. We cannot say more because of the closed source nature of the product. This strikes us as an attempt to attach the credit and good will that the Apache HBase community have all built up over the years through hard work and contributions to a commercial closed source product that is NOT Apache HBase.


Now let us address Jonathan’s “Against” argument.  Some of Jonathan’s claims are no longer true: “RegionServer failover takes 10 to 15 minutes” (see HDFS-3703 and HDFS-3912), or highly subjective: “Developing against HBase is painful.”  (In our opinion, our client API is simpler and easier to use than the commonly used Cassandra client libraries.) Otherwise, we find nothing here that has not been hashed and rehashed out over the years in forums ranging from mailing lists to Quora. Jonathan is mostly listing out provenance, what comes of our being tightly coupled to Apache Hadoop’s HDFS filesystem and our tracing the outline of the Google BigTable Architecture. HBase derives many benefits from HDFS and inspiration from BigTable. As a result, we excel at some use cases that would be problematic for Cassandra. The reverse is also true. Neither we nor HDFS are standing still as the hardware used by our users evolves, or as new users bring experience from new use cases.


He highlights a difference in approach.


We see our adoption of Apache HDFS, our integration with Apache Hadoop MapReduce, and our use of Apache ZooKeeper for coordination as a sound separation of concerns. HBase is built on battle tested components. This is a feature, not a bug.


Where Jonathan might see a built in query language and secondary indexing facility as necessary complications to core Cassandra, we encourage and support projects like Salesforce’s Phoenix as part of a larger ecosystem centered around Apache HBase. The Phoenix guys are able to bring domain expertise to that part of the problem while we (HBase) can focus on providing a stable and performant storage engine core. Part of what has made Apache Hadoop so successful is its ecosystem of supporting and enriching projects - an ecosystem that includes HBase. An ecosystem like that is developing around HBase.


When Jonathan veers off to talk of the HBase community being “fragmented” with divided “[l]eadership”, we think perhaps what is being referred to is the fact that the Apache HBase project is not an “owned” project, a project led by a single vendor.  Rather it is a project where multiple teams from many organizations - Cloudera, Hortonworks, Salesforce, Facebook, Yahoo, Intel, Twitter, and Taobao, to name a few - are all pushing the project forward in collaboration. Most of the Apache HBase community participates in shared effort on two branches - what is about to be our new 0.96 release, and on another branch for our current stable release, 0.94. Facebook also maintains their own branch of Apache HBase in our shared source control repository. This branch has been a source of inspiration and occasional direct contribution to other branches. We make no apologies for finding a convenient and effective collaborative model according to the needs and wishes of our community. To other projects driven by a single vendor this may seem suboptimal or even chaotic (“fragmented”).


We’ll leave it at that.


As always we welcome your participation in and contributions to our community and the Apache HBase project. We have a great product and a no-ego low-friction decision making process not beholden to any particular commercial concern. If you have an itch that needs scratching and wonder if Apache HBase is a solution, or, if you are using Apache HBase but feel it could work better for you, come talk to us.

Monday Jul 29, 2013

Data Types != Schema

by Nick Dimiduk 

My work on adding data types to HBase has come along far enough that ambiguities in the conversation are finally starting to shake out. These were issues I’d hoped to address through initial design documentation and a draft specification. Unfortunately, it’s not until there’s real code implemented that the finer points are addressed in concrete. I’d like to take a step back from the code for a moment to initiate the conversation again and hopefully clarify some points about how I’ve approached this new feature.

If you came here looking for code, you’re out of luck. Go check out the parent ticket, HBASE-8089. For those who don’t care about my personal experiences with information theory, skip down to the TL;DR section at the end. You might also be satisfied by the slides I presented on this same topic at the Hadoop Summit HBase session in June.

A database by any other name

“HBase is a database.” This is the very first statement in HBase in Action. Amandeep and I debated, both between ourselves and with a few of our confidants, as to whether we should make that statement. The concern in my mind was never over the validity of the claim, but rather how it would be interpreted. “Database” has come to encompass a great deal of technologies and features, many of those features HBase doesn’t (yet) support. The confusion is worsened by the recent popularity of non-relational databases under the umbrella title NoSQL, a term which itself is confused [1]. In this post, I hope to tease apart some of these related ideas.

My experience with data persistence systems started with my first position out of university. I worked for a small company whose product, at its core, was a hierarchical database. That database had a very bare concept of tables and there was no SQL interface. Its primary construct was a hierarchy of nodes and its query engine was very good at traversing that hierarchy. The hierarchy was also all it exposed to its consumers, and querying the database was semantically equivalent to walking that hierarchy and executing functions on an individual node and its children. The only way to communicate with it was via a Java API, or later, the C++ interface. For a very long time, the only data type it could persist was a C-style char[]. Yet a client could connect to the database server over the network, persist data into it, and issue queries to retrieve previously persisted data and transformed versions thereof. It didn’t support SQL, it only spoke in Strings, but it was a database.

Under the hood, this data storage system used an open source, embedded database library with which you’re very likely familiar. The API for that database exposed a linear sequence of pages allocated from disk. Each page held a byte[]. You can think of the whole database as a persisted byte[][]. Queries against that database involved requesting a specific page by its ID and it returned to you the raw block of data that resided there. Our database engine delegated persistence responsibilities to that system, using it to manage its own concepts and data structures in a format that could be serialized to bytes on disk. Indeed, that embedded database library delegated much of its own persistence responsibilities to the operating system’s filesystem implementation.

In common usage, the word “database” tends to be shorthand for Relational Database Management System. Neither the hierarchical database, the embedded database, nor the filesystem qualify by this definition. Yet all three persist and retrieve data according to well defined semantics. HBase is also not a Relational Database Management System, but it persists and retrieves data according to well defined semantics. HBase is a database.

Data management as a continuum

Please bear with me as I wander blindly into the world of information theory.

I think of data management as a continuum. On the one extreme, we have the raw physical substrate on which information is expressed as matter. On the far other extreme is our ability to reason and form understandings about a catalog of knowledge. In computer systems, we narrow the scope of that continuum, but it’s still a range from the physical allocation of bits to a structure that can be interpreted by humans.

physical bits                             meaning
     |                                       |
     |-------------------|-------------------|
     |                                       |
                      database

A database provides an interface over which the persisted data is available for interaction. Those interacting with it are generally technical humans and systems that expose that data to non-technical humans through applications. The humans’ goal is primarily to derive meaning from that persisted data. The RDBMS exposes an interface that’s a lot closer to the human level than HBase or my other examples. That interface is brought closer in large part because the RDBMS includes a system of metadata description and management called a schema. Exposing a schema, a way to describe the data physically persisted, acts as a bridge from database to non-technical humans. It allows the human to describe the information they want to persist in a way that has meaning to both the human and the database.

A schema is metadata. It’s a description of the shape of the data and also provides hints to its intended meaning. Computer systems represent data as sequences of binary data. The schema helps us make sense of those digits. A schema can tell us that 0x01c7c6 represents the numeric value 99.99 which means “the average price of your monthly mobile phone bill.”

In addition to providing data management tools, most RDBMSs provided schema management tools. Managing schema is just as important as managing the data itself. I say that because without schema, how can I begin to understand what this collection of data means? As knowledge and needs change, so too does data. Just as the data management tools provide a method for changing data values, the schema management tools provide a method for tracking the change in meaning of the data.

From here to there

A database does not get a schema for free. In order to describe the meaning of persisted data, a schema needs a few building block concepts. Relational systems derive their name from the system of relational algebra upon which they describe their data and its access. A table contains records that all conform to a particular shape and that shape is described by a sequence of labeled columns. Tables often represent something specific and the columns describe attributes of that something. As humans, we often find it helpful to describe the domain of valid values an attribute can take. 99.99 makes sense for the average price described above, while hello world does not. A layer of abstraction is introduced, and we might describe the range of valid values for this average price as a numeric value representing a unit of currency with up to two decimals of precision between the range of 0.00 and 9999.99. We describe that aspect of the schema as a data type.

The “currency” data type we just defined allows us to be more specific about the meaning of the attribute in our schema. Better still, if we can describe the data type to our database, we can let it monitor and constrain the accepted values used in that attribute. That’s helpful for humans because the computer is probably better as managing these constraints than we are. It’s also helpful for the database because it no longer needs to store “any of the things” in this attribute. Instead, it only must store any of the valid values of this data type. That allows it to optimize the way it stores those values and potentially provide other meaningful operations on those values. With a data type defined, it opens the database to be able to answer queries about the data and ranges of data instead of just persisting and retrieving values. “What’s the lowest average price?” “What’s the highest?” “By how much do the average prices deviate?”

The filesystem upon which my hierarchical database sat couldn’t answer questions like those.

Data types bridge the gap between persistence layers and schema, allowing the database to share in the responsibility of value constraint management and allowing it to do more than just persist values. But data types are only half of the story. Just because I’ve declared an attribute to be of type “numeric” doesn’t mean the database can persist it. A data type can be implemented that honors the constraints upon numerical values, but there’s still another step between my value and the sequence of binary digits. That step is the encoding.

An encoding is a way to represent a value in binary digits. The simplest encoding for integer values is the representation of that number in base-2; this is a literal representation in binary digits. Encodings come with limitations though, this one included. In this case, it provides no means to represent a negative integer value. The 2’s complement encoding has the advantage of being able to represent negative integers. It also enjoys the convenience that most arithmetic operations on values in this encoding behave naturally.

Binary coded decimal is another encoding for integer values. It has different properties and different advantages and disadvantages than 2’s complement. Both are equally valid ways to represent integers as a sequence of binary data. Thus an integer data type, honoring all the constraints of integer values, can be encoded in multiple ways. Continuing the example, just like there are multiple valid relational schema designs to derive meaning over a data set of mobile subscribers, so too are there multiple valid encodings to represent an integer value [2].
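To make the "multiple encodings for the same value" point concrete, here is a small illustration contrasting the 2's complement bit pattern with the ZigZag encoding mentioned in note [2]; the zigZag helper is written out here as the commonly cited 32-bit formula, not an HBase API.

    public class EncodingDemo {
      // ZigZag maps signed integers to unsigned ones: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ...
      static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
      }

      public static void main(String[] args) {
        for (int n : new int[] {0, 1, -1, 2, -2}) {
          System.out.printf("%3d  two's complement: %s  zigzag: %d%n",
              n, Integer.toBinaryString(n), zigZag(n));
        }
      }
    }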

Data types for HBase

Thus far in its lifetime, HBase has provided data persistence. It does so in a rather unique way as compared to other databases. That method of persistence influences the semantics it exposes around persisting and retrieving the data it stores. To date, those semantics have exposed a very simple logical data model, that of a sorted, nested map of maps. That data model is heavily influenced by the physical data model of the database implementation.

Technically this data model is a schema because it defines a logical structure for data, complete with a data type. However, this model is very rudimentary as schemas go. It provides very few facilities for mapping application-level meaning to physical layout. The only data type this logical data model exposes is the humble byte[] and its encoding is a simple no-op [3].

While the byte[] is strictly sufficient, it’s not particularly convenient for application developers. I don’t want to think about my average subscription price as a byte[], but rather as a value conforming to the numeric type described earlier. HBase requires that I accept the burden of both data type constraint maintenance and data value encoding into my application.

HBase does provide a number of data encodings for the Java language's primitive types. These encodings are implemented in the toXXX methods on the Bytes class. These methods transform the Java types into byte[] and back again. The trouble is they (mostly, partially) don't preserve the sort order of the values they represent. This is a problem.
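A quick way to see the problem, using the unsigned lexicographic byte comparison HBase applies to keys:

    import org.apache.hadoop.hbase.util.Bytes;

    public class SortOrderDemo {
      public static void main(String[] args) {
        byte[] minusOne = Bytes.toBytes(-1); // 0xFFFFFFFF in two's complement
        byte[] plusOne  = Bytes.toBytes(1);  // 0x00000001

        // Compared as unsigned bytes (how HBase orders keys), -1 sorts AFTER 1.
        System.out.println(Bytes.compareTo(minusOne, plusOne)); // prints a positive number
      }
    }

Because the sign bit makes a negative integer's encoding start with 0xFF, -1 sorts after 1 when the bytes are compared as unsigned values.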

HBase’s semantics of a sorted map of maps is extremely important in designing table layouts for applications. The sort order influences physical layout of data on disk, which has direct impact on data access latency. The practice of HBase “schema design” is the task of laying out your data physically so as to minimize the latency of the access patterns that are important to your application. A major aspect of that is in designing a rowkey that orders well for the application. Because the default encodings do not always honor the natural sorting of the values they represent, it can become difficult to reason about application performance. Application developers are left to devise their own encoding systems that honor the natural sorting of any data types they wish to use.

Doing more for application developers

In HBASE-8089, I proposed that we expand the set of data types that HBase exposes to include a number of new members. The idea being that those other types make it easier for developers to build applications. It includes an initial list of suggested data types and some proposals about how they might be implemented, hinting toward considerations of order-preserving encodings.

HBASE-8201 defines a new utility class for data encodings called OrderedBytes. The encodings implemented there are designed primarily to produce byte[]s that preserve the natural sort order of the values they represent. They are also implemented in such a way as to be self-identifying. Meaning, an encoded value can be inspected to determine which encoding it represents. This last feature makes the encoding scheme at least rudimentarily machine readable and is particularly valuable, in my opinion. It enables reader tools (a raw data inspector, for example) to be encoding aware even in the absence of knowledge about schema or data types.

HBASE-8693 advocates an extensible data type API, so that application developers can easily introduce new data types. This allows HBase applications to implement their own data types that the HBase community hasn’t thought of or does not think are appropriate to ship with the core system. The working implementation of that DataType API and a number of pre-supported data types are provided. Those data types are built on the two codecs, Bytes and OrderedBytes. That means application developers will have access to basic constructs like number and Strings in addition to byte[]. It also means that sophisticated users can develop highly specialized data types for use as the foundation of extensions to HBase. My hope is this will make extension efforts on par with PostGIS possible in HBase.

Please take note that nothing here approaches the topic of schema or schema management. My personal opinion is that not enough complex applications have been written against HBase for the project to ship with such a system out of the box. Two notable efforts which are imposing schema onto HBase are Phoenix and Kiji. The former seeks to translate a subset of the Relational model onto HBase. The latter is devising its own solution, presumably modeled after its authors’ own experiences. In both cases, I hope these projects can benefit from HBase providing some additional data encodings and an API for user-extensible data types.

Conclusions

It’s an exciting time to be building large, data-driven applications. We enjoy a wealth of new tools, not just HBase, that make this easier than ever before. Still, the same tools that make these things possible are in the early stages of usability. Hopefully these efforts will move the conversation forward. Please take a moment to review these tickets. Let us know what data types we haven’t thought about and what encoding schemes you fancy. Poke holes in the data type extension API and provide counter examples that you can’t implement due to lack of expression. Take this opportunity to customize your tools to better fit your own hands.

Notes

[1] The term NoSQL is used to reference pretty much anything that stores data and didn’t exist a decade ago. This includes quite an array of data tools. Amandeep and I studied and summarized the landscape in a short body of work that was cut from the book. Key-value stores, graph databases, in-memory stores, object stores, and hybrids of the above all make the cut. About all they can agree on is they don’t like using the relational model to describe their data.

[2] For more fun, check out the ZigZag encoding described in the documentation of Google’s protobuf.

[3] That’s almost true. HBase also provides a uint64 data type exposed through the Increment API.


Wednesday May 01, 2013

Migration to the New Metrics Hotness – Metrics2

by Elliott Clark

HBase Committer and Cloudera Engineer 

NOTE: This blog post describes the server code of HBase. It assumes a general knowledge of the system. You can read more about HBase in the blog posts here.

Introduction

HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help  spot problems before they arise, debug on-going issues, evaluate new usage patterns, and provide insight into capacity planning.

Since October 2008 (version 0.19.0, HBASE-625), HBase has been using Hadoop’s metrics system to export metrics to JMX, Ganglia, and other metrics sinks. As the code base grew, more and more metrics were added by different developers. New features got metrics. When users needed more data on issues, they added more metrics. These new metrics were not always consistently named, and some were not well documented.

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started.  I also wanted to clean up the implementation cruft that had accumulated.

Definitions

The implementation details are pretty dense with terminology, so let’s make sure everything is defined:

  • Metric: A measurement of a property in the system.

  • Snapshot: A set of metrics at a given point in time.

  • Metrics1: The old Apache Hadoop metrics system.

  • Metrics2: The new overhauled Apache Hadoop Metrics system.

  • Source: A class that exposes metrics to the Hadoop metrics system.

  • Sink: A class that receives metrics snapshots from the Hadoop metrics system.

  • JMX: Java Management Extensions. A system built into Java that facilitates the management of Java processes over a network; it includes the ability to expose metrics.

  • Dynamic Metrics: Metrics that come and go. These metrics are not all known at compile time; instead they are discovered at runtime.

Implementation

The Hadoop Metrics2 system implementations in branch-1 and branch-2 have diverged pretty drastically. This means that a single implementation of the code to move metrics from HBase to Metrics2 sinks would not be performant or easy. As a result I created different Hadoop compatibility shims and a system to load the right version at runtime. This led to using ServiceLoader to create an instance of any class that touched parts of Hadoop that had changed between branch-1 and branch-2.

Here is an example of how a region server could request a Hadoop 2 version of the shim for exposing metrics about the HRegionServer. (Hadoop 1’s compatibility jar is shown in dotted lines to indicate that it could be swapped in if Hadoop 1 was being used)
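A minimal code sketch of the same idea (the interface and class names are illustrative placeholders, not the actual HBase compatibility classes; the point is that java.util.ServiceLoader picks whichever implementation the compatibility jar on the classpath provides):

import java.util.ServiceLoader;

// Illustrative shim interface; each compatibility jar (hadoop1 or hadoop2)
// would ship its own implementation plus a META-INF/services entry naming it.
interface RegionServerMetricsShim {
  void updateGet(long latencyMillis);
}

class ShimLoader {
  // Load whichever implementation is on the classpath at runtime.
  static RegionServerMetricsShim load() {
    for (RegionServerMetricsShim shim : ServiceLoader.load(RegionServerMetricsShim.class)) {
      return shim; // first provider found wins
    }
    throw new IllegalStateException("No metrics compatibility jar found on the classpath");
  }
}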



This system allows HBase to support both Hadoop 1.x and Hadoop 2.x implementations without using reflection or other tricks to get around differences in API, usage, and naming.

Now that HBase can use either the Hadoop 1 or Hadoop 2 versions of the metrics 2 systems, I set about cleaning up what metrics HBase exposes, how those metrics are exposed, naming, and performance of gathering the data.

Metrics2 uses either annotations or sources to expose metrics. Since HBase can’t require any part of the metrics2 system in its core classes, I exposed all metrics from HBase by creating sources. For metrics that are known ahead of time, I created wrappers around classes in the core of HBase that the metrics2 shims could interrogate for values. Here is an example of how HRegionServer’s metrics (the non-dynamic metrics) are exposed:
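A minimal sketch of this wrapper pattern (the names below are illustrative, not the actual interfaces in the hbase-hadoop-compat modules): the core server implements a plain wrapper with no metrics2 types on it, and the shim interrogates that wrapper when the metrics system asks for a snapshot.

// Implemented by HRegionServer-side code; contains no Hadoop metrics2 types,
// so the core of HBase stays independent of the Hadoop version.
interface RegionServerWrapper {
  long getReadRequestCount();
  long getWriteRequestCount();
}

// Lives in the compatibility shim; interrogates the wrapper for values and
// hands them to whichever metrics2 implementation is on the classpath.
class RegionServerMetricsSource {
  private final RegionServerWrapper wrapper;

  RegionServerMetricsSource(RegionServerWrapper wrapper) {
    this.wrapper = wrapper;
  }

  void snapshot() {
    long reads = wrapper.getReadRequestCount();
    long writes = wrapper.getWriteRequestCount();
    System.out.println("reads=" + reads + " writes=" + writes);
    // ... publish the values through the Hadoop 1 or Hadoop 2 metrics system ...
  }
}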



The above pattern can be repeated to expose a great deal of the metrics that HBase has. However, metrics about specific regions are still very interesting and can’t be exposed following the above pattern, so a new solution was needed that would allow metrics about regions to be exposed by whichever HRegionServer is hosting that region. To complicate things further, Hadoop’s metrics2 system needs one MetricsSource to be responsible for all metrics that are going to be exposed through a JMX mbean. In order for metrics about regions to be well laid out, HBase needs a way to aggregate metrics from multiple regions into one source. This source is then responsible for knowing which regions are assigned to the region server. These requirements led me to have one aggregation source that contains pseudo-sources for each region. These pseudo-sources each contain a wrapper around the region. This leads to something that looks like this:
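Sketched in code (again with invented names), the aggregation looks roughly like this: a single source owns the JMX-facing side and holds one pseudo-source per region, and the underlying map is only touched when a region opens or closes.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// One pseudo-source per region; in the real system it wraps the region and
// reads values from it on demand.
class RegionPseudoSource {
  private final String regionName;
  RegionPseudoSource(String regionName) { this.regionName = regionName; }
  String getRegionName() { return regionName; }
}

// The single aggregating source registered with the metrics system. The map
// changes only on region open/close, not on every read or write.
class RegionAggregateSource {
  private final ConcurrentMap<String, RegionPseudoSource> regions =
      new ConcurrentHashMap<String, RegionPseudoSource>();

  void onRegionOpen(String regionName) {
    regions.put(regionName, new RegionPseudoSource(regionName));
  }

  void onRegionClose(String regionName) {
    regions.remove(regionName);
  }

  void snapshot() {
    for (RegionPseudoSource source : regions.values()) {
      // emit this region's metrics under its own name
      System.out.println(source.getRegionName());
    }
  }
}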

Benefits

That’s a lot of work to re-do a previously working metrics system, so what was gained by all this work? The entire system is much easier to test in unit and system tests. The whole system has been made more regular; that is, everything follows the same patterns and naming conventions. Finally, everything has been rewritten to be faster.

Since the previous metrics had all been added as needed, they were not all named well. Some metrics were named following the pattern “metricNameCount”, others were named following “numMetricName”, while still others were named like “metricName_Count”. This made parsing hard and gave a generally chaotic feel. After the overhaul, counter metrics start with the camel-cased metric name followed by the suffix “Count.” The mbeans were also poorly laid out. Some metrics were spread out between two mbeans. Metrics about a region were under an mbean named Dynamic, not the most descriptive name. Now mbeans are much better organized and have better descriptions.

Tests have found that single threaded scans run as much as 9% faster after HBase’s old metrics system has been replaced. The previous system used lots of ConcurrentHashMaps to store dynamic metrics. Every access that mutated these metrics required a lookup into these large hash maps. The new system minimizes the use of maps. Instead, every region or server exports metrics to one pseudo-source. The only changes to hash maps in the metrics system occur on region close or open.

Conclusions

Overall the whole system is just better. The process was long and laborious, but worth it to make sure that HBase’s metrics system is in a good state. HBase 0.95, and later 0.96, will have the new metrics system.  There’s still more work to be completed but great strides have been made.

Thursday Apr 25, 2013

HBase - Who needs a Master?

By Matteo Bertozzi (mbertozzi at apache dot org), HBase Committer and Engineer on the Cloudera HBase Team.

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.


HBase provides low-latency random reads and writes on top of HDFS and it’s able to handle petabytes of data. One of the interesting capabilities in HBase is Auto-Sharding, which simply means that tables are dynamically distributed by the system when they become too large.

Regions and Region Servers

The basic unit of scalability in HBase, the one that provides horizontal scalability, is called a Region. Regions are a subset of the table’s data, and they are essentially a contiguous, sorted range of rows that are stored together.

Initially, there is only one region for a table.  When a region becomes too large after adding more rows, it is split into two at the middle key, creating two roughly equal halves.

Looking back at the HBase architecture, the slaves are called Region Servers. Each Region Server is responsible for serving a set of regions, and one Region (i.e. a range of rows) can be served by only one Region Server.


The HBase architecture has two main services: the HMaster, which is responsible for coordinating Regions in the cluster and executing administrative operations, and the HRegionServer, which is responsible for handling a subset of the table’s data.

HMaster, Region Assignment and Balancing

The HBase Master coordinates the HBase Cluster and is responsible for administrative operations.


A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup and the master can decide to move a Region from one Region Server to another as the result of a load balance operation. The Master also handles Region Server failures by assigning the region to another Region Server.


The mapping of Regions to Region Servers is kept in a system table called META. By reading META, you can identify which region is responsible for your key. This means that for read and write operations, the master is not involved at all, and clients can go directly to the Region Server responsible for serving the requested data.

Locating a Row-Key: Which Region Server is responsible?

To put or get a row, clients don’t have to contact the master; they can contact the Region Server that handles the specified row directly. In the case of a scan, clients can contact the set of Region Servers responsible for handling the requested range of keys directly.

To identify the Region Server, the client does a query on the META table. META is a system table used to keep track of regions. It contains the server name and a region identifier comprised of a table name and the start row-key. By looking at the start-key and the next region’s start-key, clients are able to identify the range of rows contained in a particular region.

The client keeps a cache of the region locations. This avoids the need for clients to hit the META table every time an operation on the same region is issued. In case of a region split or a move to another Region Server (due to balancing, or assignment policies), the client will receive an exception as a response, and the cache will be refreshed by fetching the updated information from the META table.
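Conceptually, the client-side cache behaves like a sorted map from region start key to location, and locating a row is a floor lookup. A simplified sketch (String keys are used here for readability; real row keys are byte arrays compared lexicographically):

import java.util.Map;
import java.util.TreeMap;

public class RegionLocationCache {
  // region start row-key -> name of the Region Server currently serving it
  private final TreeMap<String, String> startKeyToServer = new TreeMap<String, String>();

  void cache(String regionStartKey, String serverName) {
    startKeyToServer.put(regionStartKey, serverName);
  }

  // The region responsible for a row is the one with the greatest start key
  // that is less than or equal to the row key.
  String locate(String rowKey) {
    Map.Entry<String, String> entry = startKeyToServer.floorEntry(rowKey);
    return entry == null ? null : entry.getValue(); // null -> go ask META
  }

  // On a split or move the client gets an exception, drops the stale entry,
  // and re-reads META.
  void invalidate(String regionStartKey) {
    startKeyToServer.remove(regionStartKey);
  }
}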


Since META is a table like the others, the client has to identify on which server META is located. The META location is stored in a ZooKeeper node on assignment by the Master, and the client reads the node directly to get the address of the Region Server that contains META.


The original design was based on BigTable with another table called -ROOT- containing the META locations and ZooKeeper pointing to it. HBase 0.96 removed that in favor of ZooKeeper only since META cannot be split and therefore consists of a single region.

Client API - Master and Regions responsibilities

The HBase java client API is composed of two main interfaces.

  • HBaseAdmin: allows interaction with the “table schema” by creating/deleting/modifying tables, and it allows interaction with the cluster by assigning/unassigning regions, merging regions together, calling for a flush, and so on.  This interface communicates with the Master.

  • HTable: allows the client to manipulate the data of a specified table, by using get, put, delete and all the other data operations.  This interface communicates directly with the Region Servers responsible for handling the requested set of keys.

Those two interfaces have separate responsibilities: HBaseAdmin is only used to execute admin operations and communicate with the Master while the HTable is used to manipulate data and communicate with the Regions.
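For example, using the 0.94-era client API (the table, family, and column names below are made up, and error handling is omitted), the admin calls talk to the Master while the Put and Get go straight to the Region Servers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AdminVsTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Admin operations: communicate with the Master.
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));
    admin.createTable(desc);

    // Data operations: communicate directly with the Region Servers.
    HTable table = new HTable(conf, "mytable");
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("c1"), Bytes.toBytes("value"));
    table.put(put);

    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("c1"))));

    table.close();
    admin.close();
  }
}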

Conclusion

As we’ve seen in this article, having a Master/Slave architecture does not mean that each operation goes through the master. To read and write data, the HBase client in fact goes directly to the specific Region Server responsible for handling the row keys for all the data operations (HTable). The Master is used by the client only for table creation, modification and deletion operations (HBaseAdmin).

Although there exists a concept of a Master, the HBase client does not depend on it for data operations and the cluster can keep serving data even if the master goes down. 

Friday Jan 11, 2013

Apache HBase Internals: Locking and Multiversion Concurrency Control

by Gregory Chanan

HBase Committer

Cloudera, Inc. 

NOTE: This blog post describes how Apache HBase does concurrency control.  This assumes knowledge of the HBase write path, which you can read more about in this other blog post.

Introduction

Apache HBase provides a consistent and understandable data model to the user while still offering high performance.  In this blog, we’ll first discuss the guarantees of the HBase data model and how they differ from those of a traditional database.  Next, we’ll motivate the need for concurrency control by studying concurrent writes and then introduce a simple concurrency control solution.  Finally, we’ll study read/write concurrency control and discuss an efficient solution called Multiversion Concurrency Control.

Why Concurrency Control?

In order to understand HBase concurrency control, we first need to understand why HBase needs concurrency control at all; in other words, what properties does HBase guarantee about the data that requires concurrency control?

The answer is that HBase guarantees ACID semantics per-row.  ACID is an acronym for:
• Atomicity: All parts of transaction complete or none complete
• Consistency: Only valid data written to database
• Isolation: Parallel transactions do not impact each other’s execution
• Durability: Once transaction committed, it remains

If you have experience with traditional relational databases, these terms may be familiar to you.  Traditional relational databases typically provide ACID semantics across all the data in the database; for performance reasons, HBase only provides ACID semantics on a per-row basis.  If you are not familiar with these terms, don’t worry.  Instead of dwelling on the precise definitions, let’s look at a couple of examples.

Writes and Write-Write Synchronization

Consider two concurrent writes to HBase that represent {company, role} combinations I’ve held:



Image 1.  Two writes to the same row

From the previously cited Write Path Blog Post, we know that HBase will perform the following steps for each write:

(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each data cell [the (row, column) pair] to the memstore
List 1. Simple list of write steps

That is, we write to the WAL for disaster recovery purposes and then update an in-memory copy (MemStore) of the data.

Now, assume we have no concurrency control over the writes and consider the following order of events:



Image 2.  One possible order of events for two writes

At the end, we are left with the following state:


Image 3.  Inconsistent result in absence of write-write synchronization

which is a role I’ve never held.  In ACID terms, we have not provided Isolation for the writes, as the two writes became intermixed.

We clearly need some concurrency control.  The simplest solution is to provide exclusive locks per row in order to provide isolation for writes that update the same row.  So, our new list of steps for writes is as follows (new steps are in blue).

(0) Obtain Row Lock
(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each cell to the memstore
(3) Release Row Lock
List 2: List of write-steps with write-write synchronization

Read-Write Synchronization

So far, we’ve added row locks to writes in order to guarantee ACID semantics.  Do we need to add any concurrency control for reads?  Let’s consider another order of events for our example above (Note that this order follows the rules in List 2):
Image 4.  One possible order of operations for two writes and a read

Assume no concurrency control for reads and that we request a read concurrently with the two writes.  Assume the read is executed directly before “Waiter” is written to the MemStore; this read action is represented by a red line above.  In that case, we will again read the inconsistent row:


Image 5.  Inconsistent result in absence of read-write synchronization

Therefore, we need some concurrency control to deal with read-write synchronization.  The simplest solution would be to have the reads obtain and release the row locks in the same manner as the writes.  This would resolve the ACID violation, but the downside is that our reads and writes would both contend for the row locks, slowing each other down.

Instead, HBase uses a form of Multiversion Concurrency Control (MVCC) to avoid requiring the reads to obtain row locks.  Multiversion Concurrency Control works in HBase as follows:

For writes:
(w1) After acquiring the RowLock, each write operation is immediately assigned a write number
(w2) Each data cell in the write stores its write number.
(w3) A write operation completes by declaring it is finished with the write number.

For reads:
(r1)  Each read operation is first assigned a read timestamp, called a read point.
(r2)  The read point is assigned to be the highest integer x such that all writes with write number <= x have been completed.
(r3)  A read r for a certain (row, column) combination returns the data cell with the matching (row, column) whose write number is the largest value that is less than or equal to the read point of r.
List 3. Multiversion Concurrency Control steps
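A minimal sketch of the bookkeeping described in List 3 (this is my own simplification for illustration, not HBase’s actual MultiVersionConsistencyControl class):

import java.util.SortedSet;
import java.util.TreeSet;

public class SimpleMvcc {
  private long nextWriteNumber = 1;  // next write number to hand out (w1)
  private long readPoint = 0;        // highest x with all writes <= x complete (r2)
  private final SortedSet<Long> pendingWrites = new TreeSet<Long>();

  // (w1) assign a write number while the row lock is held
  public synchronized long beginWrite() {
    long wn = nextWriteNumber++;
    pendingWrites.add(wn);
    return wn;
  }

  // (w3) declare the write finished and advance the read point if possible
  public synchronized void completeWrite(long wn) {
    pendingWrites.remove(wn);
    long candidate = pendingWrites.isEmpty()
        ? nextWriteNumber - 1
        : pendingWrites.first() - 1;
    if (candidate > readPoint) {
      readPoint = candidate;
    }
  }

  // (r1)/(r2) a read takes the current read point; it then only returns cells
  // whose write number is <= this value (r3)
  public synchronized long getReadPoint() {
    return readPoint;
  }
}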

Let’s look at the operations in Image 4 again, this time using MultiVersion Concurrency Control:
Image 6.  Write steps with Multiversion Concurrency Control

Notice the new steps introduced for Multiversion Concurrency Control.  Each write is assigned a write number (step w1), each data cell is written to the memstore with its write number (step w2, e.g. “Cloudera [wn=1]”) and each write completes by finishing its write number (step w3).

Now, let’s consider the read in Image 4, i.e. a read that begins after step “Restaurant [wn=2]” but before the step “Waiter [wn=2]”.  From rule r1 and r2, its read point will be assigned to 1.  From r3, it will read the values with write number of 1, leaving us with:

Image 7.  Consistent answer with Multiversion Concurrency Control

A consistent response without requiring locking the row for the reads!

Let’s put this all together by listing the steps for a write with Multiversion Concurrency Control: (new steps required for read-write synchronization are in red):
(0) Obtain Row Lock
(0a) Acquire New Write Number
(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each cell to the memstore
(2a) Finish Write Number
(3) Release Row Lock

Conclusion

In this blog we first defined HBase’s row-level ACID guarantees.  We then demonstrated the need for concurrency control by studying concurrent writes and introduced a row-level locking solution.  Finally, we investigated read-write concurrency control and presented an efficient mechanism called Multiversion Concurrency Control (MVCC).

This blog post is accurate as of HBase 0.92.  HBase 0.94 has various optimizations, e.g. HBASE-5541, which will be described in a future blog post.

 

Thursday Mar 29, 2012

HBase Project Management Committee Meeting Minutes, March 27th, 2012

The bulk of the HBase Project Management Committee (PMC) met at the StumbleUpon offices in San Francisco, March 27th, 2012, ahead of the HBase Meetup that happened later in the evening. Below we post the agenda and minutes to let you all know what was discussed but also to solicit input and comment.

Agenda

A suggested agenda had been put on the PMC mailing list the day before the meeting soliciting that folks add their own items ahead of the meeting. The hope was that we'd put the agenda out on the dev mailing list before the meeting started to garner more suggestions but that didn’t happen. The agenda items below sort of followed on from the contributor pow-wow we had at salesforce a while back -- pow-wow agenda, pow-wow minutes (or see summarized version) -- but it was thought that we could go into more depth on a few of the topics raised then, in particular, project existential matters.

Here were the items put up for discussion:

  • Where do we want HBase to be in two years? What will success look like? Discuss. Make a short list.
  • How do we achieve the just made short list? What resources do we have to hand? What can we deploy in the short term to help achieve our just stated objectives? What do we need to prioritize in near future? Make a short list.
  • How do we exchange best practices/design decisions when developing new features? Sometimes there may be more things that can be shared if everyone follows the same best practices, and less features need to be implemented.

Minutes 

Attendees

  • Todd Lipcon
  • Ted Yu
  • Jean-Daniel Cryans
  • Jon Hsieh
  • Karthik Ranganathan
  • Kannan Muthukkaruppan
  • Andrew Purtell
  • Jon Gray
  • Nicolas Spiegelberg
  • Gary Helmling
  • Mikhail Bautin
  • Lars Hofhansl
  • Michael Stack


Todd, the secretary, took notes. St.Ack summarized his notes into the below minutes. The meeting started at about 4:20 (after the requisite 20 minutes dicking w/ A/V and dial-in setup).

“Where do we want HBase to be in two years? What will success look like? Discuss. Make a short (actionable) list.”

Secondary indexes and transactions were suggested, as were operations on a par with MySQL and rock-solid stability so HBase could be used as the primary copy of data. It was also suggested that we expend effort making HBase more usable out of the box (auto-tuning/defaults). Then followed discussion of who HBase is for. Big companies? Or new users, or startups? Is our goal stimulating demand and creating demand? Or is it to be reactive to what problems people are actually hitting? A dose of reality had it that while it would be nice to make all possible users happy, and even to talk about doing this, in actuality, we are mostly going to focus on what our employers need rather than prospective ‘customers’.

After this detour, the list making became more blunt and basic. It was suggested that we build a rock solid db which other people might be able to build on top of for higher-level stuff. The core engine needs to work reliably -- let’s do this first -- and then talk of new features and add-ons. Meantime, we can point users to coprocessors for building extensions and features w/o their needing to touch core HBase (It was noted that we are open to extending CPs if we have to extend the ‘control surface’ exposed, but that coverage has been pretty much sufficient up to this point). Usability is important but operability is more important. We don’t need to target n00bs. First nail ops being able to understand what’s going on.

After further banter, we arrived at a list: reliability, operability (insight into the running application, dynamic config. changes, usability improvements that make it easier on a clueful ops), and performance (in this order). It was offered that we are not too bad on performance -- especially in 0.94 -- and that use cases will drive the performance improvements, so focus should be on the first two items in the list.

“How do we achieve the just made short list?”

To improve reliability, testing has to be better. This has been said repeatedly in the past. It was noted that progress has been made at least on our unit test story (tests have been parallelized, and more of HBase is now testable because of refactorings). Progress on integration tests and/or contributions to Apache Bigtop has not been made. As is, Bigtop is a little "cryptic"; it’s a different harness with shim layers to insulate against version changes. We should help make it easier. We should add being able to do fault injection. Should HBase integration tests be done out in the open, continuously running on something like an EC2 cluster that all can observe? This is the Bigtop goal but it’s not yet there. Of note, EC2 can’t be used for validating performance. Too erratic. Bottom line, improving testing requires bodies. Resources such as hardware, etc., are relatively easy to come by. Hard is getting an owner and bodies to write the tests and test infrastructure. Each of us has our own hack testing setup. Doing a general test infrastructure, whether on Bigtop or not, is a bit of a chicken-and-egg problem. Let’s just make sure that whoever hits the need to do this general test infrastructure tooling first does it out in the open, and that we all pile on and help. Meantime we'll keep up w/ our custom test tools.

Regards the current state of reliability, it’s thought that, as is, we can’t run a cluster continuously w/ a chaos monkey operating. There are probably still “hundreds of issues” to be fixed before we’re at a state where we can run for days under “chaos monkey” conditions. For example, a recent experiment killed a rack in a large cluster and left it down for an hour; this turned up lots of bugs (on the other hand, we learned of an experiment done recently by another company, where the downtime was shorter, that ‘recovered’ without problems). Areas to explore for improving reliability include testing network failure scenarios, better MTTR, and better toleration of network partitions.

Also on reliability, what core fixes are outstanding? There are still races in the master and issues where bulk cluster operations -- enable/disable -- fail in the middle. We need a zookeeper-intent log (or something like Accumulo FATE) and table read/write locks. Meantime, kudos to the recently checked-in hbck work because this can do fixup until the missing fixes are put in place.

Regards operability, we need more metrics -- e.g. 95th/99th percentile metrics (some of these just got checked in) -- and better stories around backups, point-in-time recovery, and cross-column-family snapshots. We need to encourage/facilitate/deploy the move to an HA NameNode. Talk was about more metrics client-side rather than on the server. On metrics, what else do we need to expose? Log messages should be ‘actionable’; include help on what you need to do to solve an issue. Dynamic config. is going to help operability (here soon); we need to make sure that the important configs are indeed changeable on the fly.

It’s thought that performance is taking care of itself (witness the suite of changes in 0.94.0).

“How do we exchange best practices, etc, when developing new features?”

Many are solving similar problems. We don’t always need to build new features. Maybe ops tricks are enough in many cases (two clusters if you need two applications isolated, rather than building a bunch of multi-tenancy code). People often go deep into design/code, then post the design + code only after they have already spent a long time on it. It was suggested that discussion with the community should come first, before writing code and designs. Regards any feature proposal, it’s thought that the ‘why’ is the most important thing that needs justifying, not necessarily the technical design. Also, testability and disruption of core need to be considered when proposing a new feature. Designs or any document need to be short. Two pages at max. It’s hard for folks to spend the time if it goes on longer (Hadoop’s HEP process was mentioned as a proposal that failed).

General Discussion 

A general discussion went around on what to do about features that are not wanted, whether because of the justification, the design, or the code, and on how the latter is hardest to deal with, especially if a feature is large (code review takes time). Was noted that we don’t have to commit everything, that we can revert stuff, and that it’s ok to throw something away even if a bunch of work has been done (a recent Facebook example around reuse of block cache blocks was cited, where a bunch of code and review resulted in the conclusion that the benefits were inconclusive, so the project was abandoned). It was noted that the onus is on us to help contributors better understand what would be good things to work on to move the project forward. Was suggested that rather than a ‘sea of jiras’, we’d surface a short list of what we think are the important things to work on. General assent that roadmaps don’t work, but it should be easy to put up a list of what’s important in the near and long term for prospective contributors to pick up on. Was noted though that we also need to remain open to new features, etc., that don’t fit near-or-far-term project themes. This is open source after all. Was mentioned that we should work on high-level guidelines for how best to contribute. These need to be made public. Make a start, any kind of start, on defining the project focus / design rationale. It doesn’t have to be perfect - just put a few words on the page, maybe even up on the home page.

Meeting was adjourned around 5:45pm.
