Apache HBase

Friday April 11, 2014

The Effect of ColumnFamily, RowKey and KeyValue Design on HFile Size

By Doug Meil, HBase Committer and Thomas Murphy

Intro

One of the most common questions in the HBase user community is how to estimate the disk footprint of tables, which translates into the size of HFiles, the internal file format of HBase.

We designed an experiment at Explorys where we ran combinations of design-time options (rowkey length, column name length, row storage approach) and run-time options (HBase ColumnFamily compression, HBase data block encoding) to determine each factor’s effect on the resultant HFile size in HDFS.

HBase Environment

CDH4.3.0 (HBase 0.94.6.1)

Design Time Choices

  1. Rowkey

    1. Thin

      1. 16-byte MD5 hash of an integer.

    2. Fat

      1. 64-byte SHA-256 hash of an integer.

    3. Note: neither of these is a realistic rowkey for a real application, but they were chosen because they are easy to generate and one is a lot bigger than the other. (A sketch generating both variants appears after this list.)

  2. Column Names

    1. Thin

      1. 2-3 character column names (c1, c2).

    2. Fat

      1. 10 characters, randomly chosen but consistent for all rows.

    3. Note: it is advisable to have small column names, but most people don’t start that way, so we have this as an option.

  3. Row Storage Approach

    1. KeyValue per column

      1. This is the traditional way of storing data in HBase.

    2. One KeyValue per row

      1. Actually, two.

      2. One KV holds an Avro-serialized byte array containing all the data for the row.

      3. Another KV holds an MD5 hash of the version of the Avro schema.
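As a concrete illustration, here is a minimal sketch of generating the two rowkey variants in Java. It assumes the 16-byte thin key is the raw MD5 digest and the 64-byte fat key is the hex encoding of the (32-byte) SHA-256 digest; the class and method names are ours, not the experiment’s actual code.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class RowkeyVariants {

        // Thin rowkey: the raw MD5 digest of the integer is exactly 16 bytes.
        public static byte[] thinKey(int i) throws NoSuchAlgorithmException {
            return MessageDigest.getInstance("MD5")
                    .digest(String.valueOf(i).getBytes());
        }

        // Fat rowkey: hex-encoding the 32-byte SHA-256 digest yields 64 bytes.
        public static byte[] fatKey(int i) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(String.valueOf(i).getBytes());
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString().getBytes();
        }
    }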

Run Time

  1. ColumnFamily Compression

    1. None

    2. GZ

    3. LZ4

    4. LZO

    5. Snappy

    6. Note: it is generally advisable to use compression, but what if you didn’t? So we tested that too.

  2. HBase Data Block Encoding

    1. None

    2. Prefix

    3. Diff

    4. Fast Diff

    5. Note: most people aren’t familiar with HBase data block encoding. It is primarily intended for squeezing more data into the block cache, but it affects HFile size too. See HBASE-4218 for more detail. (A sketch configuring both options appears below.)

1000 rows were generated for each combination of table parameters. Not a ton of data, but we don’t need a ton of data to see how the table size varies. There were 30 columns per row: 10 strings (each filled with 20 bytes of random characters), 10 integers (random numbers), and 10 longs (also random numbers).
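For reference, here is a minimal sketch of building one such row as a traditional KV-per-column Put. The column-family name and the s/i/l column-name prefixes are our assumptions, not the experiment’s actual generator.

    import java.util.Random;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowGenerator {

        private static final byte[] CF = Bytes.toBytes("d"); // assumed CF name

        // One row: 10 random 20-character strings, 10 ints, and 10 longs,
        // each value stored in its own KeyValue (the traditional layout).
        public static Put buildRow(byte[] rowkey, Random rnd) {
            Put put = new Put(rowkey);
            for (int c = 0; c < 10; c++) {
                put.add(CF, Bytes.toBytes("s" + c), Bytes.toBytes(randomString(rnd, 20)));
                put.add(CF, Bytes.toBytes("i" + c), Bytes.toBytes(rnd.nextInt()));
                put.add(CF, Bytes.toBytes("l" + c), Bytes.toBytes(rnd.nextLong()));
            }
            return put;
        }

        private static String randomString(Random rnd, int len) {
            StringBuilder sb = new StringBuilder(len);
            for (int i = 0; i < len; i++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
            return sb.toString();
        }
    }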

The HBase blocksize was 128 KB.
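Both run-time options (and the blocksize) are properties of the column family, set at table-creation or alter time. Here is a minimal sketch against the 0.94-era client API, with the table and column-family names assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateTestTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HColumnDescriptor cf = new HColumnDescriptor("d");     // assumed CF name
            cf.setCompressionType(Compression.Algorithm.SNAPPY);   // or GZ, LZ4, LZO, NONE
            cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);  // or PREFIX, DIFF, NONE
            cf.setBlocksize(128 * 1024);                           // the 128 KB blocksize used here

            HTableDescriptor table = new HTableDescriptor("size-test"); // assumed table name
            table.addFamily(cf);
            admin.createTable(table);
            admin.close();
        }
    }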


Results

The easiest way to navigate the results is to compare specific cases, progressing from an initial implementation of a table to options for production.

Case #1: Fat Rowkey and Fat Column Names, Now What?

This is where most people start with HBase. Rowkeys are not as optimal as they should be (i.e., the Fat rowkey case) and column names are also inflated (Fat column-names).

Without CF Compression or Data Block Encoding, the baseline is:

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY-FATCOL

HFile Size (bytes)   Rows   Compression   Data Block Encoding
6,293,670            1000   NONE          NONE

What if we just changed CF compression?

This drastically changes the HFile footprint. Snappy compression reduces the HFile size from 6.2 MB to 1.8 MB, for example.

HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,362,033            1000   GZ            NONE
1,803,240            1000   SNAPPY        NONE
1,919,265            1000   LZ4           NONE
1,950,306            1000   LZO           NONE

However, we shouldn’t be too quick to celebrate. Remember that this is just the disk footprint. Over the wire the data is uncompressed, so 6.2 MB is still being transferred from RegionServer to Client when doing a Scan over the entire table.

What if we just changed data block encoding?

Compression isn’t the only option, though. Even without compression, we can change the data block encoding and still achieve a reduction in HFile size. All of the options improve on the 6.2 MB baseline.


HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,491,000            1000   NONE          DIFF
1,492,155            1000   NONE          FAST_DIFF
2,244,963            1000   NONE          PREFIX

Combination

The following table shows the results of all remaining CF compression / data block encoding combinations.

HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,146,675            1000   GZ            DIFF
1,200,471            1000   GZ            FAST_DIFF
1,274,265            1000   GZ            PREFIX
1,350,483            1000   SNAPPY        DIFF
1,358,190            1000   LZ4           DIFF
1,391,016            1000   SNAPPY        FAST_DIFF
1,402,614            1000   LZ4           FAST_DIFF
1,406,334            1000   LZO           FAST_DIFF
1,541,151            1000   SNAPPY        PREFIX
1,597,440            1000   LZO           PREFIX
1,622,313            1000   LZ4           PREFIX

Case #2: What if we re-designed the column names (and left the rowkey alone)?

Let’s assume that we re-designed our column names but left the rowkey alone. Using the “thin” column names without CF compression or data block encoding results in an HFile 5.8 MB in size, an improvement over the original 6.2 MB baseline. It doesn’t seem like much, but it’s still roughly an 8% reduction in the eventual wire-transfer footprint.

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATKEY

HFile Size (bytes)   Rows   Compression   Data Block Encoding
5,778,888            1000   NONE          NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,349,451            1000   SNAPPY        DIFF
1,390,422            1000   SNAPPY        FAST_DIFF
1,536,540            1000   SNAPPY        PREFIX
1,785,480            1000   SNAPPY        NONE

Case #3: What if we re-designed the rowkey (and left the column names alone)?

In this example, what if we only redesigned the rowkey? Using the “thin” rowkey results in an HFile size of 4.9 MB, down from the 6.2 MB baseline, roughly a 22% reduction. Not a small savings!

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-FATCOL

HFile Size (bytes)   Rows   Compression   Data Block Encoding
4,920,984            1000   NONE          NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,295,895            1000   SNAPPY        DIFF
1,337,112            1000   SNAPPY        FAST_DIFF
1,489,446            1000   SNAPPY        PREFIX
1,739,871            1000   SNAPPY        NONE

However, note that the resulting HFile size with Snappy and no data block encoding (1.7 MB) is very similar to the baseline approach (i.e., fat rowkeys and fat column names) with Snappy and no data block encoding (1.8 MB). Why? CF compression can compensate on disk for a lot of bloat in rowkeys and column names.

Case #4: What if we re-designed both the rowkey and the column names?

By this time we’ve learned enough HBase to know that we need efficient rowkeys and column names. This produces an HFile that is 4.4 MB, roughly a 30% savings over the 6.2 MB baseline.

HFile Size (bytes)   Rows   Compression   Data Block Encoding
4,406,418            1000   NONE          NONE

Applying Snappy compression can reduce the HFile size further:

HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,296,402            1000   SNAPPY        DIFF
1,338,135            1000   SNAPPY        FAST_DIFF
1,485,192            1000   SNAPPY        PREFIX
1,732,746            1000   SNAPPY        NONE

Again, the on-disk footprint with compression isn’t radically different from the others, as compression can compensate to a large degree for rowkey and column-name bloat.

Case #5: KeyValue Storage Approach (1 KV per Row vs. KV per Column)

What if we did something radical and changed how we stored the data in HBase? With this approach, we are using a single KeyValue per row holding all of the columns of data for the row instead of a KeyValue per column (the traditional way).
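To make the approach concrete, here is a minimal sketch of packing a whole row into one Avro-serialized KeyValue plus a second KeyValue carrying a hash of the schema. The column-family and qualifier names are our assumptions, not the experiment’s actual code.

    import java.io.ByteArrayOutputStream;
    import java.security.MessageDigest;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvroRowWriter {

        private static final byte[] CF = Bytes.toBytes("d"); // assumed CF name

        // Serializes an already-populated Avro record into a single KV, plus a
        // second KV holding the MD5 of the schema so readers can detect the version.
        public static Put buildRow(byte[] rowkey, Schema schema, GenericRecord row)
                throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(row, enc);
            enc.flush();

            Put put = new Put(rowkey);
            put.add(CF, Bytes.toBytes("data"), out.toByteArray()); // the entire row
            put.add(CF, Bytes.toBytes("schema"),                   // schema version hash
                    MessageDigest.getInstance("MD5").digest(Bytes.toBytes(schema.toString())));
            return put;
        }
    }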

The resulting HFile, even uncompressed and without data block encoding, is radically smaller at 1.4 MB compared to 6.2 MB.

psudorandom-table-R1000-i10-s10_20-l10-NONE-NONE-AVRO

HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,374,465            1000   NONE          NONE

Adding Snappy compression and Data Block Encoding makes the resulting HFile size even smaller.

HFile Size (bytes)   Rows   Compression   Data Block Encoding
1,119,330            1000   SNAPPY        DIFF
1,129,209            1000   SNAPPY        FAST_DIFF
1,133,613            1000   SNAPPY        PREFIX
1,150,779            1000   SNAPPY        NONE

Compare the 1.1 MB HFile produced here with Snappy and no encoding to the 1.7 MB HFile from the thin-rowkey/thin-column-name case with the same Snappy/no-encoding settings.

Summary

Although compression and data block encoding can wallpaper over bad rowkey and column-name decisions in terms of HFile size, you will pay the price for this in data transfer from RegionServer to Client. Moreover, concealing the size penalty brings a performance penalty each time the data is accessed or manipulated. So the old advice about designing rowkeys and column names carefully still holds.

In terms of KeyValue approach, having a single KeyValue per row presents significant savings both in data transfer (RegionServer to Client) and in HFile size. However, this approach requires updating each row in its entirety, and old versions of the row are also stored in their entirety (as opposed to column-by-column changes). Furthermore, it is impossible to scan on select columns; the whole row must be retrieved and deserialized to access any information stored in it (see the sketch below). The importance of understanding this tradeoff cannot be overstated, and it must be evaluated on an application-by-application basis.
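As a sketch of that read-side cost (using the same assumed names as the write sketch in Case #5): even to read a single field, the client must fetch and decode the entire Avro blob.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvroRowReader {

        // Even a single-field read requires fetching and decoding the whole row.
        public static Object readField(HTable table, byte[] rowkey, Schema schema,
                                       String field) throws Exception {
            Get get = new Get(rowkey);
            get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("data")); // assumed names
            Result result = table.get(get);
            byte[] blob = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("data"));

            GenericRecord row = new GenericDatumReader<GenericRecord>(schema)
                    .read(null, DecoderFactory.get().binaryDecoder(blob, null));
            return row.get(field); // the rest of the decoded row is discarded
        }
    }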

Software engineering is the art of managing tradeoffs, so there isn’t necessarily one “best” answer. Importantly, this experiment measures only file size, not the time or processor costs imposed by compression, encoding, or Avro serialization. The results are also based on certain assumptions, and your mileage may vary.

Here is the data if interested: http://people.apache.org/~dmeil/HBase_HFile_Size_2014_04.csv

Comments:

RPC compression was added in 0.96; does that significantly change this?

Posted by Brian on July 28, 2014 at 01:52 PM GMT #

Thanks. Very well done.

Posted by Hassan Ergene on November 20, 2014 at 05:18 AM GMT #

Thanks for a good article. May I know the command you used to get the consolidated footprint of the HFiles?

Posted by Naga on October 12, 2015 at 09:03 PM GMT #

Hi, can anyone offer a suggestion? I have a very long string that, per my use case, must go in the rowkey. How can I reduce the rowkey size? One option is MD5, hashing the same way at retrieval time, but then I lose the actual text; another option is some form of encoding. Any inputs?

Posted by Manjeet on July 21, 2016 at 11:49 AM GMT #
