Apache Phoenix

Thursday February 28, 2019

NoSQL Day 2019

On May 21st in Washington, DC, there will be a one-day community event for Apache Accumulo, HBase, and Phoenix called NoSQL Day. We hope that these three Apache communities can come together to share stories from the field and learn from one another. The event is being offered by the DataWorks Summit organization, alongside their DataWorks Summit event May 20th through 23rd.

At this time, we are looking for speakers, attendees, and sponsors for the event. From speakers, we hope to see a wide breadth of subjects, anything from performance and scaling to real-life applications, dev-ops, and best practices. All speakers are welcome! Abstracts can be submitted here.

For attendees, we want to bring together the best and brightest from each of the respective communities, because the organizers believe we have much to learn from each other. We've tried to keep costs down to make this approachable for all.

Finally, sponsors are what make it possible to offer events like these at a low cost to attendees. If you are interested in a corporate sponsorship, please feel free to contact Josh Elser for more information.

For general questions, please feel free to mail Josh Elser or the Phoenix user mailing list.

Saturday July 14, 2018

Apache Phoenix releases next major version 5.0.0

The Apache Phoenix team is pleased to announce the release of its next major version, 5.0.0, compatible with HBase 2.0+. Apache Phoenix enables SQL-based OLTP and operational analytics for Apache Hadoop using Apache HBase as its backing store and providing integration with other projects in the Apache ecosystem such as Spark, Hive, Pig, Flume, and MapReduce.

The 5.0.0 release has feature parity with the recently released 4.14.0. Highlights of the release include:

Source and binary downloads are available here.

Tuesday June 12, 2018

Announcing Phoenix 4.14 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.14.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Apache Hadoop using Apache HBase as its backing store and providing integration with other projects in the Apache ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. This 4.x release is compatible with HBase 0.98, 1.1, 1.2, 1.3, and 1.4, and with CDH 5.11.2, 5.12.2, 5.13.2, and 5.14.2.

Highlights of the release include:

Source and binary downloads are available here.

Monday January 22, 2018

Announcing CDH compatible Phoenix 4.13.2 released

The Apache Phoenix team is pleased to announce the immediate availability of a CDH compatible 4.13.2 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Apache Hadoop using Apache HBase as its backing store and providing integration with other projects in the Apache ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98, 1.1, 1.2 and 1.3.

Highlights of the release include:

  • Compatibility with the CDH 5.11.2 release
  • A new parcels directory that can be used directly as a parcel repository from Cloudera Manager
  • Numerous bug fixes above and beyond the 4.13.0 release

Source and binary downloads are available here.

Sunday November 12, 2017

Announcing Phoenix 4.13 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.13.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Apache Hadoop using Apache HBase as its backing store and providing integration with other projects in the Apache ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98 and 1.3.

Highlights of the release include:

Source and binary downloads are available here.

Wednesday October 11, 2017

Announcing Phoenix 4.12 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.12.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Apache Hadoop using Apache HBase as its backing store and providing integration with other projects in the Apache ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98/1.1/1.2/1.3.

Highlights of the release include:

Source and binary downloads are available here.

Friday July 07, 2017

Announcing Phoenix 4.11 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.11.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Apache Hadoop using Apache HBase as its backing store and providing integration with other projects in the Apache ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98/1.1/1.2/1.3.

Highlights of the release include:

Source and binary downloads are available here.

Thursday March 23, 2017

Announcing Phoenix 4.10 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.10.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Hadoop using Apache HBase as its backing store and providing integration with other projects in the ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98/1.1/1.2.

Highlights of the release include:

Source and binary downloads are available here.

Column Mapping and Immutable Data Encoding

With Phoenix 4.10, we are rolling out a new feature that introduces a layer of column mapping between Phoenix column names and HBase column qualifiers. We have also added the capability of packing all column values for a column family into a single HBase cell. These improvements help performance across the board for the majority of use cases. In this blog, I will provide a bit more detail on these performance improvements.

Column Mapping

The motivation behind column mapping came from PHOENIX-1598. The key idea is to use number-based HBase column qualifiers for non-PK Phoenix columns instead of using column names directly. This spares Phoenix the binary search otherwise needed to find a cell in the sorted list of cells returned by HBase, which improves the performance of certain queries (like ORDER BY or GROUP BY on a non-PK axis) as the number of non-PK columns goes up.

The indirection also enables fast DDL operations like column renames (PHOENIX-2341) and metadata-level column drops (PHOENIX-3680). Further, because these number-based qualifiers are generally smaller (1 to 4 bytes) than column names, the on-disk size of tables is smaller, which improves performance across the board.

To compare performance and disk-space usage, we loaded 600 million rows of TPC-H data for the LINEITEM table (downloaded from here) onto our test cluster using 1-byte qualifiers. The HDFS disk size with column mapping was about 40% smaller (100 GB) than with non-column-mapped tables (160 GB). As a consequence, the queries in the TPC-H benchmark against the LINEITEM table (obtained from here) were also found to be 30-40% faster.

Column mapping also enables us to write custom projection and comparison filters that improve query performance as the number of columns being projected or filtered on goes up (PHOENIX-3667). We ran a test comparing query performance against non-column-mapped and column-mapped tables as the number of columns grows. As the graph below shows, the more columns were projected, the bigger the performance gain from the new filter.

Using column mapping is generally recommended unless you expect the number of columns in your table and the views on it to exceed 2147483647 (which is a lot!). Keep in mind, though, that for mutable tables this limit applies across all column families, while for immutable tables using the SINGLE_CELL_ARRAY_WITH_OFFSETS encoding scheme it applies per column family. In general, we expect the 2-byte column mapping scheme, which gives you 65535 columns, to be good enough. These defaults can be overridden using various table properties and configs. For more details on how to use column mapping and immutable data encoding, go here.
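As a rough illustration (the table and column names here are hypothetical), a table can opt in to a specific mapping scheme at creation time via the COLUMN_ENCODED_BYTES table property:

-- A minimal sketch: 2-byte qualifiers allow up to 65535 columns
CREATE TABLE METRICS (
    HOST CHAR(50) NOT NULL,
    METRIC_DATE DATE NOT NULL,
    CORE_USAGE BIGINT,
    DB_USAGE BIGINT,
    CONSTRAINT PK PRIMARY KEY (HOST, METRIC_DATE))
    COLUMN_ENCODED_BYTES = 2

Setting COLUMN_ENCODED_BYTES = 0 opts a table out of column mapping entirely, preserving the legacy behavior of using column names as qualifiers.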

Immutable Data Encoding

The immutable storage scheme SINGLE_CELL_ARRAY_WITH_OFFSETS packs all the columns belonging to a column family into a single cell, which drastically reduces the size of immutable data and yields faster performance across the board.

To compare query performance between immutable encoded and non-encoded tables, we created a table with 25 VARCHAR non-PK columns, each column name being 10 characters long and each value 15 characters wide. The table was dense, i.e. more than 50% of the columns had values. HBase FAST_DIFF encoding was enabled, as is the default for Phoenix tables. All queries were run with the NO_CACHE hint to negate the effect of the block cache on query performance. We also accounted for the effect of data sitting in the OS page cache by ignoring the query results of the first few runs.

As the graphs below show, using SINGLE_CELL_ARRAY_WITH_OFFSETS encoding drastically improves performance for most kinds of queries. Data load time for 1M records using UPSERT with a batch size of 1000 was 3x faster, as were aggregate queries and queries that filter on a key value column. There was no significant impact on point queries, though, which is expected.

It is important to note that this encoding can only be used together with one of the numbered column mapping schemes, because internally the encoding relies on the number-based column qualifiers to look up column values.
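As a minimal sketch (the table and column names are again hypothetical; see the documentation linked above for the authoritative syntax), an immutable table combines the storage scheme with a numbered mapping scheme like so:

-- Packed storage requires a numbered column mapping scheme
CREATE TABLE EVENT_LOG (
    EVENT_ID CHAR(15) NOT NULL PRIMARY KEY,
    PAYLOAD VARCHAR,
    SOURCE VARCHAR)
    IMMUTABLE_ROWS = true,
    IMMUTABLE_STORAGE_SCHEME = SINGLE_CELL_ARRAY_WITH_OFFSETS,
    COLUMN_ENCODED_BYTES = 2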

Future work/Limitations

Using the SINGLE_CELL_ARRAY_WITH_OFFSETS encoding scheme is recommended when the data is not sparse. Our general recommendation is to use this encoding when the data is sufficiently dense (around 50% of columns have values); with growing sparseness, the overhead of the encoding starts to negatively affect performance (PHOENIX-3559). We have also seen that with the default HBase block size of 64 KB, performance starts to degrade once the size of the packed cell exceeds 50 KB. By default, immutable multi-tenant tables use the ONE_CELL_PER_COLUMN encoding, because the way we assign column qualifiers for columns in views tends to make the data sparse, especially when columns are added to views (PHOENIX-3575). There is also work to be done to clean up data when a column is dropped from an immutable table with SINGLE_CELL_ARRAY_WITH_OFFSETS encoding (PHOENIX-3605).

Thursday December 01, 2016

Announcing Phoenix 4.9 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.9.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Hadoop using Apache HBase as its backing store and providing integration with other projects in the ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98/1.1/1.2.

Here are some of the highlights of the releases:

Source and binary downloads are available here.

Thursday August 18, 2016

Announcing Phoenix 4.8 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.8.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Hadoop using Apache HBase as its backing store and providing integration with other projects in the ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98/1.0/1.1/1.2.

Here are some of the highlights of the releases:

Source and binary downloads are available here.

Friday March 11, 2016

Announcing Phoenix 4.7 release with ACID transaction support

The Apache Phoenix team is pleased to announce the immediate availability of the 4.7.0 release. Apache Phoenix enables OLTP and operational analytics for Hadoop through SQL support and integration with other projects in the ecosystem such as Spark, HBase, Pig, Flume, and MapReduce.

Highlights of the release include:

Source and binary downloads are available here.

Tuesday November 03, 2015

New optimization for time series data in Apache Phoenix 4.6

Today's blog is brought to you by Samarth Jain, PMC member of Apache Phoenix, and Lead Member of the Technical Staff at Salesforce.com.

Apache Phoenix 4.6 now provides the capability of mapping a Phoenix primary key column to the native row timestamp of Apache HBase. The mapping is denoted by the ROW_TIMESTAMP keyword in the CREATE TABLE statement. Such a mapping provides two advantages:

  • Allows Phoenix to set the minimum time range on scans, since this column directly maps to the HBase cell timestamp. The presence of these time ranges lets HBase figure out which store files it should scan and which ones it can skip. This comes in handy especially for temporal data, where queries focus on the tail end of the data.
  • Enables Phoenix to leverage the existing optimizations in place when querying against primary key columns.

Let's look at an example with some performance numbers to understand when a ROW_TIMESTAMP column can help.

Sample schema:

For performance analysis, we created two identical tables, one with the new ROW_TIMESTAMP qualifier and one without. 

CREATE TABLE EVENTS_RTS (
    EVENT_ID CHAR(15) NOT NULL,
    EVENT_TYPE CHAR(3) NOT NULL,
    EVENT_DATE DATE NOT NULL,
    APPLICATION_TYPE VARCHAR,
    SOURCE_IP VARCHAR,
    CONSTRAINT PK PRIMARY KEY (
        EVENT_ID,
        EVENT_TYPE,
        EVENT_DATE ROW_TIMESTAMP))

The initial data load of 500 million records created data with EVENT_DATE set to dates over the last seven days. During the load, the tables went through region splits and major compactions. After the initial load, we ran a mixed read/write workload with writes (new records) arriving at 500K records per hour. Each new row was created with EVENT_DATE set to the current date/time.

Three sets of queries were executed that filtered on the EVENT_DATE column:

  • Newer than the last hour's event data
  • Newer than the last two days' event data
  • Outside of the time range of the event data

For example, the following query returns the number of rows for the last hour's worth of data:

SELECT COUNT(*) FROM EVENTS_RTS
WHERE EVENT_DATE > CURRENT_DATE() - 1/24
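The other two query shapes can be sketched along the same lines (illustrative only, reusing the schema and date arithmetic above):

-- Newer than the last two days of event data
SELECT COUNT(*) FROM EVENTS_RTS
WHERE EVENT_DATE > CURRENT_DATE() - 2

-- Outside the loaded time range (older than eight days)
SELECT COUNT(*) FROM EVENTS_RTS
WHERE EVENT_DATE < CURRENT_DATE() - 8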

Below is a graph showing the variation in query times over the tail end of the data (not major compacted) for the two tables.

Below is a tabular summary of the various time ranges that were tested over the non-major-compacted event data:

Time Range                    # Rows Returned   With Optimization (ms)   Without Optimization (ms)
CREATED IN LAST 1 MINUTE      16K               200                      4000
CREATED IN LAST 15 MINUTES    125K              700                      130000
CREATED IN LAST 1 HOUR        500K              2100                     500000
CREATED BEFORE LAST 8 DAYS    0                 100                      340000

As you can see from the results, using ROW_TIMESTAMP gives a huge performance boost when querying data that hasn't been major compacted. For already major-compacted data, the two tables show the same performance (i.e., there is no degradation). The query returning 0 records is a special case in which the date range falls outside the data that was loaded into the tables. Such a query returns almost instantaneously for EVENTS_RTS (0.1 seconds), while the same query on EVENTS_WITHOUT_RTS takes more than 300 seconds. With the time range information available on scans, HBase can quickly determine that no store files have data within the requested range, yielding a near-instant response.

Effect of HBase major compaction

An HBase store file (HFile) stores the time range (min and max row timestamps) of its data in its metadata. When a scan comes in, HBase looks at this metadata to determine whether it should read the store file to return the records the query has requested. As writes happen to an HBase table, the contents of the memstore are flushed to a new HFile once a threshold size is crossed. If queries target these newly created (tail-end) HFiles, using the ROW_TIMESTAMP column gives a huge performance boost, because the scans issued by Phoenix only need to read the newly created store files. Queries that do not use the ROW_TIMESTAMP column, on the other hand, potentially have to scan the entire table.

The performance benefits are negated, however, when HBase runs a major compaction on the table. Under the default compaction policy, when the number of HFiles exceeds a certain threshold or a pre-determined time period elapses, HBase performs a major compaction to consolidate the store files in a region into one. This effectively sets the time range of the lone store file to span all the data contained in that region. As a result, scans can no longer use time ranges to decide which store files to skip, since the lone store file contains all the data. Note that in this condition, the performance of a query using the ROW_TIMESTAMP column is the same as one without it.

In conclusion, if your table has a date-based primary key and your queries are geared towards the tail end of the data, you should consider using a ROW_TIMESTAMP column, as it could yield huge performance gains.

Potential Future Work

One question you may be asking yourself is: why does performance drop after a major compaction occurs, when performance is supposed to improve after compaction? Time series data is different from other data in that it is typically write-once and append-only. There are ways this property of the data can be exploited so that the better performance is maintained. For some excellent ideas along these lines, see Vladimir Rodionov's presentation from a previous HBase Meetup here.

Thursday August 13, 2015

Spatial data queries in Phoenix

Take a look at this excellent write-up by Dan Meany on implementing spatial data queries in Apache Phoenix using UDFs and secondary indexing: https://github.com/threedliteguy/General/wiki/Adding-spatial-data-queries-to-Phoenix-on-HBase

Thursday July 30, 2015

Announcing Phoenix 4.5 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.5.0 release. Phoenix is a relational database layer on top of Apache HBase, accessed as a JDBC driver, for querying, updating, and managing HBase tables using SQL. The 4.x releases are compatible with HBase 0.98/1.0/1.1.

Here are some of the highlights of the 4.4 and 4.5 releases:

Source and binary downloads are available here.
