Apache Phoenix

Thursday December 01, 2016

Announcing Phoenix 4.9 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.9.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Hadoop using Apache HBase as its backing store and providing integration with other projects in the ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98/1.1/1.2.

Here are some of the highlights of the releases:

Source and binary downloads are available here.

Thursday August 18, 2016

Announcing Phoenix 4.8 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.8.0 release. Apache Phoenix enables SQL-based OLTP and operational analytics for Hadoop using Apache HBase as its backing store and providing integration with other projects in the ecosystem such as Spark, Hive, Pig, Flume, and MapReduce. The 4.x releases are compatible with HBase 0.98/1.0/1.1/1.2.

Here are some of the highlights of the releases:

Source and binary downloads are available here.

Friday March 11, 2016

Announcing Phoenix 4.7 release with ACID transaction support

The Apache Phoenix team is pleased to announce the immediate availability of the 4.7.0 release. Apache Phoenix enables OLTP and operational analytics for Hadoop through SQL support and integration with other projects in the ecosystem such as Spark, HBase, Pig, Flume, and MapReduce.

Highlights of the release include:

Source and binary downloads are available here.

Tuesday November 03, 2015

New optimization for time series data in Apache Phoenix 4.6

Today's blog is brought to you by Samarth Jain, PMC member of Apache Phoenix, and Lead Member of the Technical Staff at Salesforce.com.

Apache Phoenix 4.6 now provides the capability of mapping a Phoenix primary key column to the native row timestamp of Apache HBase. The mapping is denoted by the keyword ROW_TIMESTAMP in the create table statement. Such a mapping provides the following two advantages: 

  • Allows Phoenix to set the min time range on scans since this column directly maps to the HBase cell timestamp. Presence of these time ranges lets HBase figure out which store files it should be scanning and which ones to skip. This comes in handy especially for temporal data when the queries are focused towards the tail end of the data.
  • Enables Phoenix to leverage the existing optimizations in place when querying against primary key columns.

Let's look at an example with some performance numbers to understand when a ROW_TIMESTAMP column could help.

Sample schema:

For performance analysis, we created two identical tables, one with the new ROW_TIMESTAMP qualifier and one without. 

CREATE TABLE EVENTS_RTS (
    EVENT_ID CHAR(15) NOT NULL,
    EVENT_TYPE CHAR(3) NOT NULL,
    EVENT_DATE DATE NOT NULL,
    APPLICATION_TYPE VARCHAR,
    SOURCE_IP VARCHAR
    CONSTRAINT PK PRIMARY KEY (
        EVENT_ID,
        EVENT_TYPE,
        EVENT_DATE ROW_TIMESTAMP))
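
For comparison, the control table without the optimization (referred to below as EVENTS_WITHOUT_RTS) would use the same schema, just without the ROW_TIMESTAMP designation on EVENT_DATE; a minimal sketch:

CREATE TABLE EVENTS_WITHOUT_RTS (
    EVENT_ID CHAR(15) NOT NULL,
    EVENT_TYPE CHAR(3) NOT NULL,
    EVENT_DATE DATE NOT NULL,
    APPLICATION_TYPE VARCHAR,
    SOURCE_IP VARCHAR
    CONSTRAINT PK PRIMARY KEY (
        EVENT_ID,
        EVENT_TYPE,
        EVENT_DATE))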

The initial data load of 500 million records created data with the event_date set to dates over the last seven days. During the load, the tables went through region splits and major compactions. After the initial load, we ran a mixed read/write workload with writes (new records) arriving at 500K records per hour. Each new row was created with EVENT_DATE as the current date/time.

Three sets of queries were executed that filtered on the EVENT_DATE column:

  • Newer than last hour's event data
  • Newer than last two days' event data
  • Outside of the time range of event data

For example, the following query would return the number of rows for the last hour's worth of data:

SELECT COUNT(*) FROM EVENTS_RTS
WHERE EVENT_DATE > CURRENT_DATE() - 1/24
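
The other two ranges can be expressed the same way; minimal sketches (the exact predicates are assumptions based on the ranges listed above):

-- Newer than the last two days' event data
SELECT COUNT(*) FROM EVENTS_RTS
WHERE EVENT_DATE > CURRENT_DATE() - 2

-- Outside the time range of the loaded data (older than eight days)
SELECT COUNT(*) FROM EVENTS_RTS
WHERE EVENT_DATE < CURRENT_DATE() - 8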

Below is the graph showing the variation of query times over the tail end of the data (not major compacted) for the two tables.

Below is a tabular summary of the various time ranges that were tested over the non-major compacted event data:

Time Range                    # Rows Returned    Duration (ms)
                                                 With Optimization    Without Optimization
CREATED IN LAST 1 MINUTE      16K                200                  4000
CREATED IN LAST 15 MINUTES    125K               700                  130000
CREATED IN LAST 1 HOUR        500K               2100                 500000
CREATED BEFORE LAST 8 DAYS    0                  100                  340000

As you can see from the results, using a ROW_TIMESTAMP gives a huge perf boost when querying over data that hasn't been major compacted. For already major compacted data, the two tables show the same performance (i.e., there is no degradation). The query returning 0 records is a special case in which the date range falls outside the data that was loaded into the tables. Such a query returns almost instantaneously for EVENTS_RTS (0.1 seconds), while the same query on EVENTS_WITHOUT_RTS takes more than 300 seconds. This is because, with the time range information available on scans, HBase was quickly able to figure out that no store files have data within the range, yielding a near instant response.

Effect of HBase major compaction

The HBase store file (HFile) stores the time range (min and max row timestamps) in its metadata. When a scan comes in, HBase is able to look at this metadata and figure out whether it should be scanning a store file to return the records the query has requested. When writes are happening to an HBase table, after crossing a threshold size, the contents of the memstore are flushed to an HFile. Now, if the queries are against the newly created (tail end of the data) HFiles, one would see a huge perf boost when using the ROW_TIMESTAMP column. This is because the scans issued by Phoenix would need to read only these newly created store files. On the other hand, queries not utilizing the row_timestamp column will potentially have to scan the entire table.

The perf benefits are negated, however, when HBase runs a major compaction on the table. In the default compaction policy, when the number of HFiles exceeds a certain threshold or when a pre-determined time period has elapsed, HBase performs a major compaction to consolidate the store files in a region into one. This effectively sets the time range of the lone store file to cover all the data contained within that region. As a result, scans are no longer able to filter out which store files to skip since the lone store file happens to contain all the data. Do note that in such a condition, the performance of the query with the row_timestamp column is the same as the one without.

In conclusion, if your table has a date based primary key and your queries are geared towards the tail-end of the data, you should think about using a row_timestamp column as it could yield huge performance gains.

Potential Future Work

One question you may be asking yourself is: "Why does performance drop after a major compaction occurs? I thought performance was supposed to improve after compaction." Time series data is different from other data in that it's typically write-once, append-only. There are ways that this property of the data can be exploited so that better performance is maintained. For some excellent ideas along these lines, see Vladimir Rodionov's presentation from a previous HBase Meetup here.

Thursday August 13, 2015

Spatial data queries in Phoenix

Take a look at this excellent write-up by Dan Meany on implementing spatial data queries on Apache Phoenix using UDFs and secondary indexing: https://github.com/threedliteguy/General/wiki/Adding-spatial-data-queries-to-Phoenix-on-HBase

Thursday July 30, 2015

Announcing Phoenix 4.5 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.5.0 release. Phoenix is a relational database layer on top of Apache HBase accessed as a JDBC driver for querying, updating, and managing HBase tables using SQL. The 4.x releases are compatible with HBase 0.98/1.0/1.1.

Here are some of the highlights of the 4.4 and 4.5 releases:

Source and binary downloads are available here.

Monday June 29, 2015

Spark Integration in Apache Phoenix

Today's blog is brought to you by our latest committer and the developer behind the Spark integration in Apache Phoenix, Josh Mahonin, a Software Architect at Interset.

PageRank with Phoenix and Spark

The Phoenix SQL interface provides a lot of great analytics capabilities on top of structured HBase data. Some tasks, however, such as machine learning or graph analysis, are more efficiently done using other tools like Apache Spark.

Since the Phoenix 4.4.0 release, the phoenix-spark module allows us to expose Phoenix tables as RDDs or DataFrames within Spark. From there, that same data can be used with other tools within Spark, such as the machine learning library MLlib, the graph engine GraphX, or Spark Streaming.

This example makes use of the Enron email test set from Stanford Network Analysis Project, and executes the GraphX implementation of PageRank on it to find interesting entities. It then saves the results back to Phoenix.

Note that runnable source code is also available on GitHub.

Prerequisites

  • Phoenix 4.4.0+
  • Spark 1.3.0+ (ensure the phoenix-client JAR is in the Spark driver classpath; see the setup guide)

Load sample data

Log in to a node with the Apache Phoenix binaries available. I will use localhost to refer to the Phoenix URL, but you may need to adjust this for your local environment:

cd /path/to/phoenix/bin
./sqlline.py localhost

Once in the SQLLine console, we'll create the tables to hold the input data, and the destination table for the PageRank results:

CREATE TABLE EMAIL_ENRON(
    MAIL_FROM BIGINT NOT NULL, 
    MAIL_TO BIGINT NOT NULL 
    CONSTRAINT pk PRIMARY KEY(MAIL_FROM, MAIL_TO));
CREATE TABLE EMAIL_ENRON_PAGERANK(
    ID BIGINT NOT NULL, 
    RANK DOUBLE 
    CONSTRAINT pk PRIMARY KEY(ID));

Use 'ctrl+d' to exit SQLLine.

Download and extract the file enron.csv.gz to a local directory, such as /tmp. We'll use 'psql.py' to load the CSV data

gunzip /tmp/enron.csv.gz
./psql.py -t EMAIL_ENRON localhost /tmp/enron.csv

When finished, you should see the output:
CSV Upsert complete. 367662 rows upserted

Interactive analysis with spark-shell

Log in to a node with Spark installed. Note that the phoenix-client JAR must be available in the Spark driver classpath:

cd /path/to/spark/bin
./spark-shell

Once you're in the Spark shell, you can type or paste the code below into the interactive shell:

import org.apache.spark.graphx._
import org.apache.phoenix.spark._

// Load the phoenix table
val rdd = sc.phoenixTableAsRDD("EMAIL_ENRON", Seq("MAIL_FROM", "MAIL_TO"), zkUrl=Some("localhost"))

// Convert to an RDD of VertexId tuples
val rawEdges = rdd.map{ e => (e("MAIL_FROM").asInstanceOf[VertexId], e("MAIL_TO").asInstanceOf[VertexId]) }

// Create a graph with default edge weights
val graph = Graph.fromEdgeTuples(rawEdges, 1.0)

// Run page rank
val pr = graph.pageRank(0.001)

// Save to Phoenix
pr.vertices.saveToPhoenix("EMAIL_ENRON_PAGERANK", Seq("ID", "RANK"), zkUrl = Some("localhost"))

Once finished, you can exit spark-shell with 'ctrl+d'

Results

On your Phoenix node, open sqlline again

cd /path/to/phoenix/bin
./sqlline.py localhost

Let's run a query that will give us the top-ranked entities from the PageRank results

SELECT * FROM EMAIL_ENRON_PAGERANK ORDER BY RANK DESC LIMIT 5;

+------------------------------------------+------------------------------------------+
|                    ID                    |                   RANK                   |
+------------------------------------------+------------------------------------------+
| 5038                                     | 497.2989872977676                        |
| 273                                      | 117.18141799210386                       |
| 140                                      | 108.63091596789913                       |
| 458                                      | 107.2728800448782                        |
| 588                                      | 106.11840798585399                       |
+------------------------------------------+------------------------------------------+

Although this data-set has the email addresses removed, if you're curious, you can find results of a similar analysis here. If you're familiar with the Enron case, some of those names will ring a bell.

Conclusion

Although this example is fairly trivial, it shows the capabilities, as well as the succinctness, of using Phoenix and Spark together to run complex algorithms across arbitrarily large datasets. In my experience, the methods shown here extend quite well to other "big data" problems such as community detection and clustering, as well as anomaly detection. There are likely many other applicable problem domains as well.

Thanks for reading!

Friday April 24, 2015

TPC in Apache Phoenix

I often get asked "Does Apache Phoenix support joins?" That's when I know that marketing is hard, as Phoenix has supported joins for over a year! In fact, thanks to the excellent work of Maryann Xue, one of our project management committee (PMC) members, Phoenix can run many of the complex TPC queries. TPC benchmarks are used in evaluating the performance of computer systems. From Wikipedia:

Transaction Processing Performance Council (TPC) is a non-profit organization founded in 1988 to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry.  

Here are some of the query constructs that Phoenix supports that you may not know about:

Derived Tables to create on-the-fly views:

  SELECT LastName, FirstName
    FROM
      ( SELECT BusinessEntityID, LastName, FirstName FROM Employee
        WHERE State = 'NY' ) AS EmployeeDerivedTable
    WHERE LastName = 'Smith'
    ORDER BY FirstName;

Correlated Sub-queries to correlate column values in inner and outer queries:

  SELECT DISTINCT e.LastName, e.FirstName, e.BusinessEntityID, sp.Bonus
  FROM Employee AS e
  JOIN SalesPerson AS sp ON e.BusinessEntityID = sp.BusinessEntityID
  WHERE sp.Bonus >=
  ( SELECT AVG(sp2.Bonus)
    FROM SalesPerson sp2
    JOIN Employee AS e2 ON e2.BusinessEntityID = sp2.BusinessEntityID
    WHERE e.DepartmentID = e2.DepartmentID );

Semi/Anti Joins to test for the existence or nonexistence of a row:

  SELECT p.Name, p.Job, p.EmployerID
  FROM People AS p
  WHERE EXISTS ( SELECT 1 FROM Employers AS e WHERE e.ID = p.EmployerID );
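
The anti join counterpart simply flips the test to nonexistence; a minimal sketch over the same tables:

  SELECT p.Name, p.Job, p.EmployerID
  FROM People AS p
  WHERE NOT EXISTS ( SELECT 1 FROM Employers AS e WHERE e.ID = p.EmployerID );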

Union All to concatenate rows from multiple tables (available in 4.4.0 release):

  SELECT * FROM Employee e
  WHERE e.BusinessEntityID IN
  ( SELECT BusinessEntityID FROM SalesPerson WHERE Ranking >= 5.0
    UNION ALL
    SELECT BusinessEntityID FROM CustomerReview WHERE Score >= 8.0 ) 

Phoenix has support for both a broadcast hash join strategy, when one side of the join is small enough to fit into memory, as well as a sort merge join strategy for cases when it won't. Put all this functionality together, and Phoenix is capable of running many TPC queries.
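
When neither side of the join is small enough for the hash join, the sort merge strategy can also be requested explicitly with a query hint; a minimal sketch reusing the tables from the examples above:

  SELECT /*+ USE_SORT_MERGE_JOIN */ e.LastName, sp.Bonus
  FROM Employee AS e
  JOIN SalesPerson AS sp ON e.BusinessEntityID = sp.BusinessEntityID;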

Here's an example query from one of the TPC queries:

  SELECT supp_nation, cust_nation, l_year, sum(volume) as revenue
   FROM ( SELECT
    n1.n_name as supp_nation, n2.n_name as cust_nation,
    year(l_shipdate) as l_year, 
    l_extendedprice * (1 - l_discount) as volume
    FROM Supplier
    JOIN LineItem ON s_suppkey = l_suppkey
    JOIN Orders ON o_orderkey = l_orderkey
    JOIN Customer ON c_custkey = o_custkey
    JOIN Nation n1 ON s_nationkey = n1.n_nationkey
    JOIN Nation n2 ON c_nationkey = n2.n_nationkey
    WHERE ( ( n1.n_name = '[NATION1]' and n2.n_name = '[NATION2]' )
       OR ( n1.n_name = '[NATION2]' and n2.n_name = '[NATION1]' ) )
       AND l_shipdate between date '1995-01-01' and date '1996-12-31'
  ) AS Shipping
  GROUP BY supp_nation, cust_nation, l_year
  ORDER BY supp_nation, cust_nation, l_year;

So, please repeat after me: "YES, Phoenix supports joins" :-)

Interested in learning more? Come hear Maryann Xue talk at the upcoming Apache Phoenix meetup on May 5th at the Hortonworks HQ in Santa Clara, or hear our joint talk at HBaseCon 2015 on May 7th where we'll discuss our plan for getting the rest of the way to complete ANSI SQL 92 support.

Tuesday February 24, 2015

Announcing Phoenix 4.3 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.3 release. Phoenix is a relational database layer on top of Apache HBase accessed as a JDBC driver for querying, updating, and managing HBase tables using SQL. The 4.x releases are compatible with HBase 0.98.1 and above.

The release includes:

  • Functional indexes to allow arbitrary expressions to be indexed
  • MapReduce over Phoenix tables
  • Cross join support
  • Query hint to force index usage
  • Set HBase properties through ALTER TABLE statement
  • ISO-8601 date format support on input
  • New RAND built-in function to generate random numbers
  • ANSI SQL date/time literal support
  • Query timeout support in JDBC Statement
  • ~90 bug fixes
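
As an example of the first item, a functional index lets queries that filter on an expression be served from the index; a minimal sketch with hypothetical table and column names:

  CREATE INDEX UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME||' '||LAST_NAME));
  -- A query filtering on the same expression can now use the index
  SELECT EMP_ID FROM EMP WHERE UPPER(FIRST_NAME||' '||LAST_NAME) = 'JOHN DOE';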

Source and binary downloads are available here.

Thursday December 11, 2014

Announcing Phoenix 4.2.2 and 3.2.2 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.2.2/3.2.2 release. Phoenix is a relational database layer on top of Apache HBase accessed as a JDBC driver for querying, updating, and managing HBase tables using SQL. The 4.x releases are compatible with HBase 0.98 while the 3.x releases are compatible with HBase 0.94.

The release includes:

Source and binary downloads are available here.

Friday September 19, 2014

Getting started with Phoenix just got easier

Just wanted to pass along a very nice blog post from SequenceIQ on a pre-built Docker image that they set up for Phoenix+HBase: http://blog.sequenceiq.com/blog/2014/09/04/sql-on-hbase-with-apache-phoenix/

Saturday September 06, 2014

Connecting HBase to Elasticsearch through Phoenix in 10 min or less

It's always exciting when folks reach out to me to demonstrate new ways of leveraging Apache Phoenix in this fantastic open source ecosystem. In this case, Alex Kamil gives step-by-step instructions on using Phoenix with Elasticsearch to enable full-text search on HBase tables: http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html

Enjoy!

Saturday August 30, 2014

Announcing Phoenix 3.1 and 4.1 Released

The Apache Phoenix team is pleased to announce its first releases as a top level project. Phoenix is the best performing open source SQL database for Apache HBase, a NoSQL data store. Phoenix is accessed as a JDBC driver and enables querying, updating, and managing HBase tables using SQL.

Our 3.1 release (for HBase 0.94.4+) includes:

Our 4.1 release (for HBase 0.98.1+) includes all of the above plus:

  • Tracing leveraging Cloudera’s HTrace library to show per-request performance metrics all the way from the client through into the HBase server, and back again
  • Local Indexes targeting write-heavy, space-constrained use cases by co-locating index and table data on the same region server
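
As a quick illustration of the second item, a local index is declared with the LOCAL keyword; a minimal sketch with hypothetical table and column names:

  -- Index data is co-located with table data on the same region server
  CREATE LOCAL INDEX MY_LOCAL_IDX ON MY_TABLE (MY_COLUMN);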

Source and binary downloads are available here.

Wednesday May 28, 2014

OLAP with Apache Phoenix and HBase

Today's blog is brought to you by Juan Rodríguez Hortalá of LAMBDOOP

One of the benefits of having a SQL query interface for a database is that SQL has become a lingua franca that is used as the basis for the interoperability of many systems. One example is visualization tools, which usually accept SQL connections as data sources. In this post we'll see how to use Apache Phoenix as a bridge between HBase and Saiku Analytics, a visualization tool for Business Intelligence.


Saiku is an open source visualization tool for Online Analytical Processing (OLAP). OLAP consists of modeling a given business process as a set of facts corresponding to business transactions; the facts contain measures of quantities relevant to the business and are categorized by several dimensions that describe the context of each transaction.


For example, if we want to analyze the traffic of a website, we might define a measure ACTIVE_VISITOR for the number of visitors and put it into context by considering the date of the visit (DATE dimension) or the part of the site which was visited (FEATURE dimension).


Facts and dimensions are collected into a logical (and sometimes also physical) structure called an OLAP cube, which can be thought of as a multidimensional array indexed by the dimensions and with the measures as values. OLAP cubes are queried using MultiDimensional eXpressions (MDX), a query language for specifying selections and aggregations in a natural way over the multidimensional structure of a cube. And here comes Saiku! Saiku is an OLAP front-end for visualizing the results of MDX queries, both in tabular format and as different types of charts, and it also offers a drag-and-drop interface for generating MDX queries.


Saiku accepts several data sources, and in particular it is able to use the Mondrian server to implement ROLAP, a popular technique for OLAP that uses a relational database to store and compute the cubes. Did I say relational? Well, in fact all that Mondrian needs is a JDBC driver ... And this leads us to this post, where I'll share my experiences combining Phoenix with Mondrian to use HBase as the analytical database in the backend of an OLAP system with Saiku as the front-end. As there are a few layers in this setting, let's describe the flow of a query through the system before going into more detail:

  1. Saiku uses a schema file provided during configuration that contains the declarations of some OLAP cubes, and displays the dimensions and measures of the cubes to the user.

  2. The user selects a cube and builds a query using Saiku’s drag and drop interface over the cube.

  3. Saiku generates an MDX expression and passes it to Mondrian for execution.

  4. Mondrian compiles the MDX expression into an SQL query, and then uses Phoenix JDBC to execute it.

  5. HBase executes the query and then gives the results back to Phoenix.

  6. Phoenix passes the results to Mondrian, which passes them to Saiku, which then renders the results as a table in the user interface.  


Setup and example

We’ll see how to set up Saiku to work with Phoenix with a simple example. This procedure has been tested in a Cloudera Quickstart VM for CDH 4.4.0, which ships with HBase 0.94.6, so we’ll be using Phoenix 3. First of all we have to download Phoenix 3 and install it by copying phoenix-core-*.jar into the classpath of each region server (for example at /usr/lib/hbase/lib/) and then restarting them. Regarding Saiku, it can be installed as a Pentaho BA Server plugin, or as a standalone application as we'll do in this tutorial. To do that, just download Saiku Server 2.5 (including the Foodmart DB) and extract the compressed file, which contains a Tomcat distribution with Saiku installed as a webapp. To start and stop Saiku use the scripts start-saiku.sh and stop-saiku.sh. By default Saiku runs at port 8080 and uses ‘admin’ as both user and password for the login page. After logging in you should be able to select one of the Foodmart DB cubes from the combo box at the upper left corner of the UI.


Now it’s time to connect Saiku to Phoenix. The first step is defining the structure of the OLAP cube we are going to analyze. In the example above we considered analyzing the traffic of a website by measuring quantities like the number of visitors along dimensions like the date of the visit or the part of the site that was visited. This happens to correspond to the structure of the PERFORMANCE tables created by the script bin/performance.py shipped with Phoenix. So let’s run the command bin/performance.py localhost 100000 so that a table PERFORMANCE_100000 with 100,000 rows and the following schema is created:


CREATE TABLE IF NOT EXISTS PERFORMANCE_100000 (
    HOST CHAR(2) NOT NULL, DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL, DATE DATE NOT NULL,
    USAGE.CORE BIGINT, USAGE.DB BIGINT, STATS.ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)
) SPLIT ON ('CSGoogle','CSSalesforce','EUApple','EUGoogle',
    'EUSalesforce', 'NAApple','NAGoogle','NASalesforce');


In this table the 4 columns that compose the primary key correspond to the dimensions of the cube, while the other 3 columns are the quantities to be measured. That is a fact table with 4 degenerate dimensions (see e.g. Star Schema: The Complete Reference for more details about dimensional modeling), which represents an OLAP cube. Nevertheless there is a problem with this cube: the granularity of the time dimension is too fine, as we can see in the following query.


0: jdbc:phoenix:localhost> SELECT TO_CHAR(DATE) FROM PERFORMANCE_1000000 LIMIT 5;
+----------------------+
|    TO_CHAR(DATE)     |
+----------------------+
| 2014-05-15 09:47:42  |
| 2014-05-15 09:47:50  |
| 2014-05-15 09:49:09  |
| 2014-05-15 09:49:23  |
| 2014-05-15 09:49:44  |
+----------------------+
5 rows selected (0.161 seconds)


We have a resolution of seconds in the DATE column, so if we grouped values by date we would end up with a group per record, hence this dimension is useless. To fix this we just have to add a new column with a coarser granularity, for example days. To do that we can use the following script, which should be invoked as bash proc_performance_table.sh <path to phoenix bin directory> PERFORMANCE_100000.


#!/bin/bash

if [ $# -ne 2 ]
then
    echo "Usage: $0 <phoenix bin directory> <table name>"
    exit 1
fi

PHOENIX_BIN_DIR=$1
IN_TABLE=$2

OUT_TABLE="${IN_TABLE}_PROC"
echo "Table $OUT_TABLE will be created"
pushd $PHOENIX_BIN_DIR

./sqlline.py localhost <<END
CREATE TABLE IF NOT EXISTS $OUT_TABLE (
    HOST CHAR(2) NOT NULL,
    DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL,
    DATE DATE NOT NULL,
    DATE_DAY DATE NOT NULL,
    USAGE.CORE BIGINT,
    USAGE.DB BIGINT,
    STATS.ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE))
SPLIT ON
('CSGoogle','CSSalesforce','EUApple','EUGoogle','EUSalesforce', 'NAApple','NAGoogle','NASalesforce');

!autocommit on
UPSERT INTO $OUT_TABLE
SELECT HOST, DOMAIN, FEATURE, DATE,
    TRUNC(DATE, 'DAY') as DATE_DAY,
    USAGE.CORE, USAGE.DB, STATS.ACTIVE_VISITOR
FROM $IN_TABLE;
!quit
END

echo "Done"
popd


That creates a new table PERFORMANCE_100000_PROC with an extra column DATE_DAY holding the date with a resolution of days. We’ll perform the analysis over this table. The next step is creating a Mondrian schema that maps an OLAP cube to this table, so Mondrian knows which columns to use as dimensions and how to define the measures. Create a file Phoenix.xml with the following contents:


<?xml version="1.0"?>
<Schema name="Phoenix">
  <Cube name="PERFORMANCE_100000_PROC">
    <Table name="PERFORMANCE_100000_PROC"/>
    <Dimension name="HOST">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="HOST" column="HOST" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="DOMAIN">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="DOMAIN" column="DOMAIN" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="FEATURE">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="FEATURE" column="FEATURE" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="DATE">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="DAY" column="DATE_DAY" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Measure name="CORE" column="CORE" aggregator="sum" formatString="Standard"/>
    <Measure name="DB" column="DB" aggregator="sum" formatString="Standard"/>
    <Measure name="ACTIVE_VISITOR" column="ACTIVE_VISITOR" aggregator="sum" formatString="Standard"/>
  </Cube>
</Schema>


There we have declared a schema “Phoenix” containing a single cube “PERFORMANCE_100000_PROC” defined over the table “PERFORMANCE_100000_PROC”. This cube has 4 dimensions with a single level each, corresponding to the columns HOST, DOMAIN, FEATURE and DATE_DAY, and 3 measures that are computed by adding the values in the groups determined by the dimensions, over the columns CORE, DB and ACTIVE_VISITOR. As we can see in tomcat/webapps/saiku/WEB-INF/lib/mondrian-3.5.7.jar, this version of Saiku (2.5) uses Mondrian 3, so take a look at the Mondrian 3 documentation or at Mondrian in Action for more information about Mondrian schemas if you are interested.

Now we have to create a new Saiku data source that uses this schema, by creating a file <saiku install directory>/tomcat/webapps/saiku/WEB-INF/classes/saiku-datasources/phoenix with the following contents:


type=OLAP
name=phoenix
driver=mondrian.olap4j.MondrianOlap4jDriver
location=jdbc:mondrian:Jdbc=jdbc:phoenix:localhost;Catalog=res:phoenix/Phoenix.xml;JdbcDrivers=org.apache.phoenix.jdbc.PhoenixDriver


That declares an OLAP connection named phoenix that uses Mondrian as the driver and Phoenix as the underlying JDBC connection. We just need to do a couple more things for this to work:


  • Add the jars for Phoenix, HBase and Hadoop to Saiku’s classpath, by copying phoenix-*client-without-hbase.jar, /usr/lib/hbase/hbase.jar, /usr/lib/hadoop/hadoop-common.jar and /usr/lib/hadoop/hadoop-auth.jar into <saiku install directory>/tomcat/lib.

  • Create a directory <saiku install directory>/tomcat/webapps/saiku/WEB-INF/classes/phoenix and copy the mapping file Phoenix.xml into it.


Now restart Saiku and the new cube should appear. I suggest watching tomcat/logs/catalina.out and tomcat/logs/saiku.log at the Saiku install path for possible errors and other information.


If we take a look at saiku.log we can see the MDX query generated by Saiku, and the running time for the query in HBase. These execution times are for a Cloudera Quickstart VM running on VMware Player with 4448 MB RAM and 4 cores, on a laptop with an i7 processor and 8 GB RAM, and with default settings for HBase, Phoenix and Saiku.


2014-05-18 09:20:58,225 DEBUG [org.saiku.web.rest.resources.QueryResource] TRACK    /query/CC92DB19-D97E-E68D-B927-E6259DB6BF70/resultflattened    GET
2014-05-18 09:20:58,226 INFO  [org.saiku.service.olap.OlapQueryService] runId:2    Type:QM:
SELECT
NON EMPTY {[Measures].[CORE], [Measures].[DB], [Measures].[ACTIVE_VISITOR]} ON COLUMNS,
NON EMPTY CrossJoin([DATE].[DAY].Members, CrossJoin([DOMAIN].[DOMAIN].Members, CrossJoin([FEATURE].[FEATURE].Members, [HOST].[HOST].Members))) ON ROWS
FROM [PERFORMANCE_100000_PROC]
2014-05-18 09:20:59,220 INFO  [org.saiku.service.olap.OlapQueryService] runId:2    Size: 7/55    Execute:    970ms    Format:    24ms     Total: 994ms


The same query against a performance table of 1 million records completes in roughly 8 to 18 seconds in the same setup.

That's all! The next natural step would be defining a richer star schema with independent dimension tables, and testing the limits of Phoenix's star-join optimization. I hope this post has encouraged you to try it yourself.
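
As a rough sketch of that next step, a hypothetical FEATURE_DIM dimension table could be joined to the fact table like this (the table and column names here are assumptions, not part of the tutorial above):

CREATE TABLE IF NOT EXISTS FEATURE_DIM (
    FEATURE VARCHAR NOT NULL PRIMARY KEY,
    FEATURE_GROUP VARCHAR);

SELECT d.FEATURE_GROUP, SUM(ACTIVE_VISITOR) AS VISITORS
FROM PERFORMANCE_100000_PROC f
JOIN FEATURE_DIM d ON f.FEATURE = d.FEATURE
GROUP BY d.FEATURE_GROUP;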

Friday May 23, 2014

Apache Phoenix graduates as top level project

I'm pleased to announce that Apache Phoenix, the SQL database engine over HBase, has graduated as a top level Apache project. Congratulations to the entire team!
