Apache Phoenix

Monday June 29, 2015

Spark Integration in Apache Phoenix

Today's blog is brought to you by our latest committer and the developer behind the Spark integration in Apache Phoenix, Josh Mahonin, a Software Architect at Interset.

PageRank with Phoenix and Spark

The Phoenix SQL interface provides a lot of great analytics capabilities on top of structured HBase data. Some tasks however, such as machine learning or graph analysis, are more efficiently done using other tools like Apache Spark.

Since the Phoenix 4.4.0 release, the phoenix-spark module allows us to expose Phoenix tables as RDDs or DataFrames within Spark. From there, that same data can be used with other tools within Spark, such as the machine learning library MLlib, the graph engine GraphX, or Spark Streaming.

This example makes use of the Enron email test set from Stanford Network Analysis Project, and executes the GraphX implementation of PageRank on it to find interesting entities. It then saves the results back to Phoenix.
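For readers new to PageRank, here is a minimal, self-contained sketch of the iterative computation GraphX performs (plain Java over a tiny hard-coded edge list; a simplification for illustration, not the GraphX or Phoenix API):

```java
import java.util.*;

public class PageRankSketch {
    // Tiny hard-coded directed edge list standing in for the "from -> to" email pairs.
    static final long[][] EDGES = {{1, 2}, {1, 3}, {2, 3}, {3, 1}, {4, 3}};

    static Map<Long, Double> compute() {
        final double damping = 0.85;
        Map<Long, List<Long>> outLinks = new HashMap<>();
        Set<Long> vertices = new HashSet<>();
        for (long[] e : EDGES) {
            outLinks.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            vertices.add(e[0]);
            vertices.add(e[1]);
        }
        Map<Long, Double> rank = new HashMap<>();
        for (long v : vertices) rank.put(v, 1.0);
        // Fixed iteration count for simplicity; graph.pageRank(0.001) instead
        // iterates until the ranks change by less than the given tolerance.
        for (int iter = 0; iter < 20; iter++) {
            Map<Long, Double> contrib = new HashMap<>();
            for (long v : vertices) contrib.put(v, 0.0);
            for (Map.Entry<Long, List<Long>> e : outLinks.entrySet()) {
                // Each vertex spreads its rank evenly across its outgoing edges.
                double share = rank.get(e.getKey()) / e.getValue().size();
                for (long dst : e.getValue()) contrib.merge(dst, share, Double::sum);
            }
            for (long v : vertices)
                rank.put(v, (1 - damping) + damping * contrib.get(v));
        }
        return rank;
    }

    public static void main(String[] args) {
        // Vertices with many incoming links (here, vertex 3) end up ranked highest.
        compute().forEach((v, r) -> System.out.printf("%d -> %.3f%n", v, r));
    }
}
```

Vertex 3, which receives the most links, comes out on top, just as the most heavily emailed entities do when the same idea runs over the full Enron dataset.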

Note that runnable source code is also available on Github.

Prerequisites:

  • Phoenix 4.4.0+
  • Spark 1.3.0+ (ensure the phoenix-client JAR is in the Spark driver classpath; see the setup guide)

Load sample data

Log in to a node with the Apache Phoenix binaries available. I will use localhost to refer to the Phoenix URL, but you may need to adjust for your local environment.

cd /path/to/phoenix/bin
./sqlline.py localhost

Once in the SQLLine console, we'll create the tables to hold the input data, and the destination table for the PageRank results:

CREATE TABLE EMAIL_ENRON(MAIL_FROM BIGINT NOT NULL, MAIL_TO BIGINT NOT NULL CONSTRAINT pk PRIMARY KEY(MAIL_FROM, MAIL_TO));
CREATE TABLE EMAIL_ENRON_PAGERANK(ID BIGINT NOT NULL, RANK DOUBLE CONSTRAINT pk PRIMARY KEY(ID));

Use 'ctrl+d' to exit SQLLine.

Download and extract the file enron.csv.gz to a local directory, such as /tmp. We'll use 'psql.py' to load the CSV data:

gunzip /tmp/enron.csv.gz
./psql.py -t EMAIL_ENRON localhost /tmp/enron.csv

When finished, you should see the output:
CSV Upsert complete. 367662 rows upserted

Interactive analysis with spark-shell

Log in to a node with Spark installed. Note that the phoenix-client JAR must be available in the Spark driver classpath.

cd /path/to/spark/bin
./spark-shell
Once you're in the spark shell, you can type or paste the code below into the interactive shell:

import org.apache.spark.graphx._
import org.apache.phoenix.spark._

// Load the phoenix table
val rdd = sc.phoenixTableAsRDD("EMAIL_ENRON", Seq("MAIL_FROM", "MAIL_TO"), zkUrl=Some("localhost"))

// Convert to an RDD of VertexId tuples
val rawEdges = rdd.map{ e => (e("MAIL_FROM").asInstanceOf[VertexId], e("MAIL_TO").asInstanceOf[VertexId]) }

// Create a graph with default edge weights
val graph = Graph.fromEdgeTuples(rawEdges, 1.0)

// Run page rank
val pr = graph.pageRank(0.001)

// Save to Phoenix
pr.vertices.saveToPhoenix("EMAIL_ENRON_PAGERANK", Seq("ID", "RANK"), zkUrl = Some("localhost"))

Once finished, you can exit spark-shell with 'ctrl+d'.


On your Phoenix node, open sqlline again:

cd /path/to/phoenix/bin
./sqlline.py localhost

Let's run a query that will give us the top-ranked entities from the PageRank results:

SELECT * FROM EMAIL_ENRON_PAGERANK ORDER BY RANK DESC LIMIT 5;

+------------------------------------------+------------------------------------------+
|                    ID                    |                   RANK                   |
+------------------------------------------+------------------------------------------+
| 5038                                     | 497.2989872977676                        |
| 273                                      | 117.18141799210386                       |
| 140                                      | 108.63091596789913                       |
| 458                                      | 107.2728800448782                        |
| 588                                      | 106.11840798585399                       |
+------------------------------------------+------------------------------------------+

Although this dataset has the email addresses removed, if you're curious, you can find the results of a similar analysis here. If you're familiar with the Enron case, some of those names will ring a bell.


Although this example is fairly trivial, it shows the capability, as well as the succinctness, of using Phoenix and Spark together to run complex algorithms across arbitrarily large datasets. In my experience, the methods shown here extend quite well to other "big data" problems such as community detection, clustering, and anomaly detection. There are likely many other applicable problem domains as well.

Thanks for reading!

Friday April 24, 2015

TPC in Apache Phoenix

I often get asked "Does Apache Phoenix support joins?" That's when I know that marketing is hard, as Phoenix has supported joins for over a year! In fact, thanks to the excellent work of Maryann Xue, one of our project management committee (PMC) members, Phoenix can run many of the complex TPC queries. TPC benchmarks are used in evaluating the performance of computer systems. From wikipedia:

Transaction Processing Performance Council (TPC) is a non-profit organization founded in 1988 to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry.  

Here are some of the query constructs that Phoenix supports that you may not know about:

Derived Tables to create on-the-fly views:

  SELECT LastName, FirstName
    FROM ( SELECT BusinessEntityID, LastName, FirstName FROM Employee
           WHERE State = 'NY' ) AS EmployeeDerivedTable
    WHERE LastName = 'Smith'
    ORDER BY FirstName;

Correlated Sub-queries to correlate column values in inner and outer queries:

  SELECT DISTINCT e.LastName, e.FirstName, e.BusinessEntityID, sp.Bonus
  FROM Employee AS e
  JOIN SalesPerson AS sp ON e.BusinessEntityID = sp.BusinessEntityID
  WHERE sp.Bonus >=
  ( SELECT AVG(sp2.Bonus)
    FROM SalesPerson AS sp2
    JOIN Employee AS e2 ON e2.BusinessEntityID = sp2.BusinessEntityID
    WHERE e.DepartmentID = e2.DepartmentID );
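To make the semantics concrete, here is a plain-Java sketch (hypothetical in-memory data, not the Phoenix API) of what that correlated subquery computes: each employee is kept only if their bonus is at least the average bonus of their own department:

```java
import java.util.*;
import java.util.stream.*;

public class CorrelatedSubqueryDemo {
    record Emp(String name, int dept, double bonus) {}

    static List<Emp> filter(List<Emp> emps) {
        // Inner query: average bonus per department (correlated on DepartmentID).
        Map<Integer, Double> avgByDept = emps.stream()
            .collect(Collectors.groupingBy(Emp::dept, Collectors.averagingDouble(Emp::bonus)));
        // Outer query: keep rows whose bonus >= their own department's average.
        return emps.stream()
            .filter(e -> e.bonus() >= avgByDept.get(e.dept()))
            .toList();
    }

    public static void main(String[] args) {
        List<Emp> emps = List.of(
            new Emp("Ann", 1, 500), new Emp("Bob", 1, 300),
            new Emp("Cid", 2, 200), new Emp("Dee", 2, 200));
        filter(emps).forEach(e -> System.out.println(e.name())); // Ann, Cid, Dee
    }
}
```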

Semi/Anti Joins to test for the existence or nonexistence of a row:

  SELECT p.Name, p.Job, p.EmployerID
  FROM People AS p
  WHERE EXISTS ( SELECT 1 FROM Employers AS e WHERE e.ID = p.EmployerID );
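In plain Java terms (hypothetical in-memory data, not Phoenix internals), a semi join keeps rows whose key exists in the other table, and an anti join keeps those whose key does not:

```java
import java.util.*;
import java.util.stream.*;

public class SemiAntiJoinDemo {
    record Person(String name, int employerId) {}

    // Semi join: keep people whose employerId EXISTS in the employer ID set.
    static List<Person> semiJoin(List<Person> people, Set<Integer> employerIds) {
        return people.stream().filter(p -> employerIds.contains(p.employerId())).toList();
    }

    // Anti join: keep people whose employerId does NOT EXIST in the employer ID set.
    static List<Person> antiJoin(List<Person> people, Set<Integer> employerIds) {
        return people.stream().filter(p -> !employerIds.contains(p.employerId())).toList();
    }

    public static void main(String[] args) {
        List<Person> people = List.of(new Person("Ann", 1), new Person("Bob", 2));
        Set<Integer> employers = Set.of(1);
        System.out.println(semiJoin(people, employers)); // only Ann's row
        System.out.println(antiJoin(people, employers)); // only Bob's row
    }
}
```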

Union All to concatenate rows from multiple tables (available in 4.4.0 release):

  SELECT * FROM Employee e
  WHERE e.BusinessEntityID IN
  ( SELECT BusinessEntityID FROM SalesPerson WHERE Ranking >= 5.0
    UNION ALL
    SELECT BusinessEntityID FROM CustomerReview WHERE Score >= 8.0 );

Phoenix supports both a broadcast hash join strategy, for when one side of the join is small enough to fit into memory, and a sort-merge join strategy for when it won't. Put all this functionality together, and Phoenix is capable of running many TPC queries.
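The difference between the two strategies can be sketched in a few lines of plain Java (hypothetical in-memory tables of {key, value} rows, not Phoenix internals): a hash join builds a lookup table from the small side, while a sort-merge join sorts both sides and advances two cursors in lockstep:

```java
import java.util.*;

public class JoinStrategies {
    // Hash join: build a hash table on the small side, probe with the large side.
    static List<long[]> hashJoin(List<long[]> small, List<long[]> large) {
        Map<Long, List<long[]>> built = new HashMap<>();
        for (long[] row : small)
            built.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        List<long[]> out = new ArrayList<>();
        for (long[] probe : large)
            for (long[] match : built.getOrDefault(probe[0], List.of()))
                out.add(new long[]{probe[0], match[1], probe[1]});
        return out;
    }

    // Sort-merge join: sort both inputs on the key, then advance two cursors.
    static List<long[]> sortMergeJoin(List<long[]> a, List<long[]> b) {
        a.sort(Comparator.comparingLong(r -> r[0]));
        b.sort(Comparator.comparingLong(r -> r[0]));
        List<long[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            long ka = a.get(i)[0], kb = b.get(j)[0];
            if (ka < kb) i++;
            else if (ka > kb) j++;
            else {
                // Emit all pairs sharing this key, handling duplicates on both sides.
                int j0 = j;
                while (j < b.size() && b.get(j)[0] == ka)
                    out.add(new long[]{ka, a.get(i)[1], b.get(j++)[1]});
                i++;
                if (i < a.size() && a.get(i)[0] == ka) j = j0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> small = new ArrayList<>(List.of(new long[]{1, 10}, new long[]{2, 20}));
        List<long[]> large = new ArrayList<>(List.of(new long[]{1, 100}, new long[]{1, 101}, new long[]{3, 300}));
        System.out.println(hashJoin(small, large).size());      // 2 matching pairs
        System.out.println(sortMergeJoin(small, large).size()); // the same 2 pairs
    }
}
```

The trade-off mirrors Phoenix's: the hash join needs the built side to fit in memory but probes in constant time, while the sort-merge join streams arbitrarily large inputs at the cost of sorting them first.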

Here's one example from the TPC-H benchmark queries:

  SELECT supp_nation, cust_nation, l_year, sum(volume) as revenue
  FROM ( SELECT
      n1.n_name as supp_nation, n2.n_name as cust_nation,
      year(l_shipdate) as l_year,
      l_extendedprice * (1 - l_discount) as volume
    FROM Supplier
    JOIN LineItem ON s_suppkey = l_suppkey
    JOIN Orders ON o_orderkey = l_orderkey
    JOIN Customer ON c_custkey = o_custkey
    JOIN Nation n1 ON s_nationkey = n1.n_nationkey
    JOIN Nation n2 ON c_nationkey = n2.n_nationkey
    WHERE ( ( n1.n_name = '[NATION1]' AND n2.n_name = '[NATION2]' )
         OR ( n1.n_name = '[NATION2]' AND n2.n_name = '[NATION1]' ) )
      AND l_shipdate BETWEEN date '1995-01-01' AND date '1996-12-31'
  ) AS Shipping
  GROUP BY supp_nation, cust_nation, l_year
  ORDER BY supp_nation, cust_nation, l_year;

So, please repeat after me: "YES, Phoenix supports joins" :-)

Interested in learning more? Come hear Maryann Xue talk at the upcoming Apache Phoenix meetup on May 5th at the Hortonworks HQ in Santa Clara, or hear our joint talk at HBaseCon 2015 on May 7th where we'll discuss our plan for getting the rest of the way to complete ANSI SQL 92 support.

Tuesday February 24, 2015

Announcing Phoenix 4.3 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.3 release. Phoenix is a relational database layer on top of Apache HBase accessed as a JDBC driver for querying, updating, and managing HBase tables using SQL. The 4.x releases are compatible with HBase 0.98.1 and above.

The release includes:

  • Functional indexes to allow arbitrary expressions to be indexed
  • Map reduce over Phoenix tables
  • Cross join support
  • Query hint to force index usage
  • Set HBase properties through ALTER TABLE statement
  • ISO-8601 date format support on input
  • New RAND built-in function to generate random numbers
  • ANSI SQL date/time literal support
  • Query timeout support in JDBC Statement
  • ~90 bug fixes

Source and binary downloads are available here.

Thursday December 11, 2014

Announcing Phoenix 4.2.2 and 3.2.2 released

The Apache Phoenix team is pleased to announce the immediate availability of the 4.2.2/3.2.2 release. Phoenix is a relational database layer on top of Apache HBase accessed as a JDBC driver for querying, updating, and managing HBase tables using SQL. The 4.x releases are compatible with HBase 0.98 while the 3.x releases are compatible with HBase 0.94.

The release includes:

Source and binary downloads are available here.

Friday September 19, 2014

Getting started with Phoenix just got easier

Just wanted to pass along a very nice blog from SequenceIQ on a pre-cooked Docker that they setup for Phoenix+HBase: http://blog.sequenceiq.com/blog/2014/09/04/sql-on-hbase-with-apache-phoenix/

Saturday September 06, 2014

Connecting HBase to Elasticsearch through Phoenix in 10 min or less

It's always exciting when folks reach out to me to demonstrate new ways of leveraging Apache Phoenix in this fantastic open source ecosystem. In this case, Alex Kamil gives step-by-step instructions on using Phoenix with Elasticsearch to enable full-text search on HBase tables: http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html


Saturday August 30, 2014

Announcing Phoenix 3.1 and 4.1 Released

The Apache Phoenix team is pleased to announce its first releases as a top level project. Phoenix is the best performing open source SQL database for Apache HBase, a NoSQL data store. Phoenix is accessed as a JDBC driver and enables querying, updating, and managing HBase tables using SQL.

Our 3.1 release (for HBase 0.94.4+) includes:

Our 4.1 release (for HBase 0.98.1+) includes all of the above plus:

  • Tracing leveraging Cloudera’s HTrace library to show per-request performance metrics all the way from the client into the HBase server, and back again
  • Local Indexes targeting write-heavy, space-constrained use cases by co-locating index and table data on the same region server

Source and binary downloads are available here.

Wednesday May 28, 2014

OLAP with Apache Phoenix and HBase

Today's blog is brought to you by Juan Rodríguez Hortalá of LAMBDOOP

One of the benefits of having a SQL query interface for a database is that SQL has become a lingua franca that is used as the basis for the interoperability of many systems. One example of that are visualization tools, that usually accept SQL connections as data sources. In this post we'll see how to use Apache Phoenix as a bridge between HBase and Saiku Analytics, a visualization tool for Business Intelligence.

Saiku is an open source visualization tool for Online Analytical Processing (OLAP). OLAP consists in modeling a given business process as a set of facts corresponding to business transactions, which contain measures of quantities relevant to the business and are categorized by several dimensions that describe the context of the transaction.

For example if we want to analyze the traffic of a website then we might define a measure ACTIVE_VISITOR for the number of visitors and put it into context by considering the date of the visit (DATE dimension), or the part of the site which was visited (FEATURE dimension).

Facts and dimensions are collected into a logical (and sometimes also physical) structure called an OLAP cube, which can be thought of as a multidimensional array indexed by the dimensions and with the measures as values. OLAP cubes are queried using MultiDimensional eXpressions (MDX), a query language for specifying selections and aggregations in a natural way over the multidimensional structure of a cube. And here comes Saiku! Saiku is an OLAP front-end for visualizing the results of MDX queries, either in tabular format or as different types of charts, and it also offers a drag-and-drop interface for generating MDX queries.
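At its core, evaluating a measure over such a cube is a grouped aggregation. Here is a toy sketch in Java (hypothetical fact rows, borrowing the ACTIVE_VISITOR measure and the DATE and FEATURE dimensions from the example above):

```java
import java.util.*;

public class CubeSketch {
    // One fact row: the dimensions give context, the measure is the value.
    record Fact(String date, String feature, int activeVisitors) {}

    // Roll up the measure along one dimension: total visitors per feature.
    static Map<String, Integer> rollupByFeature(List<Fact> facts) {
        Map<String, Integer> out = new TreeMap<>();
        for (Fact f : facts)
            out.merge(f.feature(), f.activeVisitors(), Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        List<Fact> facts = List.of(
            new Fact("2014-05-15", "Login", 10),
            new Fact("2014-05-15", "Search", 7),
            new Fact("2014-05-16", "Login", 5));
        System.out.println(rollupByFeature(facts)); // {Login=15, Search=7}
    }
}
```

An MDX engine like Mondrian generalizes this to any combination of dimensions, and a ROLAP backend pushes the aggregation down as a SQL GROUP BY, which is exactly the work Phoenix ends up doing in this post.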

Saiku accepts several data sources, and in particular it is able to use the Mondrian server to implement ROLAP, which is a popular technique for OLAP that uses a relational database to store and compute the cubes. Did I say relational? Well, in fact all that Mondrian needs is a JDBC driver ... And this leads us to this post, where I'll share my experiences combining Phoenix with Mondrian to use HBase as the analytical database in the backend of an OLAP system with Saiku as the front-end. As there are a few layers in this setting, let's describe the flow of a query in this system before going into more detail:

  1. Saiku uses a schema file provided during configuration that contains the declarations of some OLAP cubes, and displays the dimensions and measures of the cubes to the user.

  2. The user selects a cube and builds a query using Saiku’s drag and drop interface over the cube.

  3. Saiku generates an MDX expression and passes it to Mondrian for execution.

  4. Mondrian compiles the MDX expression into an SQL query, and then uses Phoenix JDBC to execute it.

  5. HBase executes the query and then gives the results back to Phoenix.

  6. Phoenix passes the results to Mondrian, which passes them to Saiku, which then renders the results as a table in the user interface.  

Setup and example

We’ll see how to set up Saiku to work with Phoenix through a simple example. This procedure has been tested on a Cloudera Quickstart VM for CDH 4.4.0, which ships with HBase 0.94.6, so we’ll be using Phoenix 3. First of all we have to download Phoenix 3 and install it by copying phoenix-core-*.jar into the classpath of each region server (for example at /usr/lib/hbase/lib/) and then restarting them. Regarding Saiku, it can be installed as a Pentaho BA Server plugin, or as a standalone application as we'll do in this tutorial. To do that, just download Saiku Server 2.5 (including Foodmart DB) and extract the compressed file, which contains a Tomcat distribution with Saiku installed as a webapp. To start and stop Saiku use the scripts start-saiku.sh and stop-saiku.sh. By default Saiku runs on port 8080 and uses ‘admin’ as both user and password for the login page. After login you should be able to select one of the cubes of the Foodmart DB in the combo box at the upper left corner of the UI.

Now it’s time to connect Saiku to Phoenix. The first step is defining the structure of the OLAP cube we are going to analyze. In the example above we considered analyzing the traffic of a website by measuring quantities like the number of visitors along dimensions like the date of the visit or the part of the site that was visited. This happens to correspond to the structure of the PERFORMANCE tables created by the script bin/performance.py shipped with Phoenix. So let’s run the command bin/performance.py localhost 100000 to create a table PERFORMANCE_100000 with 100,000 rows and the following schema:






CREATE TABLE PERFORMANCE_100000 (
    HOST CHAR(2) NOT NULL,
    DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL,
    DATE DATE NOT NULL,
    CORE BIGINT,
    DB BIGINT,
    ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)
) SPLIT ON ('CSGoogle','CSSalesforce','EUApple','EUGoogle',
    'EUSalesforce','NAApple','NAGoogle','NASalesforce');

In this table the 4 columns that compose the primary key correspond to the dimensions of the cube, while the other 3 columns are the quantities to be measured. That is a fact table with 4 degenerate dimensions (see e.g. Star Schema: The Complete Reference for more details about dimensional modeling), which represents an OLAP cube. Nevertheless there is a problem with this cube: the granularity of the time dimension is too fine, as we can see in the following query.

0: jdbc:phoenix:localhost> SELECT TO_CHAR(DATE) FROM PERFORMANCE_100000 LIMIT 5;

+---------------------+
|    TO_CHAR(DATE)    |
+---------------------+
| 2014-05-15 09:47:42 |
| 2014-05-15 09:47:50 |
| 2014-05-15 09:49:09 |
| 2014-05-15 09:49:23 |
| 2014-05-15 09:49:44 |
+---------------------+
5 rows selected (0.161 seconds)

We have a resolution of seconds in the DATE column, so if we grouped values by date we would end up with a group per record, hence this dimension is useless. To fix this we just have to add a new column with a coarser granularity, for example of days. To do that we can use the following script, which should be invoked as bash proc_performance_table.sh <path to phoenix bin directory> PERFORMANCE_100000.


if [ $# -ne 2 ]
then
    echo "Usage: $0 <phoenix bin directory> <table name>"
    exit 1
fi

PHOENIX_BIN_DIR=$1
TABLE=$2
OUT_TABLE=${TABLE}_PROC

echo "Table $OUT_TABLE will be created"

cd "$PHOENIX_BIN_DIR"
./sqlline.py localhost <<END
CREATE TABLE IF NOT EXISTS ${OUT_TABLE} (
    HOST CHAR(2) NOT NULL,
    DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL,
    DATE DATE NOT NULL,
    DATE_DAY DATE,
    CORE BIGINT,
    DB BIGINT,
    ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)
) SPLIT ON
('CSGoogle','CSSalesforce','EUApple','EUGoogle','EUSalesforce', 'NAApple','NAGoogle','NASalesforce');
!autocommit on
UPSERT INTO ${OUT_TABLE}
    SELECT HOST, DOMAIN, FEATURE, DATE, TRUNC(DATE, 'DAY'), CORE, DB, ACTIVE_VISITOR
    FROM ${TABLE};
!quit
END

echo "Done"
That creates a new table PERFORMANCE_100000_PROC with an extra column DATE_DAY holding the date at a resolution of days. We’ll perform the analysis over this table. The next step is creating a Mondrian schema that maps an OLAP cube to this table, so Mondrian knows which columns to use as dimensions and how to define the measures. Create a file Phoenix.xml with the following contents:

<?xml version="1.0"?>
<Schema name="Phoenix">
  <Cube name="PERFORMANCE_100000_PROC">
    <Table name="PERFORMANCE_100000_PROC"/>
    <Dimension name="HOST">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="HOST" column="HOST" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="DOMAIN">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="DOMAIN" column="DOMAIN" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="FEATURE">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="FEATURE" column="FEATURE" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="DATE">
      <Hierarchy hasAll="true" allMemberName="All Types">
        <Level name="DAY" column="DATE_DAY" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Measure name="CORE" column="CORE" aggregator="sum" formatString="Standard"/>
    <Measure name="DB" column="DB" aggregator="sum" formatString="Standard"/>
    <Measure name="ACTIVE_VISITOR" column="ACTIVE_VISITOR" aggregator="sum" formatString="Standard"/>
  </Cube>
</Schema>



There we have declared a schema “Phoenix” containing a single cube “PERFORMANCE_100000_PROC” defined over the table “PERFORMANCE_100000_PROC”. This cube has 4 single-level dimensions, corresponding to the columns HOST, DOMAIN, FEATURE and DATE_DAY, and 3 measures that are computed by summing the values of the columns CORE, DB and ACTIVE_VISITOR within the groups determined by the dimensions. As we can see in tomcat/webapps/saiku/WEB-INF/lib/mondrian-3.5.7.jar, this version of Saiku (2.5) uses Mondrian 3, so take a look at the Mondrian 3 documentation or Mondrian in Action for more information about Mondrian schemas if you are interested.

Now we have to create a new Saiku data source that uses this schema, by creating a file <saiku install directory>/tomcat/webapps/saiku/WEB-INF/classes/saiku-datasources/phoenix with contents along the following lines:

type=OLAP
name=phoenix
driver=mondrian.olap4j.MondrianOlap4jDriver
location=jdbc:mondrian:Jdbc=jdbc:phoenix:localhost;Catalog=res:phoenix/Phoenix.xml;JdbcDrivers=org.apache.phoenix.jdbc.PhoenixDriver;
username=admin
password=admin

That declares an OLAP connection named phoenix that uses Mondrian as the driver and Phoenix as the JDBC backend. We just need to do a couple more things for this to work:

  • Add the jars for Phoenix, HBase and Hadoop to Saiku’s classpath, by copying phoenix-*client-without-hbase.jar, /usr/lib/hbase/hbase.jar, /usr/lib/hadoop/hadoop-common.jar and /usr/lib/hadoop/hadoop-auth.jar into  <saiku install directory>/tomcat/lib.

  • Create a directory <saiku install directory>/tomcat/webapps/saiku/WEB-INF/classes/phoenix and copy the mapping file Phoenix.xml into it.

Now restart Saiku and the new cube should appear. I suggest watching tomcat/logs/catalina.out and tomcat/logs/saiku.log at the Saiku install path for possible errors and other information.

If we take a look at saiku.log we can see the MDX query generated by Saiku, and the running time for the query in HBase. These execution times are for a Cloudera Quickstart VM running on VMware Player with 4448 MB RAM and 4 cores, on a laptop with an i7 processor and 8 GB RAM, with default settings for HBase, Phoenix and Saiku.

2014-05-18 09:20:58,225 DEBUG [org.saiku.web.rest.resources.QueryResource] TRACK    /query/CC92DB19-D97E-E68D-B927-E6259DB6BF70/resultflattened    GET

2014-05-18 09:20:58,226 INFO  [org.saiku.service.olap.OlapQueryService] runId:2    Type:QM:


SELECT
NON EMPTY {[Measures].[CORE], [Measures].[DB], [Measures].[ACTIVE_VISITOR]} ON COLUMNS,
NON EMPTY CrossJoin([DATE].[DAY].Members, CrossJoin([DOMAIN].[DOMAIN].Members, CrossJoin([FEATURE].[FEATURE].Members, [HOST].[HOST].Members))) ON ROWS
FROM [PERFORMANCE_100000_PROC]
2014-05-18 09:20:59,220 INFO  [org.saiku.service.olap.OlapQueryService] runId:2    Size: 7/55    Execute:    970ms    Format:    24ms     Total: 994ms

The same query over a performance table of 1 million records completes in around 8 to 18 seconds in the same setting.

That's all! The next natural step would be defining a richer star schema with independent dimension tables, and testing the limits of Phoenix's star-join optimization. I hope this post has encouraged you to try it yourself.

Friday May 23, 2014

Apache Phoenix graduates as top level project

I'm pleased to announce that Apache Phoenix, the SQL database engine over HBase, has graduated as a top level Apache project. Congratulations to the entire team!

Saturday April 12, 2014

Apache Phoenix releases next major version

The Apache Phoenix team is pleased to announce the release of its next major versions: 3.0.0 and 4.0.0 from the Apache Incubator. Phoenix is a SQL query engine for Apache HBase, a NoSQL data store. It is accessed as a JDBC driver and enables querying and managing HBase tables using SQL.

Major new features include:

Phoenix now supports the HBase 0.98 branch (0.98.1+) in its 4.0 release while the 3.0 release continues to support the HBase 0.94 branch (0.94.4+).

Downloads are available here.


