Apache Drill Blog

Friday September 12, 2014

Announcing the Apache Drill Beta Release, Self Service Data Exploration in Action

It is our pleasure to announce the 0.5.0 release of Apache Drill. This is Drill’s first beta release and the second in our iterative monthly release cycle. It addresses more than 100 issues since last month’s release and more than 1,000 since Drill’s inception. This is a great release to start exploring your data, wherever and whatever it is.

For more background on what Drill is about, check out the Drill overview or Drill in 10 minutes. The 0.5.0 release builds upon the huge 0.4.0 release, so you should refer to last month’s release announcement for information on all the functionality available. Notable features in 0.5.0 include the following:

  • Drill now uses the Hadoop 2.4.1 APIs.  This includes upgrading Parquet to use direct memory and the ability to write larger Parquet files when using CREATE TABLE AS.
  • Improved JOIN planning when using HBase tables, based on row count approximations from region-level statistics.
  • Improved handling of large sorts and out-of-memory conditions.
  • JSON projection pushdown, an all-text JSON mode and boolean short circuit. Each of these features allows more flexibility when interacting with complicated JSON files.
  • Substantial improvements in SELECT * handling when interacting with schemaless data sources.
  • Creation of a self-contained JDBC JAR file to ease access to Drill from JDBC tools.
  • Fully distributed execution of all basic aggregates, including standard deviation and average.
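
To make the CREATE TABLE AS and all-text JSON items above a little more concrete, here is a minimal sketch that assembles the statements one might run to copy a JSON file into a Parquet table. The option names `store.format` and `store.json.all_text_mode` and the writable `dfs.tmp` workspace are taken from later Drill documentation and are assumptions for 0.5.0; the helper name, table name and file path are hypothetical.

```python
from typing import List


def ctas_parquet_script(table: str, source_path: str, all_text: bool = False) -> List[str]:
    """Return the SQL statements for a JSON-to-Parquet copy (hypothetical helper).

    Assumes the session options `store.format` and `store.json.all_text_mode`
    and a writable `dfs.tmp` workspace, none of which this post confirms.
    """
    for name in (table, source_path):
        if "`" in name:
            raise ValueError("names must not contain a backtick")

    # Ask Drill to write CREATE TABLE AS output as Parquet.
    statements = ["ALTER SESSION SET `store.format` = 'parquet'"]
    if all_text:
        # Read every JSON scalar as text, which helps with inconsistent files.
        statements.append("ALTER SESSION SET `store.json.all_text_mode` = true")
    statements.append(
        f"CREATE TABLE dfs.tmp.`{table}` AS SELECT * FROM dfs.`{source_path}`"
    )
    return statements


# Hypothetical table name and file path, for illustration only.
for statement in ctas_parquet_script("donuts_parquet", "/tmp/donuts.json", all_text=True):
    print(statement + ";")
```

Each statement would be run in order in the same Drill session, since the ALTER SESSION settings only apply to the session that issued them.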

Drill will continue its march towards GA, with upcoming monthly releases further hardening and expanding Drill’s capabilities and performance. Check out the release notes, download it, or better yet, make your own fork and contribute back to the community. Together, we can make data available to everyone, anywhere.

-The Apache Drill Team  

Friday August 08, 2014

Announcing Apache Drill 0.4.0: Self Service Data Exploration

We’re very excited to announce the release of Apache Drill 0.4.0. This release is a developer preview release and is the first in a series of monthly builds as Drill drives towards Beta and GA. Although this is only the second incubator release, the growth and strength of the Apache Drill community are already apparent. This release contains more than 800 JIRAs and 100,000 lines of new code from 40+ contributors in 15+ organizations.

Apache Drill was founded with the audacious goal of redefining analytics for flexibility using modern data formats while establishing a new benchmark for performance.  Rather than re-implementing technologies and approaches from 30 years ago, Drill focuses on redefining the nature of data and metadata and strives to combine SQL, NoSQL and document database approaches in a single set of query capabilities. This release starts to deliver on these goals by allowing you to start experimenting with Drill’s new instant, no-setup analysis paradigm.

At its core, Drill was designed for ease of use and self-service data exploration.  That means little setup, embracing convention over configuration and allowing a user to experiment on any platform. Unlike most systems in the Hadoop ecosystem, Apache Drill is software that you can start using on your desktop in just a couple of minutes. Just find a JSON or CSV file that you want to analyze, download Drill and execute your first query. For more details, see the Drill in 10 minutes section of the documentation.
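
As a rough illustration of what that first query can look like, the sketch below builds the SQL text for selecting from a raw file on local disk. It assumes the default `dfs` storage plugin that ships with an embedded install; the helper name `first_query` and the file path are hypothetical.

```python
def first_query(path: str, limit: int = 10) -> str:
    """Build a Drill SELECT over a raw file (hypothetical helper).

    Assumes the default `dfs` storage plugin. Drill quotes identifiers,
    including file paths, with backticks.
    """
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    if limit < 1:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM dfs.`{path}` LIMIT {limit}"


# Hypothetical file path, for illustration only.
print(first_query("/tmp/donuts.json", limit=5))
```

The resulting statement can be pasted into the Drill shell as-is; no table definition or schema registration is needed beforehand.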

This is a huge release that comes with a large number of changes and new features.  Some highlights include:

  • A new way to work with data and metadata including the first query engine to champion advanced Apache Parquet format files to support self-describing data, completely avoiding a central metadata repository.
  • A completely new columnar execution engine that leverages both runtime code compilation and advanced memory management for query execution.
  • Advanced cost-based query optimization that works with or without stats providing complex distributed query planning.
  • Focus on full SQL capability with support for correlated subqueries, complex subexpressions and scalar subqueries.
  • The first query engine to support JSON everything, enabling instant analysis of semi-structured and partially schemed data without setup or extra effort.
  • Full complex data semantics combined with complete SQL data types allow you to use JavaScript notation to access and interact with complex fields and data structures.  This includes support for exact Decimal, Date, Time and Interval types.
  • In-query dynamic schema discovery allows you to redefine blob fields as complex objects, using advanced CONVERT_FROM and CONVERT_TO semantics.
  • Support for more than 150 data formats and thousands of existing function libraries through strong integration with Hive Serdes and UDFs.
  • Additional support for high performance native Drill storage plugins and UDFs.
  • A friendly web interface with query and profiling tools including an advanced query plan visualizer and execution flow visualizations.
  • A complete set of interfaces and APIs including support for JDBC, C++, Java, ODBC*, REST and CLI.
  • Advanced dynamic analysis capabilities on top of HBase including dynamic schema discovery, high speed parallel scanning and operator pushdown.
  • Support for in-memory and beyond-memory datasets with an innovative multi-stage sort algorithm that delivers a faster time to first record when sorting than traditional query engines.
  • Ability to meet query SLAs and avoid resource starvation with multiple query resource queues.
  • Support for wide rows with thousands of columns within a single query.
  • An advanced modular design with extensibility points at storage, query, planning and operator execution to work for a large set of standalone or embedded setups.
  • Full scaling: run embedded on Linux, Mac or PC for development purposes or scale up to a full cluster on any platform.
  • Support for use of Zookeeper and HBase for Drill configuration and profiling management.
  • The only open source distributed query engine architected to work with all types of big data, not just Hadoop data sources.
  • Lots more that you can read about by reviewing the Apache Drill Wiki.

Be forewarned: this release is a developer preview release, so you’ll see lots of sharp edges and bugs. This release also has a large number of debugging options enabled and a number of performance optimizations disabled. We recommend against using this release for anything more than initial experimentation. We’re now rolling out monthly releases so you can get a flavor for Drill now and start evaluating it for production use cases in the next couple of months.

Thanks to all the people and supporting organizations who have worked to help create this release.  Together, we’ll enable better, faster decisions while making analysts’ lives easier.  

-The Apache Drill Team 
