Entries tagged [hadoop]

Wednesday January 23, 2019

The Apache Software Foundation Announces Apache® Hadoop® v3.2.0

Pioneering Open Source distributed enterprise framework powers US$166B Big Data ecosystem

Wakefield, MA —23 January 2019— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, today announced Apache® Hadoop® v3.2.0, the latest version of the Open Source software framework for reliable, scalable, distributed computing.

Now in its 11th year, Apache Hadoop is the foundation of the US$166B Big Data ecosystem (source: IDC), enabling data applications to run and be managed on large hardware clusters in a distributed computing environment. "Apache Hadoop has been at the center of this big data transformation, providing an ecosystem with tools for businesses to store and process data on a scale that was unheard of several years ago," according to Accenture Technology Labs.

"This latest release unlocks the powerful feature set the Apache Hadoop community has been working on for more than nine months," said Vinod Kumar Vavilapalli, Vice President of Apache Hadoop. "It further diversifies the platform by building on the cloud connector enhancements from Apache Hadoop 3.0.0 and opening it up for deep learning use-cases and long-running apps."

Apache Hadoop 3.2.0 highlights include:
  • ABFS Filesystem connector —supports the latest Azure Data Lake Gen2 Storage;
  • Enhanced S3A connector —including better resilience to throttled AWS S3 and DynamoDB IO;
  • Node Attributes Support in YARN —helps to tag nodes with multiple labels based on their attributes and supports placing containers based on expressions over these labels;
  • Storage Policy Satisfier —lets HDFS (Hadoop Distributed File System) applications move blocks between storage types as they set storage policies on files/directories;
  • Hadoop Submarine —enables data engineers to easily develop, train and deploy deep learning models (in TensorFlow) on the very same Hadoop YARN cluster;
  • C++ HDFS client —supports async IO to HDFS, which helps downstream projects such as Apache ORC;
  • Upgrades for long-running services —supports in-place seamless upgrades of long-running containers via YARN Native Service API (application program interface) and CLI (command-line interface).
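
To make the Node Attributes item above more concrete, here is a minimal Python sketch of attribute-based placement. It is an illustration only: the node names, the attribute keys and values, and the helper functions are all hypothetical, and the sketch does not use YARN's actual API or its placement-expression syntax.

```python
# Illustrative only: hypothetical nodes and attributes, not YARN's real API.
from typing import Dict, List

# Each node carries a set of key/value attributes (made-up values).
NODES: Dict[str, Dict[str, str]] = {
    "node-1": {"os": "centos7", "gpu": "true", "java": "1.8"},
    "node-2": {"os": "centos7", "gpu": "false", "java": "1.8"},
    "node-3": {"os": "ubuntu16", "gpu": "true", "java": "11"},
}


def matches(attributes: Dict[str, str], required: Dict[str, str]) -> bool:
    """Return True when every required key/value pair is present on the node."""
    return all(attributes.get(key) == value for key, value in required.items())


def candidate_nodes(required: Dict[str, str]) -> List[str]:
    """List, in sorted order, the nodes that satisfy the requirement."""
    return sorted(name for name, attrs in NODES.items() if matches(attrs, required))


if __name__ == "__main__":
    # A container that needs a GPU on a CentOS 7 host.
    print(candidate_nodes({"gpu": "true", "os": "centos7"}))
```

The requirement here is a plain conjunction of equality checks; a real scheduler would likely support richer expressions, which this sketch does not attempt to model.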

"This is one of the biggest releases in Apache Hadoop 3.x line which brings many new features and over 1,000 changes," said Sunil Govindan, Apache Hadoop 3.2.0 release manager. "We are pleased to announce that Apache Hadoop 3.2.0 is available to take your data management requirements to the next level. Thanks to all our contributors who helped to make this release happen."

Apache Hadoop is widely deployed at numerous enterprises and institutions worldwide, such as Adobe, Alibaba, Amazon Web Services, AOL, Apple, Capital One, Cloudera, Cornell University, eBay, ESA Calvalus satellite mission, Facebook, foursquare, Google, Hortonworks, HP, Huawei, Hulu, IBM, Intel, LinkedIn, Microsoft, Netflix, The New York Times, Rackspace, Rakuten, SAP, Tencent, Teradata, Tesla Motors, Twitter, Uber, and Yahoo. The project maintains a list of educational and production users, as well as companies that offer Hadoop-related services at https://wiki.apache.org/hadoop/PoweredBy

Global Knowledge hails, "...the open-source Apache Hadoop platform changes the economics and dynamics of large-scale data analytics due to its scalability, cost effectiveness, flexibility, and built-in fault tolerance. It makes possible the massive parallel computing that today's data analysis requires."

Hadoop is proven at scale: Netflix captures 500+B daily events using Apache Hadoop. Twitter uses Apache Hadoop to handle 5B+ sessions a day in real time. Twitter’s 10,000+ node cluster processes and analyzes more than a zettabyte of raw data through 200B+ tweets per year. Facebook’s cluster of 4,000+ machines that store 300+ petabytes is augmented by 4 new petabytes of data generated each day. Microsoft uses Apache Hadoop YARN to run the internal Cosmos data lake, which operates over hundreds of thousands of nodes and manages billions of containers per day.

Transparency Market Research recently reported that the global Hadoop market is anticipated to grow at a 29% CAGR, reaching a market valuation of US$37.7B by the end of 2023.

Apache Hadoop remains one of the most active projects at the ASF: it ranks #1 for Apache project repositories by code commits, and is the #5 repository by size (3,881,797 lines of code).

"The Apache Hadoop community continues to go from strength to strength in further driving innovation in Big Data," added Vavilapalli. "We hope that developers, operators and users leverage our latest release in fulfilling their data management needs."

Catch Apache Hadoop in action at the Strata conference, 25-28 March 2019 in San Francisco, and dozens of Hadoop MeetUps held around the world, including on 30 January 2019 at LinkedIn in Sunnyvale, California.

Availability and Oversight
Apache Hadoop software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Hadoop, visit http://hadoop.apache.org/ and https://twitter.com/hadoop

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 730 individual Members and 7,000 Committers across six continents successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official global conference series. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Aetna, Alibaba Cloud Computing, Anonymous, ARM, Baidu, Bloomberg, Budget Direct, Capital One, Cerner, Cloudera, Comcast, Facebook, Google, Handshake, Hortonworks, Huawei, IBM, Indeed, Inspur, LeaseWeb, Microsoft, Oath, ODPi, Pineapple Fund, Pivotal, Private Internet Access, Red Hat, Target, Tencent, and Union Investment. For more information, visit http://apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Hadoop", "Apache Hadoop", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Wednesday April 18, 2018

The Apache Software Foundation Announces Apache® Oozie™ v5.0.0

Open Source workflow scheduler for Apache Hadoop used to build complex Big Data transformations.

Wakefield, MA —18 April 2018— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today Apache® Oozie™ v5.0.0, the workflow scheduler for Apache Hadoop.

Apache Oozie is a scalable, reliable, and extensible Java Web application used for job workflow scheduling and operational services management within an Apache Hadoop cluster. Integrated with the Hadoop stack, Oozie supports jobs for Apache projects such as Spark, Hive, MapReduce, Pig, and Sqoop, and can also schedule system-specific jobs, such as Java programs and shell scripts. The project entered the Apache Incubator in 2011, and graduated as an Apache Top-Level Project in 2012.

"Apache Oozie 5's flagship feature, Oozie on YARN, started off as a 1 day hackathon project almost 4 years ago, and it's great to see that the Oozie community has taken it on and made it ready for everyone to use," said Robert Kanter, Vice President of Apache Oozie. "It's a big change to Oozie's architecture, and I think our users are going to be very happy with the benefits it brings."

Apache Oozie allows cluster administrators to build complex Big Data transformations out of multiple component tasks. This provides greater control over jobs and also makes it easier to repeat those jobs at predetermined intervals. 

Oozie combines multiple jobs sequentially into one logical unit of work through 1) Oozie Workflow jobs -- Directed Acyclic Graphs (DAGs) of actions; and 2) Oozie Coordinator jobs -- recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Apache Oozie 5.0.0 includes new features, bug fixes and minor improvements that include:
  • moved launcher from MapReduce mapper to YARN ApplicationMaster;
  • switched from Tomcat 6 to embedded Jetty 9;
  • updated third party libraries;
  • completely rewritten workflow graph generator;
  • JDK 8 support;
  • deprecated Instrumentation in favor of Metrics;
  • added indexes to speed up DB queries; and
  • fixed CVE-2017-15712.

The full list of new features can be found in the project release notes at https://oozie.apache.org/docs/5.0.0/release-log.txt
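
The two job types described above, Workflow jobs (DAGs of actions) and Coordinator jobs (workflows triggered by time and data availability), can be sketched conceptually in Python. This is a minimal illustration under stated assumptions: the action names and both helper functions are hypothetical, and the code models the ideas only, not Oozie's XML definitions or its API.

```python
# Illustrative only: models the concepts, not Oozie's XML or API.
from datetime import datetime, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# A workflow as a DAG: each action maps to the actions it depends on.
# Action names are hypothetical.
WORKFLOW: Dict[str, Set[str]] = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "archive": {"clean"},
    "report": {"aggregate"},
}


def execution_order(workflow: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every action follows its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


def coordinator_should_run(
    now: datetime, last_run: datetime, frequency: timedelta, data_available: bool
) -> bool:
    """Trigger the workflow only when the interval has elapsed and input data exists."""
    return data_available and (now - last_run) >= frequency


if __name__ == "__main__":
    last = datetime(2018, 4, 18, 0, 0)
    now = datetime(2018, 4, 19, 0, 0)
    if coordinator_should_run(now, last, timedelta(days=1), data_available=True):
        print(execution_order(WORKFLOW))
```

Because "aggregate" and "archive" depend only on "clean", more than one valid ordering exists; the sketch returns whichever one the sorter produces.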

"Oozie 5 is a major milestone for the project," said Andras Piros, Apache Oozie committer and Apache Oozie v5.0 Release Manager. "We are proud to provide all the new functionality to big data administrators, data engineers, and data scientists who can leverage a faster, more streamlined, and more secure workflow orchestrator. Features like Oozie on YARN, Jetty 9 support, and ecosystem revamp enable Apache Hadoop users to create and schedule Hadoop jobs in an efficient and modern way not seen before."

"Oozie has long been a staple of a productive Apache Hadoop deployment, playing an important role in orchestrating the rest of the ecosystem. Oozie 5 represents the next step in where Oozie is headed," added Kanter. "The Apache Oozie community has already got some great features in the works for our next release. We welcome anyone who wants to contribute to join us in making Oozie the best it can be."

Availability and Oversight
Apache Oozie software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Oozie, visit http://oozie.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 680 individual Members and 6,500 Committers across six continents successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Aetna, Alibaba Cloud Computing, ARM, Baidu, Bloomberg, Budget Direct, Capital One, Cash Store, Cerner, Cloudera, Comcast, Facebook, Google, Hortonworks, Huawei, IBM, Indeed, Inspur, iSIGMA, ODPi, LeaseWeb, Microsoft, PhoenixNAP, Pivotal, Private Internet Access, Red Hat, Target, Union Investment, and Yahoo. For more information, visit http://apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Oozie", "Apache Oozie", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Wednesday January 10, 2018

The Apache Software Foundation Announces Apache® Trafodion™ as a Top-Level Project

Mature Big Data database management system for working in SQL at Apache Hadoop-scale levels in use at China Mobile, China Unicom, Dell EMC, Esgyn Corporation, and Millersoft Limited, among others.

Forest Hill, MD —10 January 2018— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Trafodion™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

Apache Trafodion extends Apache Hadoop to guarantee transactional integrity and support operational workloads for new kinds of Big Data applications that run on Hadoop.

"We are very excited to have been established as an Apache Top-Level Project," said Pierre Smits, Vice President of Apache Trafodion. "Graduation is a terrific milestone that culminates 2.5 years of contributions from around the globe to establishing a growing community committed to delivering a high-grade OLTP solution on top of the Apache Hadoop ecosystem."

Building on the scalability, elasticity, and flexibility of Hadoop, Trafodion (meaning "transactions" in Welsh) is the first integrated Open Source solution that delivers on the promise of integrated transactional and analytical systems (OLTP/OLAP) for Apache Hadoop. Trafodion's features include:
  • Fully functional ANSI SQL support, leveraging existing SQL skills;
  • Distributed ACID data protection, guaranteeing data consistency across multiple tables and rows;
  • Compile-Time and Run-Time Optimizers, delivering performance improvements for OLTP workloads;
  • Parallel-aware Query Optimizer, supporting large data sets;
  • Apache Spark integration, supporting streaming analysis;
  • Interoperability with existing Apache Hadoop tools and solutions, such as Hive, Ambari, Flume, Kafka, and Oozie; and 
  • Apache Hadoop and Linux distribution neutrality.

Trafodion originated at HP-IT in 2013, and was donated to the Apache Incubator in May 2015. The project has had four official releases since entering the Apache Incubator. 

Apache Trafodion is in use at China Mobile, China Unicom, Dell EMC, Esgyn Corporation, and Millersoft Limited, among others.

"As a member of the HP Core Team responsible for releasing Trafodion to The Apache Software Foundation, and responsible for the project’s name, I'm thrilled to see the Trafodion community be recognized with this major achievement. Congratulations to all who made it possible," said Ken Holt, COO at Esgyn Corporation. "Trafodion is the heart of EsgynDB, and the community is like its lifeblood — we at Esgyn are committed to continue to grow and support the community."

"Congratulations to the Trafodion community for becoming an Apache Top-Level Project," said Tianduo Gao, Senior Development Engineer of Software Technology (Suzhou) at China Mobile. "We are planning to use Trafodion to expand the business of China Mobile's Big Data platform: our data statistics of 4G real-time business in the country and provinces are more efficient than ever before."

"Becoming a core Apache Project is a major step forward for Trafodion. It will give Millersoft the confidence to introduce the technology to our Big Data clients," said Calum Miller, Director of Millersoft Limited. "Testing of our Open Source Data Vault engine running on top of Apache Trafodion is going well and we look forward to announcing a fully integrated product shortly."

"Apache Trafodion enhanced the operational efficiency of our Big Data platforms, and brought us better customer experience and broader application scenarios," said Charles Yu, Managing Director, Application Services at Dell EMC.

"Congratulations to Trafodion for officially becoming part of the Apache open source ecosystem," said Qingquan Gu, Senior Development Engineer of Internet of Things Marketing Service Center at China Unicom. "Using Trafodion provided China Unicom with the ability to build and integrate Big Data platforms, enhanced our operational efficiency, and brought us better customer experience."

"Becoming an Apache Top-Level Project is only the beginning," added Smits. "We are looking forward to growing the Trafodion community, reaching new adopters and contributors, and fostering a strong ecosystem around the project."

Availability and Oversight
Apache Trafodion software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Trafodion, visit http://trafodion.apache.org/ and https://twitter.com/Trafodion

About the Apache Incubator
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 680 individual Members and 6,300 Committers across six continents successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Capital One, Cash Store, Cerner, Cloudera, Comcast, Facebook, Google, Hewlett Packard, Hortonworks, Huawei, IBM, Inspur, iSIGMA, ODPi, LeaseWeb, Microsoft, PhoenixNAP, Pivotal, Private Internet Access, Red Hat, Serenata Flowers, Target, Union Investment, WANdisco, and Yahoo. For more information, visit http://apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Trafodion", "Apache Trafodion", "Hadoop", "Apache Hadoop", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Thursday December 14, 2017

The Apache Software Foundation Announces Apache® Hadoop® v3.0.0 General Availability

Ubiquitous Open Source enterprise framework maintains decade-long leading role in $100B annual Big Data market

Forest Hill, MD —14 December 2017— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, today announced Apache® Hadoop® v3.0.0, the latest version of the Open Source software framework for reliable, scalable, distributed computing.

Over the past decade, Apache Hadoop has become ubiquitous within the greater Big Data ecosystem by enabling firms to run and manage data applications on large hardware clusters in a distributed computing environment.

"This latest release unlocks several years of development from the Apache community," said Chris Douglas, Vice President of Apache Hadoop. "The platform continues to evolve with hardware trends and to accommodate new workloads beyond batch analytics, particularly real-time queries and long-running services. At the same time, our Open Source contributors have adapted Apache Hadoop to a wide range of deployment environments, including the Cloud."

"Hadoop 3 is a major milestone for the project, and our biggest release ever," said Andrew Wang, Apache Hadoop 3 release manager. "It represents the combined efforts of hundreds of contributors over the five years since Hadoop 2. I'm looking forward to how our users will benefit from new features in the release that improve the efficiency, scalability, and reliability of the platform."

Apache Hadoop 3.0.0 highlights include:
  • HDFS erasure coding —halves the storage cost of HDFS while also improving data durability;
  • YARN Timeline Service v.2 (preview) —improves the scalability, reliability, and usability of the Timeline Service;
  • YARN resource types —enables scheduling of additional resources, such as disks and GPUs, for better integration with machine learning and container workloads;
  • Federation of YARN and HDFS subclusters transparently scales Hadoop to tens of thousands of machines;
  • Opportunistic container execution improves resource utilization and increases task throughput for short-lived containers. In addition to its traditional, central scheduler, YARN also supports distributed scheduling of opportunistic containers; and 
  • Improved capabilities and performance for cloud storage systems such as Amazon S3 (S3Guard), Microsoft Azure Data Lake, and Aliyun Object Storage System.
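
The "halves the storage cost" claim for HDFS erasure coding in the list above can be checked with simple arithmetic. The sketch below assumes the comparison is between 3-way replication and a Reed-Solomon layout with 6 data units and 3 parity units; the announcement does not state which scheme its figure is based on, so both the parameters and the function names are assumptions for illustration.

```python
# Worked arithmetic under assumed parameters: 3-way replication versus
# Reed-Solomon with 6 data units and 3 parity units.


def replication_footprint(data_tb: float, replicas: int = 3) -> float:
    """Raw storage needed when every block is stored `replicas` times."""
    return data_tb * replicas


def erasure_coded_footprint(
    data_tb: float, data_units: int = 6, parity_units: int = 3
) -> float:
    """Raw storage needed when each group of data units gains parity units."""
    return data_tb * (data_units + parity_units) / data_units


def replication_tolerated_losses(replicas: int = 3) -> int:
    """Copies of a block that can be lost while one copy still survives."""
    return replicas - 1


def erasure_coded_tolerated_losses(parity_units: int = 3) -> int:
    """Units per group that can be lost while the data stays recoverable."""
    return parity_units


if __name__ == "__main__":
    logical_tb = 100.0
    replicated = replication_footprint(logical_tb)
    coded = erasure_coded_footprint(logical_tb)
    print(f"replication: {replicated} TB raw, erasure coding: {coded} TB raw")
    print(f"ratio: {coded / replicated}")
```

Under these assumptions, 100 TB of data occupies 300 TB with replication and 150 TB with erasure coding, while the number of tolerated losses rises from two copies to three units per group, which is consistent with the durability claim.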

Hadoop 3.0.0 has already undergone extensive testing and integration with the broader Open Source ecosystem at The Apache Software Foundation. With this release, its community of developers and users promotes this release series out of beta.

Apache Hadoop is widely deployed at numerous enterprises and institutions worldwide, such as Adobe, Alibaba, Amazon Web Services, AOL, Apple, Capital One, Cloudera, Cornell University, eBay, ESA Calvalus satellite mission, Facebook, foursquare, Google, Hortonworks, HP, Hulu, IBM, Intel, LinkedIn, Microsoft, Netflix, The New York Times, Rackspace, Rakuten, SAP, Tencent, Teradata, Tesla Motors, Twitter, Uber, and Yahoo. The project maintains a list of known users at https://wiki.apache.org/hadoop/PoweredBy

"It's tremendous to see this significant progress, from the raw tool of eleven years ago, to the mature software in today's release," said Doug Cutting, original co-creator of Apache Hadoop. "With this milestone, Hadoop better meets the requirements of its growing role in enterprise data systems.  The Open Source community continues to respond to industrial demands."

Apache Hadoop's diverse community enjoys continued growth amongst the ASF's most active projects, and remains at the forefront of more than three dozen Apache Big Data projects.

Figure: Apache Hadoop committer history

Apache Hadoop has received countless awards, including top prizes at the Media Guardian Innovation Awards and Duke's Choice Awards, and has been hailed by industry analysts:

"...the lifeblood of organizational analytics…" —Gartner

"Hadoop Is Here To Stay" —Forrester

"...today Hadoop is the only cost-sensible and scalable open source alternative to commercially available Big Data management packages. It also becomes an integral part of almost any commercially available Big Data solution and de-facto industry standard for business intelligence (BI)." —MarketAnalysis.com/Market Research Media

"...commanding half of big data’s $100 billion annual market value...Hadoop is the go-to big data framework." —BigDataWeek.com

"Hadoop, and its associated tools, is currently the 'big beast' of the big data world and the Hadoop environment is undergoing rapid development..." —Bloor Research


"The opportunity to effect meaningful, even fundamental change in the Apache Hadoop project remains open," added Douglas. "Our new contributors uprooted the project from its historical strength in Web-scale analytics by introducing powerful, proven abstractions for data management, security, containerization, and isolation. Apache Hadoop drives innovation in Big Data by growing its community. We hope this latest release continues to draw developers, operators, and users to the ASF."

Catch Apache Hadoop in action at the Strata Data Conference in San Jose, CA, 5-8 March 2018, and at dozens of Hadoop Meetups held around the world.

Availability and Oversight
Apache Hadoop software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Hadoop, visit http://hadoop.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server —the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 680 individual Members and 6,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Capital One, Cash Store, Cerner, Cloudera, Comcast, Facebook, Google, Hortonworks, Huawei, IBM, Inspur, iSIGMA, ODPi, LeaseWeb, Microsoft, PhoenixNAP, Pivotal, Private Internet Access, Red Hat, Serenata Flowers, Target, Union Investment, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Hadoop", "Apache Hadoop", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Tuesday November 28, 2017

The Apache Software Foundation Announces Apache® Impala™ as a Top-Level Project

High performance analytic database for Apache Hadoop, in the Cloud or on-premises, in use at Caterpillar, Cox Automotive, Jobrapido, Marketing Associates, the New York Stock Exchange, phData, and Quest Diagnostics, among others.

Forest Hill, MD —28 November 2017— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Impala™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

Apache Impala is a modern, high-performance analytic database for Apache Hadoop. The massively parallel processing (MPP) SQL query engine allows for analytical queries on data stored on-premises (in HDFS or Apache Kudu) or in Cloud object storage via SQL or business intelligence tools without having to migrate data sets into specialized systems or proprietary formats.

"The Impala project has grown a lot since we entered incubation in December 2015," said Jim Apple, Vice President of Apache Impala. "With the help of our mentors and the Incubator, we have grown as a community and adopted the Apache Way, all while the Impala contributors have helped make Impala more stable and performant."

In addition to using the same unified storage platform as other Hadoop components, Impala also uses the same metadata, SQL syntax (Apache Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Hive. This provides a familiar and unified platform for real-time or batch-oriented queries. Impala provides:
  • A familiar SQL interface that data scientists and analysts already know;
  • The ability to query high volumes of data (Big Data) in Apache Hadoop;
  • Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity hardware;
  • The ability to share data files between different components with no copy or export/import step; for example, to write with Apache Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling simple data interchange using Impala for analytics on Hive-produced data; and
  • A single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for analytics.

Impala was inspired by Google's F1 database, which also separates query processing from storage management. It was originally released in 2012 and entered the Apache Incubator in December 2015. The project has had four releases during its incubation process.

"In 2011, we started development of Impala in order to make state-of-the-art SQL analytics available to the user community as open-source technology," said Marcel Kornacker, original founder of the Impala project. "The graduation to an Apache Top-Level Project is a recognition of the exceptional developer community that stands behind this project."

Apache Impala is deployed across a number of industries such as financial services, healthcare, and telecommunications, and is in use at companies that include Caterpillar, Cox Automotive, Jobrapido, Marketing Associates, the New York Stock Exchange, phData, and Quest Diagnostics. In addition, Impala is shipped by Cloudera, MapR, and Oracle.

"Apache Impala is our interactive SQL tool of choice. Over 30 phData customers have it deployed to production," said Brock Noland, Chief Architect at phData. "Combined with Apache Kudu for real-time storage, Impala has made architecting IoT and Data Warehousing use-cases dead simple. We can deploy more production use-cases with fewer people, delivering increased value to our customers. We're excited to see Impala graduate to a top-level project and look forward to contributing to its success."

"We use Apache Impala to boost performance of our SQL queries against our data lake," said Matteo Coloberti, Head of Analytics at Jobrapido. "Impala is an incredible service that gives us impressive performance on queries."

"We used to distribute Microsoft Excel reports to clients every one or two days but now they can search on their own by customer, sales deal, or even service type," said Andy Frey, CTO of Marketing Associates. "Apache Impala is used to query millions of rows to identify specific records that match the clients' criteria. We've even given clients a 'Query Hadoop' option that allows them to create simple SQL statements and query Hadoop directly via Impala. We're able to offer a faster, richer, and more accurate selection of services without the labor or latency concerns that we used to have."

"The Apache Impala community is growing, and we welcome new contributors to join in our efforts in our code, documentation, issue tracker, and discussion forums," added Apple.

Catch Apache Impala in action at Not Another Big Data Conference, taking place 12 December 2017 in Palo Alto, CA.

Availability and Oversight
Apache Impala software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Impala, visit http://impala.apache.org/ and https://twitter.com/ApacheImpala

About the Apache Incubator
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 680 individual Members and 6,300 Committers across six continents successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Capital One, Cash Store, Cerner, Cloudera, Comcast, Facebook, Google, Hewlett Packard, Hortonworks, Huawei, IBM, Inspur, iSIGMA, ODPi, LeaseWeb, Microsoft, PhoenixNAP, Pivotal, Private Internet Access, Red Hat, Serenata Flowers, Target, WANdisco, and Yahoo. For more information, visit http://apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Impala", "Apache Impala", "Hadoop", "Apache Hadoop", "Hive", "Apache Hive", "Kudu", "Apache Kudu", "Pig", "Apache Pig", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Monday June 05, 2017

The Apache Software Foundation Announces Momentum With Apache® Hadoop® v2.8

Major release of the cornerstone of the Big Data ecosystem, from which dozens of Apache Big Data projects and countless industry solutions originate.

Forest Hill, MD —5 June 2017— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today momentum with Apache® Hadoop® v2.8, the latest version of the Open Source software framework for reliable, scalable, distributed computing.

Now ten years old, Apache Hadoop dominates the greater Big Data ecosystem as the flagship project and community amongst the ASF's more than three dozen projects in the category.

"Apache Hadoop 2.8 maintains the project's momentum in its stable release series," said Chris Douglas, Vice President of Apache Hadoop. "Our community of users, operators, testers, and developers continue to evolve the thriving Big Data ecosystem at the ASF. We're committed to sustaining the scalable, reliable, and secure platform our greater Hadoop community has built over the last decade."

Apache Hadoop supports processing and storage of extremely large data sets in a distributed computing environment. The project has been regularly lauded by industry analysts worldwide for driving market transformation. Forrester Research estimates that firms will spend US$800M on Hadoop software and related services in 2017. According to Zion Market Research, the global Hadoop market is expected to reach approximately US$87.14B by 2022, growing at a CAGR of around 50% between 2017 and 2022.

Apache Hadoop 2.8 is the result of 2 years of extensive collaborative development from the global Apache Hadoop community. With 2,914 commits comprising new features, improvements, and bug fixes since v2.7, highlights include:
  • Several important security-related enhancements, including Hadoop UI protection against Cross-Frame Scripting (XFS), an attack that combines malicious JavaScript with an iframe that loads a legitimate page in an effort to steal data from an unsuspecting user, and Hadoop REST API protection against Cross-Site Request Forgery (CSRF) attacks, which attempt to force an authenticated user to execute functionality without their knowledge.

  • Support for Microsoft Azure Data Lake as a source and destination of data. This benefits anyone deploying Hadoop in Microsoft's Azure Cloud. The Azure Data Lake service was actually developed for Hadoop and analytics workloads.

  • The "S3A" client for working with data stored in Amazon S3 has been radically enhanced for scalability, performance, and security. The performance enhancements were driven by Apache Hive and Apache Spark benchmarks. In Hive TPC-DS benchmarks, Apache Hadoop is currently faster working with columnar data stored in S3 than Amazon EMR's closed-source connector. This shows the benefit of collaborative Open Source development.

  • Several WebHDFS-related enhancements, including an integrated CSRF prevention filter, OAuth2 support, and the ability to allow or disallow snapshots via WebHDFS, among others.

  • Integration with other applications has been improved by separating the hdfs-client JAR from the hadoop-hdfs JAR, which contains all the server-side code. Downstream projects that access HDFS can depend on the hadoop-hdfs-client module to reduce the number of transitive classpath dependencies.

  • YARN NodeManager resource reconfiguration through the RM Admin CLI on a live cluster, which gives YARN clusters a more flexible resource model, especially for Cloud deployments.

In addition to physical Hadoop clusters, where the majority of storage and computation lies, Apache Hadoop is very popular within Cloud infrastructures. Contributions from Apache Hadoop's diverse community include improvements provided by Cloud infrastructure vendors and large Hadoop-in-Cloud users. These improvements, Azure and S3 storage and YARN reconfiguration in particular, improve Hadoop's deployment on and integration with Cloud infrastructures. The improvements in Hadoop 2.8 enable Cloud-deployed clusters to be more dynamic in sizing, adapting to demand by scaling up and down.

"My colleagues and I are happy that tests of Apache Hive and Hadoop 2.8 show that we are able to provide a similar experience reading data in from S3 as Amazon EMR, with its closed-source fork/rewrite of S3," said Steve Loughran, member of the Apache Hadoop Project Management Committee.

Hailed as a "Swiss army knife of the 21st century" by the Media Guardian Innovation Awards and "the most important software you’ve never heard of…helped enable both Big Data and Cloud computing" by author Thomas Friedman, Apache Hadoop is used by an array of companies such as Alibaba, Amazon Web Services, AOL, Apple, eBay, Facebook, foursquare, IBM, HP, LinkedIn, Microsoft, Netflix, The New York Times, Rackspace, SAP, Tencent, Teradata, Tesla Motors, Uber, and Twitter. Yahoo, an early pioneer, hosts the world's largest known Hadoop production environment to date, spanning more than 38,000 nodes.

Catch Apache Hadoop in action at DataWorks Summit 13-15 June 2017 in San Jose, CA.

Availability and Oversight
Apache Hadoop software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Hadoop, visit http://hadoop.apache.org/ and https://twitter.com/hadoop

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 680 individual Members and 6,000 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Capital One, Cash Store, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, ODPi, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, Target, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Hadoop", "Apache Hadoop", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Wednesday February 08, 2017

The Apache Software Foundation Announces Apache® Ranger™ as a Top-Level Project

Big Data security management framework for the Apache Hadoop ecosystem in use at ING, Protegrity, and Sprint, among other organizations.

Forest Hill, MD —8 February 2017— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Ranger™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

The latest addition to the ASF’s more than three dozen projects in Big Data, Apache Ranger is a centralized framework used to define, administer and manage security policies consistently across Apache Hadoop components. Ranger also offers the most comprehensive security coverage, with native support for numerous Apache projects, including Atlas (incubating), HBase, HDFS, Hive, Kafka, Knox, NiFi, Solr, Storm, and YARN. 

"Graduating to a Top-Level Project reflects the maturity and growth of the Ranger Community," said Selvamohan Neethiraj, Vice President of Apache Ranger. "We are pleased to celebrate a great milestone and officially play an integral role in the Apache Big Data ecosystem."

Apache Ranger provides a simple and effective way to set access control policies and audit data access across the entire Hadoop stack by following industry best practices. One of the key benefits of Ranger is that security administrators can manage access control policies from a single place and consistently across the Hadoop ecosystem. Ranger's robust plugin architecture, which can be extended with minimal effort, also enables the community to add new systems for authorization, even outside the Hadoop ecosystem. In addition, Apache Ranger provides many advanced features, such as:
  • Ranger Key Management Service (compatible with Hadoop’s native KMS API to store and manage encryption keys for HDFS Transparent Data Encryption);
  • Dynamic column masking and row filtering;
  • Dynamic policy conditions (such as prohibition of toxic joins);
  • User context enrichers (such as geo-location and time of day mappings); and
  • Classification or tag based policies for Hadoop ecosystem components via integration with Apache Atlas.

"As early adopters of Apache Ranger and having contributed to Apache Ranger, we have come to rely upon Apache Ranger as a key part of our security infrastructure for data," said Ferd Scheepers, Chief Information Architect at ING. "We are therefore pleased to learn that the project has now graduated to a TLP project through the efforts of the Apache community. We believe that Apache Ranger represents the best-in-class Open Source security framework for authorization, encryption management, and auditing across Hadoop ecosystem. We laud the community's efforts in building an extensible and enterprise grade architecture for Apache Ranger, and for innovative features such as tag or classification based security (built in conjunction with Apache Atlas). We congratulate the Apache Ranger community on achieving this significant milestone and are confident Apache Ranger will evolve into the de-facto standard for security stack across the Hadoop ecosystem."

"As heavy users of Apache Ranger in production, we are pleased to see the project become a TLP through validation across community efforts," said Timothy R. Connor, Big Data & Advanced Analytics Manager at Sprint. "Apache Ranger has built a next generation ABAC model for authorization along with a robust data-centric Open Source security framework supporting advanced security capabilities such as dynamic row filtering and column masking. All of these point to Apache Ranger maturing into a robust and comprehensive security product for authorization, encryption management and auditing through the Apache community."

"It's great to see Apache Ranger become a TLP," said Dominic Sartorio, Senior Vice President of Products & Development at Protegrity. "Apache Ranger's comprehensive auditing and broad authorization coverage across the Hadoop ecosystem, along with its highly scalable and extensible architecture and rich set of APIs, integrates very well with Protegrity's fine grained data protection capabilities. Our continued collaboration with the Apache Ranger community will help meet the data security requirements of the next generation of enterprise-grade production Hadoop deployments."

"As organizations entrust their enterprise data to Open Source data platforms such as Apache Hadoop, there is a critical need to use the most innovative techniques to safeguard this data," said Alan Gates, Co-Founder of Hortonworks and Apache Ranger incubation mentor. "Apache Ranger community has taken the original, proprietary code base and used it to build a new and successful Apache project that employs an attribute-based approach to define and enforce authorization policies. This modern approach is a combination of subject, action, resource, and environment and goes beyond role-based access control techniques exclusively based on organizational roles - permissions mapping. It has been a pleasure to be their mentor in this process and help them learn the Apache way."

"More and more users are adopting Apache Ranger to secure data in the Hadoop ecosystem," added Neethiraj. "We look forward to welcoming new Ranger users to our mailing lists and community events."

Availability and Oversight
Apache Ranger software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For project updates, downloads, documentation, and ways to become involved with Apache Ranger, visit https://ranger.apache.org/ and @ApacheRanger.

About the Apache Incubator
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 620 individual Members and 5,900 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Capital One, Cash Store, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, ODPi, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, Target, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Ranger", "Apache Ranger", "HBase", "Apache HBase", "HDFS", "Apache HDFS", "Hive", "Apache Hive", "Kafka", "Apache Kafka", "Knox", "Apache Knox", "NiFi", "Apache NiFi", "Solr", "Apache Solr", "Storm", "Apache Storm", "YARN", "Apache YARN", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #


Tuesday September 20, 2016

The Apache Software Foundation Announces Apache® Kudu™ v1.0

Open Source columnar storage engine for the Apache Hadoop ecosystem in use at Xiaomi, JD Mall, and RMS, among others.

Forest Hill, MD —20 September 2016— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today the availability of Apache® Kudu™ v1.0, the Open Source columnar storage engine built for the Apache Hadoop® ecosystem.

Apache Kudu is designed to enable flexible, high-performance analytic pipelines. Optimized for lightning-fast scans, Kudu is particularly well suited to hosting time-series data and various types of operational data. In addition to its impressive scan speed, Kudu supports many operations available in traditional databases, including real-time insert, update, and delete operations. Kudu enables a "bring your own SQL" philosophy, and supports being accessed by multiple different query engines including such other Apache projects as Drill™, Spark™, and Impala (incubating).

"The Apache Kudu 1.0 release represents a major milestone for the project," said Todd Lipcon, Vice President of Apache Kudu. "One year after the first public beta, the community is confident that Kudu is ready for production for critical business use cases."

Apache Kudu 1.0 is the project's first milestone release since joining The Apache Software Foundation a year ago, and includes a number of important features:
  • Support for redundant and highly available Kudu Master nodes;
  • Support for manual management of range partitioning, critical for time series workloads;
  • Rewritten integration with Apache Spark, including Spark SQL and Data Frame APIs;
  • An officially supported client library for Python; and
  • Substantial performance improvements both for random access and analytic workloads.

These features, along with hundreds of other improvements, bug fixes, and optimizations, represent the work of more than 40 contributors in the Apache community.

Apache Kudu is in use at numerous organizations around the world, spanning industries such as retail, online service delivery, risk management, and digital advertising. Early users of Kudu include Xiaomi (the world's fourth largest smart-phone maker), JD Mall (China's largest B2C online retailer), and RMS (the market leader in catastrophe risk modelling).

After three years of prototyping and development, Kudu was first unveiled to the world at Strata/Hadoop World NYC in September, 2015. Several months later, Kudu was submitted to the Apache Incubator, where the project began to attract a community of active developers and users. In July, 2016, Kudu graduated as an Apache Top-Level Project.

"Kudu 1.0 is the most performant, full-featured, and stable release of Kudu yet. Every day we see new users joining the community, deploying Kudu alongside other Apache projects such as Impala and Spark to solve valuable real-time use cases," added Lipcon. "Kudu expands the Apache Hadoop ecosystem's capabilities, enabling real-time data ingestion and updates while also serving high performance analytics with a substantially simplified architecture."

"The availability of Kudu 1.0 is an exciting milestone and my data science team is eager to evaluate it. We do a lot of work with time series workflows in science data systems and the speed-ups there should really help in our deployment of Kudu," said Chris Mattmann, Chief Architect in the Instrument and Science Data Systems Section at NASA Jet Propulsion Laboratory, and member of the Apache Kudu Project Management Committee.

The Apache Kudu project welcomes contributions and community participation through mailing lists, a Slack channel, face-to-face MeetUps, and other events. Catch Apache Kudu in action at Strata/Hadoop World, 26-29 September in New York City, where engineers from Cloudera, Comcast Xfinity, and GE Digital will present sessions related to Kudu.

Availability and Oversight
Apache Kudu software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Kudu, visit http://kudu.apache.org/ and https://twitter.com/ApacheKudu

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Capital One, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSIGMA, LeaseWeb, Microsoft, ODPi, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Kudu", "Apache Kudu", "Drill", "Apache Drill", "Spark", "Apache Spark", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Wednesday July 27, 2016

Apache Software Foundation Announces Apache® Twill™ as a Top-Level Project

Open Source abstraction layer over Apache Hadoop® YARN simplifies developing distributed Hadoop applications.

Forest Hill, MD –27 July 2016– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Twill™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed Hadoop applications, allowing developers to focus more on their application logic.

"The Twill community is excited to graduate from the Apache Incubator to a Top-Level Project," said Terence Yim, Vice President of Apache Twill and Software Engineer at Cask. "We are proud of the innovation, creativity and simplicity Twill demonstrates. We are also very excited to bring a technology so versatile in Hadoop into the hands of every developer in the industry."

Apache Twill provides rich built-in features for common distributed applications for development, deployment, and management, greatly easing Hadoop cluster operation and administration.

"Enterprises use big data technologies - and specifically Hadoop - to drive more value," said Patrick Hunt, member of the Apache Software Foundation and Senior Software Engineer at Cloudera. "Apache Twill helps streamline and reduce complexity of developing distributed applications and its graduation to an Apache Top-Level Project means more people will be able to take advantage of Apache Hadoop YARN more easily."

"This is an exciting and major milestone for Apache Twill," said Keith Turner, member of the Apache Fluo (incubating) Project Management Committee, which used Twill in the development of Fluo, an Open Source project that makes it possible to update the results of a large-scale computation, index, or analytic as new data is discovered. "Early in development, we knew we needed a standard way to launch Fluo across a cluster, and we found Twill. With Twill, we quickly and easily had Fluo running across many nodes on a cluster." 

Apache Twill is in production by several organizations across various industries, easing distributed Hadoop application development and deployment.

Twill originated at Cask in early 2013. After 7 major releases, the project was submitted to the Apache Incubator in November of 2013.

"Apache Twill has come a long way through The Apache Software Foundation, and we're thrilled it has become an ASF Top-Level Project," said Nitin Motgi, CTO of Cask. "Apache Twill has become a key component behind the Cask Data Application Platform (CDAP), using YARN containers and Java threads as the processing abstraction. CDAP is an Open Source integration and application platform that makes it easy for developers and organizations to quickly build, deploy and manage data applications on Apache Hadoop and Apache Spark."

"The Apache Twill community worked extremely well within the incubator environment, developing and collaborating openly to follow The Apache Way," said Henry Saputra, ASF Member and member of the Apache Twill Project Management Committee. "There is a tremendous demand for effective APIs and virtualization for developing big data applications and Apache Twill fills that need perfectly. We’re looking forward to continuing the journey with Apache Twill as a Top-Level Project."

Catch Apache Twill in action at:
  • JavaOne, 18-22 September 2016 in San Francisco
  • Strata+Hadoop World, 27-29 September 2016 in New York City

Availability and Oversight
Apache Twill software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Twill, visit http://twill.apache.org/ and follow @ApacheTwill

About the Apache Incubator
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, ODPi, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Twill", "Apache Twill", "Hadoop", "Apache Hadoop", "Apache Hadoop YARN", "Spark", "Apache Spark", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Tuesday July 26, 2016

The Apache Software Foundation Announces Apache® Kudu™ as a Top-Level Project

Open Source columnar storage engine enables fast analytics across the Internet of Things, time series, cybersecurity, and other Big Data applications in the Apache Hadoop ecosystem

Forest Hill, MD –25 July 2016– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Kudu™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

Apache Kudu is an Open Source columnar storage engine built for the Apache Hadoop ecosystem designed to enable flexible, high-performance analytic pipelines.

"Under the Apache Incubator, the Kudu community has grown to more than 45 developers and hundreds of users," said Todd Lipcon, Vice President of Apache Kudu and Software Engineer at Cloudera. "Recognizing the strong Open Source community is a testament to the power of collaboration and the upcoming 1.0 release promises to give users an even better storage layer that complements Apache HBase and HDFS."

Optimized for lightning-fast scans, Kudu is particularly well suited to hosting time-series data and various types of operational data. In addition to its impressive scan speed, Kudu supports many operations available in traditional databases, including real-time insert, update, and delete operations. Kudu enables a "bring your own SQL" philosophy and can be accessed by multiple query engines, including other Apache projects such as Drill, Spark, and Impala (incubating).

Apache Kudu is in use at diverse companies and organizations across many industries, including retail, online service delivery, risk management, and digital advertising.

"Using Apache Kudu alongside interactive SQL tools like Apache Impala (incubating) has allowed us to deploy a next-generation platform for real-time analytics and online reporting," said Baoqiu Cui, Chief Architect at Xiaomi. "Apache Kudu has been deployed in production at Xiaomi for more than six months and has enabled us to improve key reliability and performance metrics for our customers. Kudu's graduation to a Top-Level Project allows companies like ours to operate a hybrid architecture without complexity. We look forward to continuing to contribute to its success."

"We are already seeing the many benefits of Apache Kudu. In fact we're using its combination of fast scans and fast updates for upcoming releases of our risk solutions," said Cory Isaacson, CTO at Risk Management Solutions, Inc. "Kudu is performing well, and RMS is proud to have contributed to the project’s integration with Apache Spark."

"The Internet of Things, cybersecurity and other fast data drivers highlight the demands that real-time analytics place on Big Data platforms," said Arvind Prabhakar, Apache Software Foundation member and CTO of StreamSets. "Apache Kudu fills a key architectural gap by providing an elegant solution spanning both traditional analytics and fast data access. StreamSets provides native support for Apache Kudu to help build real-time ingestion and analytics for our users."

"Graduation to a Top-Level Project marks an important milestone in the Apache Kudu community, but we are really just beginning to achieve our vision of a hybrid storage engine for analytics and real-time processing," added Lipcon. "As our community continues to grow, we welcome feedback, use cases, bug reports, patch submissions, documentation, new integrations, and all other contributions."

The Apache Kudu project welcomes contributions and community participation through mailing lists, a Slack channel, face-to-face MeetUps, and other events. Catch Apache Kudu in action at Strata + Hadoop World, 26-29 September 2016 in New York. 

Availability and Oversight
Apache Kudu software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For project updates, downloads, documentation, and ways to become involved with Apache Kudu, visit http://kudu.apache.org/ , @ApacheKudu, and http://kudu.apache.org/blog/.

About the Apache Incubator
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/

© The Apache Software Foundation. "Apache", "Kudu", "Apache Kudu", "Drill", "Apache Drill", "Hadoop", "Apache Hadoop", "Apache Impala (incubating)", "Spark", "Apache Spark", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Wednesday May 25, 2016

The Apache Software Foundation Announces Apache® Zeppelin™ as a Top-Level Project

Open Source Big Data analytics and visualization tool for distributed, interactive, and collaborative systems using Apache Flink, Apache Hadoop, Apache Spark, and more.

Forest Hill, MD –25 May 2016– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Zeppelin™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

Apache Zeppelin is a modern, web-based notebook that enables interactive data analytics. Notebooks help developers, data scientists, and related users to handle data efficiently without worrying about command lines and cluster details.

"The Zeppelin community is pleased to graduate from the Apache Incubator," said Lee Moon Soo, Vice President of Apache Zeppelin. "With 118 worldwide contributors and widespread adoption in numerous commercial products, we are proud to officially be a part of the Apache Big Data ecosystem."

Zeppelin's collaborative data analytics and visualization capabilities make data exploration, visualization, sharing, and collaboration easy over distributed, general-purpose data processing systems that use Apache Flink, Apache Hadoop, and Apache Spark, among other Big Data platforms.

Apache Zeppelin is:
  • Multi-purpose --features data ingestion, exploration, analysis, visualization, and collaboration;
  • Robust --supports more than 20 backend systems, including Apache Spark, Apache Flink, Apache Hive, Python, R, and any JDBC (Java Database Connectivity) source;
  • Easy to deploy --built on top of modern Web technologies (provides built-in Apache Spark integration, eliminating the need to build a separate module, plugin, or library);
  • Easy to use --with built-in visualizations and dynamic forms;
  • Flexible --allows users to mix different languages, exchange data between backends, adjust the layout;
  • Extensible --with pluggable architecture for interpreters, notebook storages, authentication, and visualizations (in progress); and
  • Advanced --allows interaction between custom visualizations and cluster resources.

"With Apache Zeppelin, a wide range of users can make beautiful data-driven, interactive, and collaborative documents with SQL, Scala, and more," added Soo.

Apache Zeppelin is in use at an array of organizations and solutions, including Amazon Web Services, Hortonworks, JuJu, and Twitter, among others. 

"Congratulations to Apache Zeppelin community on graduation," said Tim Hall, Vice President of Product Management at Hortonworks. "Several members of our team have been working over the past year in the Zeppelin community to make it enterprise ready. We are excited to be associated with this community and look forward to helping our customers get the best insights out of their data with Apache Zeppelin."

"Apache Zeppelin is becoming an important tool at Twitter for creating and sharing interactive data analytics and visualizations," said Prasad Wagle, Technical Lead in the Data Platform team at Twitter. "Since it integrates seamlessly with all the popular data analytics engines, it is very easy to create and share reports and dashboards. With its extensible architecture and a vibrant Open Source community, I am looking forward to Apache Zeppelin advancing the state of the art in data analytics and visualization."

"Apache Zeppelin is the major user-facing piece of Memcore’s in-memory data processing Cloud offering. Building a technology stack might be quite exciting engineering challenge, however, if users can’t visualize and work with the data conveniently, it is as good as not having the data at all. Apache Zeppelin enables efficient user acquisition by anyone trying to build new products or service offerings in the Big- and Fast- Data markets, making innovations, collaboration, and development easier for anyone," said Dr. Konstantin Boudnik, Founder and CEO of Memcore.io. "I am very excited to see Apache Zeppelin graduating as an ASF Top Level Project. This shows that more people are joining the community, bringing the project to a new level, and adding more integration points with existing data analytics and transactional software systems. This directly benefits the community at-large."

Apache Zeppelin originated in 2013 at NFLabs as Peloton, a commercial data analytics product. Since entering the Apache Incubator in December 2014, the project has had three releases, and twice participated in Google Summer of Code under the Apache umbrella.

"It was an honor to help with the incubation of Zeppelin," said Ted Dunning, Vice President of the Apache Incubator. "I have been very impressed with the Zeppelin community and the software they have built. I see Apache Zeppelin being adopted all over the place where people need to apply a notebook style to a wide variety of kinds of computing."

Catch Apache Zeppelin in action during Berlin Buzzwords, 7 June 2016 https://s.apache.org/mV8E

Availability and Oversight
Apache Zeppelin software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Zeppelin, visit http://zeppelin.apache.org/ and https://twitter.com/ApacheZeppelin

© The Apache Software Foundation. "Apache", "Zeppelin", "Apache Zeppelin", "Ambari", "Apache Ambari", "Flink", "Apache Flink", "Hadoop", "Apache Hadoop", "Hive", "Apache Hive", "Spark", "Apache Spark", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Monday April 25, 2016

The Apache Software Foundation Announces Apache® Apex™ as a Top-Level Project

Open Source enterprise-grade unified Big Data stream and batch processing engine for Apache Hadoop in use at GE, Silver Spring Networks, and more.

Forest Hill, MD –25 April 2016– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Apex™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

Apache Apex is a large scale, high throughput, low latency, fault tolerant, unified Big Data stream and batch processing platform for the Apache Hadoop® ecosystem.

"It is very exciting to see Apex after nearly 4 years since inception becoming an ASF top-level project," said Thomas Weise, Vice President of Apache Apex. "It opens the strong capabilities and potential of the platform to a wider audience and we’re looking forward to a growing community to continue driving innovation in the stream processing space."

Recognized by InfoWorld for its "blazing speed and simplified programmability," Apex works in conjunction with Apache Hadoop YARN, a resource management platform for working with Hadoop clusters.

Apex was originally created at DataTorrent Inc. in 2012 (coinciding with the first alpha release of YARN), and entered the Apache Incubator in August 2015.

Apex enables streaming analytics on Apache Hadoop with an enterprise-grade platform. It has been built to leverage the underlying infrastructure provided by YARN and HDFS (Hadoop Distributed File System), including resource management, multi-tenancy and security. 

Faster to Deployment
Apache Apex meets the demands of today's Big Data applications with real-time reporting, monitoring, and learning with millisecond data point precision. Its pipeline processing architecture can be used for real-time and batch processing in a unified architecture. Apex is highly performant, linearly scalable, fault tolerant, stateful, secure, distributed, and easily operable, offering low latency, no data loss, and exactly-once semantics.

Apex streamlines development and productization of Hadoop applications and lowers the barrier-to-entry by enabling developers to write or re-use generic Java code, minimizing the specialized expertise needed to write Big Data applications. This allows organizations to maximize developer productivity, accelerate development of business logic, and reduce time to market.

"Apache Apex is an example of the latest generation of advanced stream processing software that adds significant technology and capabilities over previous options," said Ted Dunning, Vice President of the Apache Incubator, Apache Apex Incubator Mentor, and Chief Application Architect at MapR Technologies. "That this project came to Apache and is now a fully fledged project is very exciting."

Apex comes with a comprehensive library of reusable operators (functional building blocks) that can be leveraged to quickly create new and non-trivial applications. This also includes connectors to integrate with many external systems that include message buses, databases, file systems and social media feeds. Examples are Apache Cassandra, Apache HBase, JDBC, and Apache Kafka.

"Apache Apex is a battle-hardened technology, processing huge volumes of streaming data at some of the world’s largest enterprise and Internet companies," said technology advisor Eric Baldeschwieler. "Its successful Apache incubation has provided a tremendous boost to Apex, bringing many new members to its community of users and developers."

Enterprise Grade Unified Stream and Batch Processing
Apache Apex use cases include ingestion, fast real-time analytics, data movement, Extract-Transform-Load (ETL), fast batch, alerts, and real-time actions across diverse industries such as programmatic advertising, telecommunications, Internet of Things (IoT), and financial services.

"We are in the process of leveraging Big Data technologies to transform business processes and drive more value," explained Reid Levesque, Head of Solution Engineering at a financial services company. "We chose Apex to help us in this journey to do real-time ingestion and analytics on our various data sources and now we are proud to see it graduate to an Apache top level project."

Apex powers Big Data projects in production at numerous large enterprises such as GE Predix (IoT Cloud platform for industrial data and analytics), PubMatic (marketing automation software platform for publishers), and Silver Spring Networks (IoT solutions for smart cities).

"We at GE Predix data services have used Apex for our data pipeline product and look forward to our continued usage and contribution," said Parag Goradia, Executive Director of Predix Data Services. "We had great experience with Apache Apex and its capabilities. We believe Apex has a bright future as it will continue to solve big problems in the big data industry. We are proud to be associated with this project and excited that it is now in top level status."

"The Apex community has done a great job throughout the incubation process. They have built a robust community and demonstrated a firm understanding of The Apache Way," said P. Taylor Goetz, ASF Member and Apache Apex Incubator Mentor. "I'm pleased to see Apex graduate to a top-level project. These are exciting times in the world of stream processing."

"Congratulations to the Apache Apex community for working successfully through the incubation process and becoming part of the greater Apache Hadoop ecosystem," added Dunning.

Catch Apache Apex in action at:

Availability and Oversight
Apache Apex software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Apex, visit http://apex.apache.org/ and https://twitter.com/ApacheApex

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Apex", "Apache Apex", "Cassandra", "Apache Cassandra", "HBase", "Apache HBase", "Hadoop", "Apache Hadoop", "Kafka", "Apache Kafka", "YARN", "Apache YARN", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #

Friday April 01, 2016

Announcing creation of the Hadoop Software Foundation

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today the creation of the independent Hadoop Software Foundation (HSF), the new governing body for big data's biggest software program, Apache® Hadoop®.

Tuesday May 19, 2015

The Apache Software Foundation Announces Apache™ Drill™ 1.0

Thousands of users adopt Open Source, enterprise-grade, schema-free SQL query engine for Apache Hadoop®, NoSQL, and Cloud storage.

Forest Hill, MD --19 May 2015-- The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today the availability of Apache™ Drill™ 1.0, the schema-free SQL query engine for Apache Hadoop®, NoSQL, and Cloud storage.

"The production-ready 1.0 release represents a significant milestone for the Drill project," said Tomer Shiran, member of the Apache Drill Project Management Committee. "It is the outcome of almost three years of development involving dozens of engineers from numerous companies. Apache Drill's flexibility and ease-of-use have attracted thousands of users, and the enterprise-grade reliability, security and performance in the 1.0 release will further accelerate adoption."

With the exponential growth of data in recent years, and the shift towards rapid application development, new data is increasingly being stored in non-relational, schema-free datastores including Hadoop, NoSQL and Cloud storage. Apache Drill revolutionizes data exploration and analytics by enabling analysts, business users, data scientists and developers to explore and analyze this data without sacrificing the flexibility and agility offered by these datastores. Drill processes the data in-situ without requiring users to define schemas or transform data.

"Drill introduces the JSON document model to the world of SQL-based analytics and BI," said Jacques Nadeau, Vice President of Apache Drill. "This enables users to query fixed-schema, evolving-schema and schema-free data stored in a variety of formats and datastores. The architecture of relational query engines and databases is built on the assumption that all data has a simple and static structure that’s known in advance, and this 40-year-old assumption is simply no longer valid. We designed Drill from the ground up to address the new reality."

Apache Drill's architecture is unique in many ways. It is the only columnar execution engine that supports complex and schema-free data, and the only execution engine that performs data-driven query compilation (and re-compilation, also known as schema discovery) during query execution. These unique capabilities enable Drill to achieve record-breaking performance with the flexibility offered by the JSON document model.
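To illustrate what querying schema-free data involves, the minimal Python sketch below discovers the set of columns from the records themselves while scanning, rather than from a schema declared in advance. It illustrates the general idea only and is not Drill's implementation; the `scan` and `select` functions and the sample records are hypothetical.

```python
import json

# Hypothetical sample: JSON records whose fields vary from row to row.
RAW = """
{"name": "a", "clicks": 3}
{"name": "b", "clicks": 5, "country": "DE"}
{"name": "c", "country": "US"}
"""

def scan(lines):
    """Yield (columns_so_far, record), widening the column list as new fields appear."""
    columns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key in record:
            if key not in columns:
                columns.append(key)  # schema grows during the scan
        yield list(columns), record

def select(lines, wanted):
    """Project the requested fields; a field a record lacks becomes None."""
    return [tuple(rec.get(col) for col in wanted) for _, rec in scan(lines)]

rows = select(RAW.splitlines(), ["name", "country"])
discovered = [cols for cols, _ in scan(RAW.splitlines())][-1]
```

Here `rows` holds the projection of `name` and `country` for every record, and `discovered` is the column list as it stands after the last record has been read.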

The business intelligence (BI) partner ecosystem is embracing the power of Apache Drill. Organizations such as Information Builders, JReport (Jinfonet Software), MicroStrategy, Qlik®, Simba, Tableau, and TIBCO are working closely with the Drill community to interoperate BI tools with Drill through standard ODBC and JDBC connectivity. This collaboration enables end users to explore data by leveraging sophisticated visualization tools and advanced analytics.

"We've been using Apache Drill for the past six months," said Andrew Hamilton, CTO of Cardlytics. "Its ease of deployment and use along with its ability to quickly process trillions of records has made it an invaluable tool inside Cardlytics. Queries that were previously insurmountable are now common occurrence. Congratulations to the Drill community on this momentous occasion." 

"Drill's columnar execution engine and optimizer take full advantage of Apache Parquet's columnar storage to achieve maximum performance," said Julien Le Dem, Technical Lead of Analytics Data Pipeline at Twitter and Vice President of Apache Parquet. "The Drill team has been a key contributor to the Parquet project, including recent enhancements to Parquet types and vectorization. The Drill team’s involvement in the Parquet community is instrumental in driving the standard."

"Apache Drill 1.0 raises the bar for secure, reliable and scalable SQL-on-Hadoop," said Piyush Bhargava, distinguished engineer, IT, Cisco Systems. "Because Drill integrates with existing data virtualization and visualization tools, we expect it will improve adoption of self-service data exploration and large-scale BI queries on our advanced Hadoop platform at Cisco."

"MicroStrategy recognized early on the value of Apache Drill and is one of the first analytic platforms to certify Drill," said Tim Lang, senior executive vice president and chief technology officer at MicroStrategy Incorporated.  "Because Drill is designed to be used with a minimal learning curve, it opens up more complex data sets to the end user who can immediately visualize and analyze new information using MicroStrategy’s advanced capabilities."

"Apache Drill closes a gap around self-service SQL queries in Hadoop, especially on complex, dynamic NoSQL data types," said Mike Foster, Strategic Alliances Technology Officer at Qlik.  "Drill's performance advantages for Hadoop data access, combined with the Qlik associative experience, enables our customers to continue discovering business value from a wide range of data. Congratulations to the Apache Drill community."

"Apache Drill empowers people to access data that is traditionally difficult to work with," said Jeff Feng, product manager, Tableau.  "Direct access within a centralized data repository and without pre-generating metadata definitions encourages data democracy which is essential for data-driven organizations. Additionally, Drill's instant and secure access to complex data formats, such as JSON, opens up extended analytical opportunities."

"Congratulations to the Apache Drill community on the availability of 1.0," said Karl Van den Bergh, Vice President, Products and Cloud at TIBCO. "Drill promises to bring low-latency access to data stored in Hadoop and HBase via standard SQL semantics. This innovation is in line with the value of Fast Data analysis, which TIBCO customers welcome and appreciate."

"The community's accomplishment is a testament to The Apache Software Foundation's ability to bring together diverse companies to work towards a common goal. None of this would have been possible without the contribution of engineers with advanced degrees and experience in relational databases, data warehousing, MPP, query optimization, Hadoop and NoSQL," added Nadeau. "Our community's strength is what will solidify Apache Drill as a key data technology for the next decade. We welcome interested individuals to learn more about Drill by joining the community's mailing lists, attending upcoming talks by Drill code committers at various conferences including Hadoop Summit, NoSQL Now, Hadoop World, or at a local Apache Drill MeetUp."

Availability and Oversight
Apache Drill 1.0 is available immediately as a free download from http://drill.apache.org/download/. Documentation is available at http://drill.apache.org/docs/. As with all Apache products, Apache Drill software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the project's day-to-day operations, including community development and product releases. For ways to become involved with Apache Drill, visit http://drill.apache.org/ and @ApacheDrill on Twitter.

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 500 individual Members and 4,500 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Bloomberg, Budget Direct, Cerner, Citrix, Cloudera, Comcast, Facebook, Google, Hortonworks, HP, IBM, InMotion Hosting, iSigma, Matt Mullenweg, Microsoft, Pivotal, Produban, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ or follow @TheASF on Twitter.

© The Apache Software Foundation. "Apache", "Apache Drill", "Drill", "Apache Hadoop", "Hadoop", "Apache Parquet", "Parquet", and "ApacheCon" are registered trademarks or trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.

# # #

Monday April 27, 2015

The Apache Software Foundation Announces Apache™ Parquet™ as a Top-Level Project

Open Source storage format for the Apache™ Hadoop® ecosystem in use at Cloudera, NASA, Netflix, Stripe, and Twitter, among other organizations 

Forest Hill, MD --27 April 2015-- The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache™ Parquet™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

"The incubation process at Apache has been fantastic and really the last step of making Parquet a community driven standard fully integrated within the greater Hadoop ecosystem," said Julien Le Dem, Vice President of Apache Parquet.

Apache Parquet is an Open Source columnar storage format for the Apache™ Hadoop® ecosystem, built to work across programming languages and much more:
  • processing frameworks (MapReduce, Apache Spark, Scalding, Cascading, Crunch, Kite)
  • data models (Apache Avro, Apache Thrift, Protocol Buffers, POJOs)
  • query engines (Apache Hive, Impala, HAWQ, Apache Drill, Apache Tajo, Apache Pig, Presto, Apache Spark SQL)

"At Twitter, Parquet has helped us scale our big data usage by in some cases reducing storage requirements by one third on large datasets as well as scan and deserialization time. This translated into hardware savings as well as reduced latency for accessing the data. Furthermore, Parquet being integrated with so many tools creates opportunities and flexibility regarding query engines," said Chris Aniszczyk, Head of Open Source at Twitter. "Finally, it's just fantastic to see it graduate to a top-level project and we look forward to further collaborating with the Apache Parquet community to continually improve performance."

"Parquet's integration with other object models, like Avro and Thrift, has been a key feature for our customers," said Ryan Blue, Software Engineer at Cloudera. "They can take advantage of columnar storage without changing the classes they already use in their production applications."

"At Netflix, Parquet is the primary storage format for data warehousing. More than 7 petabytes of our 10+ Petabyte warehouse is Parquet formatted data that we query across a wide range of tools including Apache Hive, Apache Pig, Apache Spark, PigPen, Presto, and native MapReduce. The performance benefit of columnar projection and statistics is a game changer for our big data platform," said Daniel Weeks, Software Engineer at Netflix. "We look forward to working with the Apache community to advance the state of big data storage with Parquet and are excited to see the project graduate to full Apache status."

"Stripe's data warehouse has been built on Parquet from the beginning," said Avi Bryant, Engineering Manager at Stripe. "Every aspect of our pipeline, from data import to machine learning to adhoc SQL analysis, uses Apache Parquet as the common interchange format."

"I was extremely happy to see Parquet arrive as an Incubator project," said Chris Mattmann, Apache Parquet Incubator Mentor, and Chief Architect, Instrument and Science Data Systems Section at NASA Jet Propulsion Laboratory. "After talking with some in its community there was a real match with this columnar data format technology and its community with the way that we do things here at the ASF. Parquet has had an exemplar Incubation, and the project has big things ahead of it. I am encouraging my Data Science Team at NASA to evaluate it for data representation especially as it relates to our science holdings in Earth, planetary and space sciences, and astrophysics."

Catch Apache Parquet in action at the Hadoop Summit, 9-11 June 2015 in San Jose, California. The Apache Parquet project welcomes contributions and community participation through mailing lists, face-to-face MeetUps, and user events. For more information, visit http://parquet.apache.org/community/

Availability and Oversight
Apache Parquet software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Parquet, visit http://parquet.apache.org/ and https://twitter.com/ApacheParquet

About the Apache Incubator
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/.

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 500 individual Members and 4,500 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Bloomberg, Budget Direct, Cerner, Citrix, Cloudera, Comcast, Facebook, Google, Hortonworks, HP, IBM, InMotion Hosting, iSigma, Matt Mullenweg, Microsoft, Pivotal, Produban, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ or follow @TheASF on Twitter.

© The Apache Software Foundation. "Apache", "Avro", "Apache Avro", "Drill", "Apache Drill", "Hadoop", "Apache Hadoop", "Parquet", "Apache Parquet", "Pig", "Apache Pig", "Spark", "Apache Spark", "Thrift", "Apache Thrift", and "ApacheCon" are registered trademarks or trademarks of The Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #
