The Apache Software Foundation Blog

Wednesday February 03, 2021

The Apache Software Foundation Announces Apache® DataSketches™ as a Top-Level Project

Open Source high-performance Big Data streaming algorithm library in use at Nielsen Identity, Permutive, Splice Machine, and Verizon Media, among others.

Wilmington, DE —3 February 2021— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today Apache® DataSketches™ as a Top-Level Project (TLP).

Apache DataSketches is a highly performant Big Data analysis library for scalable approximate algorithms. The project originated at Yahoo in 2012, was open-sourced in 2015, and entered the Apache Incubator in March 2019.

"We are excited to be part of the ASF," said Lee Rhodes, Vice President of Apache DataSketches. "We have learned a great deal from the incubation process and look forward to working with new users of our library that want to take advantage of sketching technology."

Apache DataSketches’s library of specialized streaming algorithms, known as sketches, comprises small data structures that process data at massive scale. Sketches are ideal for queries that cannot afford the time or huge compute resources needed to generate exact results. Where approximate results are acceptable, sketches are the only viable alternative for interactive queries with real-time analysis. Apache DataSketches is:

  • Fast —produces approximate results orders of magnitude faster than traditional methods, with a user-configurable size vs. accuracy tradeoff;
  • Efficient —sketch algorithms process data in a single pass for both real-time and batch;
  • Mergeable —allows for parallelization;
  • Optimized for large-scale computing environments that process Big Data —such as Apache Hadoop, Apache Spark, Apache Druid, Apache Hive, Apache Pig, PostgreSQL;
  • Binary compatible across multiple languages and platforms —available in Java, C++, and Python;
  • Expanded Analysis —including count distinct with set operations, quantiles, most frequent items (heavy hitters), matrix computations, and more; and
  • Mathematically defined and proven error properties —provides a priori and a posteriori error estimation and upper and lower bounds with statistically derived confidence intervals.
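
The single-pass, mergeable, and size-versus-accuracy properties listed above can be illustrated with a toy example. The following is a minimal sketch in Python of a textbook K-Minimum-Values (KMV) distinct-count estimator, swapped in here purely for illustration; it is not the Apache DataSketches implementation or API. The class name KMVSketch, its methods, and the parameter k are hypothetical names chosen for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative only)."""

    HASH_SPACE = 2 ** 64

    def __init__(self, k=1024):
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k            # larger k: more memory, lower error
        self._heap = []       # max-heap (negated) of the k smallest hashes
        self._members = set() # the same hashes, for duplicate checks

    @staticmethod
    def _hash(item):
        # Deterministic 64-bit hash taken from a SHA-256 digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _insert(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest retained hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        """Process one stream item; each item is seen exactly once."""
        self._insert(self._hash(item))

    def estimate(self):
        """Estimated number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        kth_smallest = -self._heap[0] / self.HASH_SPACE
        return (self.k - 1) / kth_smallest

    def merge(self, other):
        """Return a new sketch summarizing the union of both streams."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._insert(h)
        return merged


# Two partitions of a stream, sketched independently and then merged.
part_a = KMVSketch(k=1024)
part_b = KMVSketch(k=1024)
for user_id in range(0, 60_000):
    part_a.update(user_id)
for user_id in range(40_000, 100_000):
    part_b.update(user_id)

union = part_a.merge(part_b)
print("estimated distinct users:", round(union.estimate()))
```

Each partition retains at most k hash values regardless of stream length, and the merged result is the same summary a single pass over all the data would have produced. For this toy estimator the relative standard error is roughly 1/sqrt(k), which is the size vs. accuracy tradeoff in miniature.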

Apache DataSketches is used in large-scale computing environments such as Nielsen Identity, Permutive, Splice Machine, and Verizon Media, among others, as well as Apache Druid and Apache Pinot (incubating).

"The Apache DataSketches project takes powerful algorithms for data summarization and analysis, and makes them available to everyone," said Professor Graham Cormode of the University of Warwick. "While these methods are tremendously useful in practice, their descriptions were previously only in highly technical scientific papers. This project has made robust, dependable and well-documented implementations available to all. Already the library has been used for a wide range of applications, including service quality, monitoring, ad analytics and the sciences."

"Using Apache DataSketches has enabled Apache Druid users to perform common tasks such as quantiles and unique counting in a highly performant and efficient manner," said Gian Merlino, Vice President of Apache Druid. "We have worked closely together over the years to make the power of DataSketches accessible to Apache Druid users, helping us provide real-time analytics at scale."

"Sketches are fundamental to calculating many of our key company metrics," said Tom Miller, Director of Software Development Engineering at Verizon Media. "It allows us to greatly simplify our data processing and reduce storage costs by allowing us to calculate non-additive metrics across user specified dimension combinations at report time instead of having to either retain raw data or pre-calculate for each set of dimensions."

"Combining Apache Druid and DataSketches allows us to provide our customers real-time insights into their target audiences and advertising campaigns," said Yakir Buskilla, Senior Vice President of Research and Development and General Manager Israel at Nielsen Identity. "The ability to evaluate set expressions make the Theta Sketch especially powerful for multi-set cardinality estimation as well as funnel analysis."

"Apache DataSketches has provided us with a solid theoretical foundation upon which we are able to store and process data at scale - in a simple, fast and cost-efficient manner," said David Cromberge, Senior Software Engineer at Permutive. "It has been a pleasure to engage with their creators and community who have been helpful at every step of the way."

"We use DataSketches's Theta-Sketches for distinct-count aggregations that are used to solve large multi-set cardinality approximation," said Mayank Shrivastava, Committer and member of the Apache Pinot (incubating) Podling Project Management Committee. "The ability to evaluate set expressions make the Theta Sketch especially powerful for multi-set cardinality estimation as well as funnel analysis."

"We welcome those interested in streaming algorithms to visit us, learn about this exciting technology, and contribute to Apache DataSketches to make our project even better," added Rhodes.

Availability and Oversight
Apache DataSketches software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache DataSketches, visit https://datasketches.apache.org .

About the Apache Incubator
The Apache Incubator is the primary entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects enter the ASF through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/ .

About The Apache Software Foundation (ASF)
Established in 1999, The Apache Software Foundation is the world’s largest Open Source foundation, stewarding 227M+ lines of code and providing $20B+ worth of software to the public at 100% no cost. The ASF’s all-volunteer community grew from 21 original founders overseeing the Apache HTTP Server to 813 individual Members and 206 Project Management Committees who successfully lead 350+ Apache projects and initiatives in collaboration with nearly 8,000 Committers through the ASF’s meritocratic process known as "The Apache Way". Apache software is integral to nearly every end user computing device, from laptops to tablets to mobile devices across enterprises and mission-critical applications. Apache projects power most of the Internet, manage exabytes of data, execute teraflops of operations, and store billions of objects in virtually every industry. The commercially-friendly and permissive Apache License v2 is an Open Source industry standard, helping launch billion dollar corporations and benefiting countless users worldwide. The ASF is a US 501(c)(3) not-for-profit charitable organization funded by individual donations and corporate sponsors including Aetna, Alibaba Cloud Computing, Amazon Web Services, Anonymous, Baidu, Bloomberg, Budget Direct, Capital One, Cloudera, Comcast, Didi Chuxing, Facebook, Google, Handshake, Huawei, IBM, Microsoft, Pineapple Fund, Red Hat, Reprise Software, Target, Tencent, Union Investment, Verizon Media, and Workday. For more information, visit http://apache.org/ and https://twitter.com/TheASF .

© The Apache Software Foundation. "Apache", "DataSketches", "Apache DataSketches", "Druid", "Apache Druid", "Hadoop", "Apache Hadoop", "Hive", "Apache Hive", "Pig", "Apache Pig", "Pinot (incubating)", "Apache Pinot (incubating)", "Spark", "Apache Spark", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.

# # #
