The Apache Software Foundation Blog
The Apache Software Foundation Announces Apache® Hudi™ as a Top-Level Project
Open Source data lake technology for stream processing on top of Apache Hadoop in use at Alibaba, Tencent, Uber, and more.
Wakefield, MA —4 June 2020— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today Apache® Hudi™ as a Top-Level Project (TLP).
Apache Hudi (Hadoop Upserts Deletes and Incrementals) data lake technology enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. The project was originally developed at Uber in 2016 (code-named and pronounced "Hoodie"), open-sourced in 2017, and submitted to the Apache Incubator in January 2019.
"Learning and growing the Apache way in the incubator was a rewarding experience," said Vinoth Chandar, Vice President of Apache Hudi. "As a community, we are humbled by how far we have advanced the project together, while at the same time, excited about the challenges ahead."
Apache Hudi is used to manage petabyte-scale data lakes using stream processing primitives like upserts and incremental change streams on Apache Hadoop Distributed File System (HDFS) or cloud stores. Hudi data lakes provide fresh data while being an order of magnitude more efficient than traditional batch processing. Features include:
- Upsert/Delete support with fast, pluggable indexing
- Transactionally commit/rollback data
- Change capture from Hudi tables for stream processing
- Support for Apache Hive, Apache Spark, Apache Impala and Presto query engines
- Built-in data ingestion tool supporting Apache Kafka, Apache Sqoop and other common data sources
- Optimize query performance by managing file sizes and storage layout
- Fast row-based ingestion format with async compaction into columnar format
- Timeline metadata for audit tracking
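To make the upsert, delete, and incremental change-stream primitives listed above concrete, here is a toy in-memory Python sketch. It does not use Hudi and does not model its storage format or indexing; the `ToyTable` class, its methods, and the `id` and `fare` fields are all hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ToyTable:
    """Illustrative in-memory table keyed by record key. NOT Hudi's API."""

    # record key -> (id of the commit that last changed it, row or None if deleted)
    rows: Dict[str, Tuple[int, Optional[Dict[str, Any]]]] = field(default_factory=dict)
    # ordered ids of completed commits (a stand-in for timeline metadata)
    timeline: List[int] = field(default_factory=list)

    def commit(self, upserts=(), deletes=()) -> int:
        """Apply a batch of upserts and deletes as one commit and return its id."""
        commit_id = len(self.timeline) + 1
        staged = dict(self.rows)  # work on a copy so a failure leaves rows untouched
        for row in upserts:
            staged[row["id"]] = (commit_id, row)  # insert new key or overwrite old row
        for key in deletes:
            staged[key] = (commit_id, None)  # keep a marker so the delete is visible
        self.rows = staged
        self.timeline.append(commit_id)
        return commit_id

    def snapshot(self) -> List[Dict[str, Any]]:
        """Latest view of the table: live rows only, ordered by key."""
        return [row for _, (_, row) in sorted(self.rows.items()) if row is not None]

    def incremental(self, since: int) -> List[Tuple[str, Optional[Dict[str, Any]]]]:
        """Keys and rows changed after commit `since`; None marks a delete."""
        return [(key, row) for key, (cid, row) in sorted(self.rows.items()) if cid > since]


table = ToyTable()
c1 = table.commit(upserts=[{"id": "a", "fare": 10}, {"id": "b", "fare": 20}])
c2 = table.commit(upserts=[{"id": "a", "fare": 12}], deletes=["b"])

print(table.snapshot())             # current state after both commits
print(table.incremental(since=c1))  # only what changed in the second commit
```

A downstream job that remembers the last commit it processed can call `incremental` with that id and handle only the changed records instead of rescanning the whole table, which is the idea behind the efficiency claim relative to batch processing.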
Apache Hudi is in use at organizations such as Alibaba Group, EMIS Health, Linknovate, Tathastu.AI, Tencent, and Uber, and is supported as part of Amazon EMR by Amazon Web Services. A partial list of those deploying Hudi is available at https://hudi.apache.org/docs/powered_by.html
"We are very pleased to see Apache Hudi graduate to an Apache Top-Level Project. Apache Hudi is supported in Amazon EMR release 5.28 and higher, and enables customers with data in Amazon S3 data lakes to perform record-level inserts, updates, and deletes for privacy regulations, change data capture (CDC), and simplified data pipeline development," said Rahul Pathak, General Manager, Analytics, AWS. "We look forward to working with our customers and the Apache Hudi community to help advance the project."
"At Uber, Hudi powers one of the largest transactional data lakes on the planet in near real time to provide meaningful experiences to users worldwide," said Nishith Agarwal, member of the Apache Hudi Project Management Committee. "With over 150 petabytes of data and more than 500 billion records ingested per day, Uber’s use cases range from business critical workflows to analytics and machine learning."
"Using Apache Hudi, end-users can handle either read-heavy or write-heavy use cases, and Hudi will manage the underlying data stored on HDFS/COS/CHDFS using Apache Parquet and Apache Avro," said Felix Zheng, Lead of Cloud Real-Time Computing Service Technology at Tencent.
"As cloud infrastructure becomes more sophisticated, data analysis and computing solutions gradually begin to build data lake platforms based on cloud object storage and computing resources," said Li Wei, Technical Lead on Data Lake Analytics, at Alibaba Cloud. "Apache Hudi is a very good incremental storage engine that helps users manage the data in the data lake in an open way and accelerate users' computing and analysis."
"Apache Hudi is a key building block for the Hopsworks Feature Store, providing versioned features, incremental and atomic updates to features, and indexed time-travel queries for features," said Jim Dowling, CEO/Co-Founder at Logical Clocks. "The graduation of Hudi to a top-level Apache project is also the graduation of the open-source data lake from its earlier data swamp incarnation to a modern ACID-enabled, enterprise-ready data platform."
"Hudi's graduation to a top-level Apache project is a result of the efforts of many dedicated contributors in the Hudi community," said Jennifer Anderson, Senior Director of Platform Engineering at Uber. "Hudi is critical to the performance and scalability of Uber's big data infrastructure. We're excited to see it gain traction and achieve this major milestone."
"Thus far, Hudi has started a meaningful discussion in the industry about the wide gaps between data warehouses and data lakes. We have also taken strides to bridge some of them, with the help of the Apache community," added Chandar. "But, we are only getting started with our deeply technical roadmap. We certainly look forward to a lot more contributions and collaborations from the community to get there. Everyone’s invited!"
Catch Apache Hudi in action at Virtual Berlin Buzzwords, 7-12 June 2020, as well as at meetups and other events.
Availability and Oversight
Apache Hudi software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Hudi, visit http://hudi.apache.org/ and https://twitter.com/apachehudi
About the Apache Incubator
The Apache Incubator is the primary entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects enter the ASF through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. For more information, visit http://incubator.apache.org/
About The Apache Software Foundation (ASF)
Established in 1999, The Apache Software Foundation (ASF) is the world’s largest Open Source foundation, stewarding 200M+ lines of code and providing more than $20B worth of software to the public at no cost. The ASF’s all-volunteer community grew from 21 original founders overseeing the Apache HTTP Server to 765 individual Members and 206 Project Management Committees who successfully lead 350+ Apache projects and initiatives in collaboration with 7,600 Committers through the ASF’s meritocratic process known as "The Apache Way". Apache software is integral to nearly every end user computing device, from laptops to tablets to mobile devices across enterprises and mission-critical applications. Apache projects power most of the Internet, manage exabytes of data, execute teraflops of operations, and store billions of objects in virtually every industry. The commercially-friendly and permissive Apache License v2 is an Open Source industry standard, helping launch billion dollar corporations and benefiting countless users worldwide. The ASF is a US 501(c)(3) not-for-profit charitable organization funded by individual donations and corporate sponsors including Aetna, Alibaba Cloud Computing, Amazon Web Services, Anonymous, Baidu, Bloomberg, Budget Direct, Capital One, CarGurus, Cerner, Cloudera, Comcast, Facebook, Google, Handshake, Huawei, IBM, Indeed, Inspur, Leaseweb, Microsoft, Pineapple Fund, Red Hat, Target, Tencent, Union Investment, Verizon Media, and Workday. For more information, visit http://apache.org/ and https://twitter.com/TheASF
© The Apache Software Foundation. "Apache", "Hudi", "Apache Hudi", "Hadoop", "Apache Hadoop", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.
# # #
Posted at 01:00PM Jun 04, 2020 by Sally in General