Apache Hadoop

Thursday August 22, 2019

Hadoop Community Meetup @ Beijing, Aug 2019

Hadoop Community Meetup @ Beijing

Author: Junping Du (Tencent) & Wangda Tan (Cloudera)

Picture01.jpg Picture02.jpg


On Aug 11th 2019, Hadoop developers/users gathered together at Tencent’s Sigma Center office in Beijing to share their latest works, with 12 presentations by engineers from Tencent, Cloudera, Alibaba, Didi, Xiaomi, Meituan, ByteDance (Parent company of TikTok, Toutiao, etc.), JD.com, Huawei. This is also first Hadoop community meetup hosted by Apache Hadoop PMC members.


We received tremendous numbers of participations to the meetup. There’re 200 spots available for registration to attend this meetup in-person, and spots got fully booked in 10 mins. We got 150+ attendees in-person, and 3000+ attendees participated online live sessions.

We have participants from dozens of different companies and universities in person, many of them are flying from Shanghai, Hangzhou, Shenzhen and even San Francisco Bay Area!


1. Hadoop Community Update And Roadmaps

Junping Du @ Tencent and Wangda Tan @ Cloudera talked about Hadoop community updates and roadmaps.


Junping Du @ Tencent



Wangda Tan @ Cloudera

Junping introduced recent trends in the storage field, such as better scalability and moving to cloud. He talked about features like RBF (Router Based Federation), improvements of NameNode scalability, Improvements of cloud connectors and Ozone.

Wangda talked about recent trends in the compute field, such as better scalability, moving to clkoud-native environment, containerization works and support of Machine-Learning use cases. He talked about global scheduling framework for better scheduling throughput and placement quality. Recent containerization works in YARN such as runc, interactive docker shell. And YARN-on-cloud initiatives from community such as autoscaling, graceful decommissions, etc. Wangda also talked about Submarine and its release plans.

At last, Wangda looked back at releases in 2018/2019, and shared tentative release plan of Hadoop in 2019. Such as 3.1.3, 3.2.1 and what’s new coming to 3.3.0.

2. Ozone: Hadoop native object store

Sammi (Yi) Chen @ Tencent talked about native object store project from Hadoop community.


Ozone is a strong-consistent distributed object store service. Like HDFS, Ozone has same level of reliability, consistency and usability. It supports S3 interface, so it is not only useful to on-prem big-data workload. It is also a good option to move big data to cloud.

Sammi talked about architecture of Ozone, and what’s new in Ozone 0.5 release.

3. YARN 3.x in Alibaba

Tao Yang from Alibaba talked about Hadoop use cases in Alibaba. He also talked about how new features in YARN 3.x being used to solve use cases. Tao talked about features like preemption, scheduling, resource over-commitment, scheduling diagnostic, mixed deployment of online/offline workload. Tao also talked about how new features in YARN help to better run Apache Flink on YARN.


Tao talked about many interesting features such as MultiNodeLookupPolicy, which can help schedule jobs on a pluggable node sorter.

4. HDFS Best Practices learned from Didi’s production environment.

Hui Fei from Didi talked about HDFS best practices learned from Didi’s large scale (hundreds of PBs) production environment.

Hui first talked about storage use cases and scale in Didi’s environment. Then Hui talked about functionalities and improvements Didi’s Hadoop team built on top of Hadoop HDFS 2.7.2 such as: Security, NameNode Federation, Balancer, etc.


Hui also talked about the status of upgrading production cluster based on Hadoop 2.7.2 to Hadoop 3.2.0. The primary driver of upgrade is to save storage spaces. Didi wants to use features like Erasure Coding in Hadoop 3.x.

Didi has upgraded a test cluster (100+ nodes) from 2.7.2 to 3.2.0, has a backup cluster with 2k+ nodes run Hadoop 3.1.1 and will rolling upgrade it to 3.2.0. There’s a primary cluster with 10K+ nodes (with 5 namespaces), will start to upgrade to 3.2.0 starting Oct

5. Submarine: A one-stop, cross-platform machine learning platform

Xun Liu @ NetEase and Zhankun Tang @ Cloudera talked about background, existing status and future of Submarine project.


Zhankun Tang @ Cloudera



Xun Liu @ Netease

Machine learning includes many components like data-preprocessing, feature extraction, model training/serving/management, distributed workload management. Submarine project started by Hadoop community is targeted to achieve these goals by focusing on Notebook experiences. With Submarine, data scientists or machine learning engineer don’t need to understand lower-level platform such as YARN, K8s, Docker container.

Zhankun showed a new feature called mini-submarine which allows developers try Submarine locally without installing a YARN cluster.

Xun did demos for:

      Integration of Submarine + Zeppelin notebook.

      New Submarine web UI to allow data scientists to run jobs and manage models, etc. in the unified user experiences.

Xun also talked about companies which are reported using Submarine in production. Such as NetEase, Linkedin, Dahua, Ke.com, JD.com.

6. Hadoop Improvements in Xiaomi

Chen Zhang and Kang Zhou from Xiaomi talked about how Hadoop is being used in Xiaomi. They talked about improvements of HDFS’s performance and scalability; Problems/Solutions when trying to platformize YARN.

For HDFS side, Chen talked about their improvements of HDFS federation, such as lower the business impact when upgrading single NameNode to federated NameNode. They have also improved NameNode Performance, which now allows supporting 600 millions of objects (files + blocks) in a single NameNode.


In YARN, Kang talked about usability improvements in YARN. Such as RMStateStore/History Server, etc. Also, he talked about multi-cluster management tools such as a unified client/RM-UI for multiple clusters. Kang also talked about improvements they have done for scheduling optimization like cache Resource Usage, improvements of utilization and preemption, etc.

7. Key Customizations of YARN @ ByteDance

Yakun Li from ByteDance talked customizations of their YARN cluster to handle extra large scale, multi-clusters environment, Including: utilization improvements, stabilization, optimizations for streaming/model-training environment, and multi datacenter issues, etc.


For scheduling, Yakun also talked about how they implement Gang Scheduling in YARN, which do scheduling for application instead of node. And it can achieve low-latency, hard/soft constraints. He also talked about implementation of multi-thread version FairScheduler which can push number of container allocation per second up to 3k.

In mixed-workloads (Batch, Streaming, ML) deployment part, Yakun talked about they have adopted Docker on YARN support to isolate dependencies. Support CPUSET/NUMA, temporarily skip nodes which have too high physical utilizations, etc. All these efforts can help mixed workload runs well in same cluster.

8. YuniKorn: A New Unified Scheduler for Both YARN and K8s

Weiwei Yang and Wangda Tan from Cloudera talked about their works about a new scheduler named YuniKorn (https://github.com/cloudera/yunikorn-core) and how it can benefit both YARN and K8s community.


Weiwei Yang (Right) and Wangda Tan (Left) from Cloudera

Scheduler of a container orchestration system, such as YARN and Kubernetes, is a critical component that users rely on to plan resources and manage applications. They have different characters to support different workloads.

YARN schedulers are optimized for high-throughput, multi-tenant batch workloads. It can scale up to 50k nodes per cluster, and schedule 20k containers per second; On the other side, Kubernetes schedulers are optimized for long-running services, but many features like hierarchical queues, fairness resource sharing, and preemption etc, are either missing or not mature enough at this point of time.

However, underneath they are responsible for one same job: the decision maker for resource allocations. They mentioned the need to run services on YARN as well as run jobs on Kubernetes. This motivates them to create a universal scheduler which can work for both YARN and Kubernetes and configured in the same way.

In this talk, Weiwei and Wangda talked about their efforts of design and implement the universal scheduler. They have integrated it with to Kubernetes already and YARN integration is working-in-progress. This scheduler brings long-wanted features such as hierarchical queues, fairness between users/jobs/queues, preemption to Kubernetes; and it brings service scheduling enhancements to YARN. Most importantly, it provides the opportunity to let YARN and Kubernetes share the same user experience on scheduling big data workloads. And any improvements of this universal scheduler can benefit both Kubernetes and YARN community.

9. HDFS cluster improvements and optimization practices in Meituan Dianping

Xiaoqiao He from Meituan Dianping talked about Hadoop cluster scalabilities now. Their Hadoop cluster keep growing since 2015. By far, there’re more than 30k nodes in the Hadoop clusters.


He shared many details and practice about the infrastructure of physical deployments, especially on solution for cluster across multiple regions. In the last part, Xiaoqiao shows some practices for optimizing HDFS cluster, such as: improve the Namenode restart process and rebalance for Namenode workload, etc.

10. Evolution of YARN in JD.com

Wanqiang Ji from JD.com talked about how YARN evolves to support JD.com’s business needs.


In the last 3 years, maximum number of nodes in a single YARN cluster scales from 3k, 5k, 10k to 16k. Internally there’re works to balance resources between YARN/K8s cluster. Also there are improvements of container eviction policies to make sure nodes won’t crash or restart when machine’s physical utilization grows above a certain level.

11. Lessons learned from large scale YARN cluster operation @ Tencent

Jun Gong and Dongdong Chen from Tencent talked about their works to support large scale YARN cluster inside tencent.


Gong Jun @ Tencent


Dongdong Chen @ Tencent

Jun and Dongdong shared inside Tencent, they widely used SLS to figure out bottleneck of scheduler, many of the scheduler improvements have contributed back to the community. After optimization, in their production cluster, they have 2k+ queues, 8K+ nodes, 5k+ concurrent jobs. And they can achieve 3k+ container allocations per second, and more than 100 millions container allocations per day.

Also, Jun and Dongdong shared how they uses YARN CGroups parameters to fine-tune CPU/Memory/Network shares for launched YARN containers in a multi-tenant cluster.

12. Run Spark and Hadoop on ARM

Rui Chen and Sheng Liu from Huawei shared their works to run Spark and Hadoop on ARM.


Rui Chen



Sheng Liu


Rui and Sheng shared the motivation of running hadoop and spark on ARM platform which is for high performance and power efficiency. After that, they went ahead to share status of ARM support for hadoop and spark and details of building release Tarball on ARM platform include parameters, and issues. In the last part, they introduced how hadoop/spark release work can make sure proper testing for arm platform and they were building a community called OpenLab to make sure the process more smoothly.



Thanks everyone for contributing this successful event in one way or another, such as following speakers:

Sammi Chen, Jun Gong and Dongdong Chen from Tencent,

Weiwei Yang, Zhankun Tang from Cloudera,

Wanqiang Ji from Jingdong,

Tao Yang from Alibaba,

Chen Zhang and Kang Zhou from Xiaomi,

Hui Fei from Didi,

Rui Chen and Sheng Liu from Huawei,

Xiaoqiao He from Meituan Dianping ,

Yakun Li from ByteDance,

and Xun Liu from Netease.

And especially thanks Chunyu Wang, Summer Xia, Katty Ma for organizing the meetup!

Thursday August 09, 2018

[ANNOUNCE] Apache Hadoop 3.1.1 release

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 3.1.1.

Hadoop 3.1.1 is the first stable maintenance release for the year 2018 in the Hadoop-3.1 line, and brings a number of enhancements.


3.1.1 is the first stable release of 3.1 line which is prod-ready.

The Hadoop community fixed 435 JIRAs [1] in total as part of the 3.1.1 release. Of these fixes:

  • 60 in Hadoop Common
  • 139 in HDFS
  • 223 in YARN
  • 13 in MapReduce
  • Apache Hadoop 3.1.1 contains a number of significant features and enhancements. A few of them are noted below.

  • ENTRY_POINT support for Docker containers.
  • Restart policy support for YARN native services.
  • Capacity Scheduler: Intra-queue preemption for fairness ordering policy.
  • Stabilization works for scheduler / YARN service / docker support, etc.
  • Please see the Hadoop 3.1.1 CHANGES) for the detailed list of issues resolved. The release news is posted on the Apache Hadoop website too, you can go to the downloads section.

    Many thanks to everyone who contributed to the release, and everyone in the Apache Hadoop community! The release is a result of direct and indirect efforts from many contributors, listed below are the those who contributed directly by submitting patches and reporting issues.

    Abhishek Modi, Ajay Kumar, Akhil PB, Akira Ajisaka, Allen Wittenauer, Anbang Hu, Andrew Wang, Arpit Agarwal, Atul Sikaria, BELUGA BEHR, Bharat Viswanadham, Bibin A Chundatt, Billie Rinaldi, Bilwa S T, Botong Huang, Brahma Reddy Battula, Brook Zhou, CR Hota, Chandni Singh, Chao Sun, Charan Hebri, Chen Liang, Chetna Chaudhari, Chun Chen, Daniel Templeton, Davide Vergari, Dennis Huo, Dibyendu Karmakar, Ekanth Sethuramalingam, Eric Badger, Eric Yang, Erik Krogen, Esfandiar Manii, Ewan Higgs, Gabor Bota, Gang Li, Gang Xie, Genmao Yu, Gergely Novák, Gergo Repas, Giovanni Matteo Fumarola, Gour Saha, Greg Senia, Haibo Yan, Hanisha Koneru, Hsin-Liang Huang, Hu Ziqian, Istvan Fajth, Jack Bearden, Jason Lowe, Jeff Zhang, Jian He, Jianchao Jia, Jiandan Yang , Jim Brennan, Jinglun, John Zhuge, Joseph Fourny, K G Bakthavachalam, Karthik Palanisamy, Kihwal Lee, Kitti Nanasi, Konstantin Shvachko, Lei (Eddy) Xu, LiXin Ge, Lokesh Jain, Lukas Majercak, Miklos Szegedi, Mukul Kumar Singh, Namit Maheshwari, Nanda kumar, Nilotpal Nandi, Pavel Avgustinov, Prabhu Joseph, Prasanth Jayachandran, Robert Kanter, Rohith Sharma K S, Rushabh S Shah, Sailesh Patel, Sammi Chen, Sean Mackrory, Sergey Shelukhin, Shane Kumpf, Shashikant Banerjee, Siyao Meng, Sreenath Somarajapuram, Steve Loughran, Suma Shivaprasad, Sumana Sathish, Sunil Govindan, Surendra Singh Lilhore, Szilard Nemeth, Takanobu Asanuma, Tao Jie, Tao Yang, Ted Yu, Thomas Graves, Thomas Marquardt, Todd Lipcon, Vinod Kumar Vavilapalli, Wangda Tan, Wei Yan, Wei-Chiu Chuang, Weiwei Yang, Wilfred Spiegelenburg, Xiao Chen, Xiao Liang, Xintong Song, Xuan Gong, Yang Wang, Yesha Vora, Yiqun Lin, Yiran Wu, Yongjun Zhang, Yuanbo Liu, Zian Chen, Zoltan Haindrich, Zsolt Venczel, Zuoming Zhang, fang zhenyi, john lilley, jwhitter, kyungwan nam, liaoyuxiangqin, liuhongtong, lujie, skrho, yanghuafeng, yimeng, Íñigo Goiri.

    [1] JIRA query: project in (YARN, HADOOP, MAPREDUCE, HDFS) AND resolution = Fixed AND fixVersion = 3.1.1 ORDER BY key ASC, updated ASC, created DESC, priority DESC

    Friday April 06, 2018

    [ANNOUNCE] Apache Hadoop 3.1.0 release

    It gives us great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 3.1.0.

    Hadoop 3.1.0 is the first minor release for the year 2018 in the Hadoop-3.x line, and brings a number of enhancements.


    This release is *not* yet ready for production use. Critical issues are being ironed out via testing and downstream adoption. Production users should wait for a 3.1.1/3.1.2 release.

    The Hadoop community fixed 768 JIRAs (https://s.apache.org/apache-hadoop-3.1.0-all-tickets) in total as part of the 3.1.0 release. Of these fixes:

    • 141 in Hadoop Common
    • 266 in HDFS
    • 329 in YARN
    • 32 in MapReduce

    Apache Hadoop 3.1.0 contains a number of significant features and enhancements. A few of them are noted below.

    Hadoop Common

    • HADOOP-14831 / HADOOP-14531 / HADOOP-14825 / HADOOP-14325. S3/S3A/S3Guard related improvements.

    Hadoop HDFS

    • HDFS-9806 - HDFS block replicas to be provided by an external storage system

    Hadoop YARN

    • YARN-6223. First class GPU support on YARN
    • YARN-5983. First class FPGA support on YARN
    • YARN-5079 / YARN-4793 / YARN-4757 / YARN-6419. YARN native service support
    • YARN-6592. Rich placement constraints in YARN
    • YARN-5881. Enable configuration of queue capacity in terms of absolute resources for Capacity Scheduler.
    • YARN-7117. Capacity Scheduler: Support auto-creation of leaf queues while doing queue mapping

    Please see the Hadoop 3.1.0 CHANGES for the detailed list of issues resolved. The release news is posted on the Apache Hadoop website too, you can go to the downloads section directly

    Many thanks to everyone who contributed to the release, and everyone in the Apache Hadoop community! The release is a result of direct and indirect efforts from many contributors, listed below are the those who contributed directly by submitting patches.

    Aaron Fabbri, Aaron Gresch, Abdullah Yousufi, Ajay Kumar, Ajith S, Akhil PB, Aki Tanaka, Akira Ajisaka, Alessandro Andrioni, Allen Wittenauer, Andras Bokor, Andrew Wang, Anis Elleuch, Anu Engineer, Arpit Agarwal, Arun Suresh, Atul Sikaria, BELUGA BEHR, Bharat Viswanadham, Bibin A Chunda, Billie Rinaldi, Botong Huang, Brahma Reddy Battula, Carlo Curino, Chandni Singh, Chang Li, Chao Sun, Charan Hebri, Chen Hongfei, Chen Liang, Chetna Chaudhari, Chris Douglas, Da Ding, Daniel Templeton, Daryn Sharp, Devaraj K, Dmitry Chuyko, Ekanth S, Elek, Marton, Ellen Hui, Eric Badger, Eric Payne, Eric Yang, Erik Krogen, Esther Kundin, Ewan Higgs, Gabor Bota, Genmao Yu, Gera Shegalov, Gergely Novák, Gergo Repas, Gour Saha, Greg Phillips, Grigori Rybkine, Haibo Chen, Haibo Yan, Hanisha Koneru, He Xiaoqiao, Igor Dvorzhak, Jack Bearden, Jason Lowe, Jian He, Jiandan Yang, Jianfei Jiang, Jim Brennan, Jinjiang Ling, Joe McDonnell, Johan Gustavsson, Johannes Alberti, John Zhuge, Jonathan Eagles, Jonathan Hung, Jongyoul Lee, Juan Rodríguez Hortalá, Junping Du, Kai Sasaki, Kannapiran Srinivasan, Karthik Kambatla, Karthik Palaniappan, Karthik Palanisamy, Keqiu Hu, Kihwal Lee, Konstantin Shvachko, Konstantinos Karanasos, Kuhu Shukla, Lei (Eddy) Xu, LiXin Ge, Lokesh Jain, Lukas Majercak, Manikandan R, Manoj Govindassamy, Masahiro Tanaka, Mikhail Erofeev, Miklos Szegedi, Ming Ma, Mukul Kumar Singh, Nanda kumar, Nandor Kollar, Nishant Bangarwa, Oleg Danilov, Panagiotis Garefalakis, PandaMonkey, Peter Bacsko, Ping Liu, Prabhu Joseph, Rahul Pathak, Rajesh Balamohan, Ray Chiang, Remi Catherino, Robert Kanter, Rohith Sharma K S, Rui Li, Rushabh S Shah, SammiChen, Sampada Dehankar, Santhosh G Nayak, Sean Mackrory, Sean Po, Sen Zhao, Shane Kumpf, Sharad Sonker, Shashikant Banerjee, Shawna Martell, Sivaguru Sankaridurg, Soumabrata Chakraborty, Stephen O'Donnell, Steve Loughran, Steven Rand, Subru Krishnan, Suma Shivaprasad, Sunil G, Surendra Singh Lilhore, Szilard Nemeth, Takanobu Asanuma, Tanuj Nayak, Tao Yang, Tarun Parimi, Thejas M Nair, Thomas Marquard, Todd Lipcon, Tsz Wo Nicholas Sze, Varada Hemeswari, Varun Saxena, Varun Vasudev, Vasudevan Skm, Vinayakumar B, Vipin Rathor, Virajith Jalaparti, Vishwajeet Dusane, Wangda Tan, Wei Yan, Weiwei Wu, Weiwei Yang, Wilfred Spiegelenburg, Xiao Chen, Xiao Liang, Xiaoyu Yao, Xuan Gong, Yang Wang, Yeliang Cang, Yesha Vora, Yiqun Lin, Yonger, Yongjun Zhang, Young Chen, Yu-Tang Lin, Yufei Gu, Zhankun Tang, Zian Chen, Zsolt Venczel, chencan, fang zhenyi, hu xiaodong, kartheek muthyala, liaoyuxiangqin, ligongyi, liuhongtong, lovekesh bansal, lufei, lujie, maobaolong, shanyu zhao, tartarus, usharani, wangzhiyuan, wujinhu, Íñigo Goiri.


    Wangda Tan and Vinod Kumar Vavilapalli

    Tuesday December 12, 2017

    Welcome to Apache™ Hadoop®!

    The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.



    Hot Blogs (today's hits)

    Tag Cloud