Hadoop Community Meetup @ Beijing, Aug 2019
Hadoop Community Meetup @ Beijing
Author: Junping Du (Tencent) & Wangda Tan (Cloudera)
On Aug 11th 2019, Hadoop developers/users gathered together at Tencent’s Sigma Center office in Beijing to share their latest works, with 12 presentations by engineers from Tencent, Cloudera, Alibaba, Didi, Xiaomi, Meituan, ByteDance (Parent company of TikTok, Toutiao, etc.), JD.com, Huawei. This is also first Hadoop community meetup hosted by Apache Hadoop PMC members.
We received tremendous numbers of participations to the meetup. There’re 200 spots available for registration to attend this meetup in-person, and spots got fully booked in 10 mins. We got 150+ attendees in-person, and 3000+ attendees participated online live sessions.
We have participants from dozens of different companies and universities in person, many of them are flying from Shanghai, Hangzhou, Shenzhen and even San Francisco Bay Area!
1. Hadoop Community Update And Roadmaps
Junping Du @ Tencent and Wangda Tan @ Cloudera talked about Hadoop community updates and roadmaps.
Junping Du @ Tencent
Wangda Tan @ Cloudera
Junping introduced recent trends in the storage field, such as better scalability and moving to cloud. He talked about features like RBF (Router Based Federation), improvements of NameNode scalability, Improvements of cloud connectors and Ozone.
Wangda talked about recent trends in the compute field, such as better scalability, moving to clkoud-native environment, containerization works and support of Machine-Learning use cases. He talked about global scheduling framework for better scheduling throughput and placement quality. Recent containerization works in YARN such as runc, interactive docker shell. And YARN-on-cloud initiatives from community such as autoscaling, graceful decommissions, etc. Wangda also talked about Submarine and its release plans.
At last, Wangda looked back at releases in 2018/2019, and shared tentative release plan of Hadoop in 2019. Such as 3.1.3, 3.2.1 and what’s new coming to 3.3.0.
2. Ozone: Hadoop native object store
Sammi (Yi) Chen @ Tencent talked about native object store project from Hadoop community.
Ozone is a strong-consistent distributed object store service. Like HDFS, Ozone has same level of reliability, consistency and usability. It supports S3 interface, so it is not only useful to on-prem big-data workload. It is also a good option to move big data to cloud.
Sammi talked about architecture of Ozone, and what’s new in Ozone 0.5 release.
3. YARN 3.x in Alibaba
Tao Yang from Alibaba talked about Hadoop use cases in Alibaba. He also talked about how new features in YARN 3.x being used to solve use cases. Tao talked about features like preemption, scheduling, resource over-commitment, scheduling diagnostic, mixed deployment of online/offline workload. Tao also talked about how new features in YARN help to better run Apache Flink on YARN.
Tao talked about many interesting features such as MultiNodeLookupPolicy, which can help schedule jobs on a pluggable node sorter.
4. HDFS Best Practices learned from Didi’s production environment.
Hui Fei from Didi talked about HDFS best practices learned from Didi’s large scale (hundreds of PBs) production environment.
Hui first talked about storage use cases and scale in Didi’s environment. Then Hui talked about functionalities and improvements Didi’s Hadoop team built on top of Hadoop HDFS 2.7.2 such as: Security, NameNode Federation, Balancer, etc.
Hui also talked about the status of upgrading production cluster based on Hadoop 2.7.2 to Hadoop 3.2.0. The primary driver of upgrade is to save storage spaces. Didi wants to use features like Erasure Coding in Hadoop 3.x.
Didi has upgraded a test cluster (100+ nodes) from 2.7.2 to 3.2.0, has a backup cluster with 2k+ nodes run Hadoop 3.1.1 and will rolling upgrade it to 3.2.0. There’s a primary cluster with 10K+ nodes (with 5 namespaces), will start to upgrade to 3.2.0 starting Oct
5. Submarine: A one-stop, cross-platform machine learning platform
Xun Liu @ NetEase and Zhankun Tang @ Cloudera talked about background, existing status and future of Submarine project.
Zhankun Tang @ Cloudera
Xun Liu @ Netease
Machine learning includes many components like data-preprocessing, feature extraction, model training/serving/management, distributed workload management. Submarine project started by Hadoop community is targeted to achieve these goals by focusing on Notebook experiences. With Submarine, data scientists or machine learning engineer don’t need to understand lower-level platform such as YARN, K8s, Docker container.
Zhankun showed a new feature called mini-submarine which allows developers try Submarine locally without installing a YARN cluster.
Xun did demos for:
● Integration of Submarine + Zeppelin notebook.
● New Submarine web UI to allow data scientists to run jobs and manage models, etc. in the unified user experiences.
Xun also talked about companies which are reported using Submarine in production. Such as NetEase, Linkedin, Dahua, Ke.com, JD.com.
6. Hadoop Improvements in Xiaomi
Chen Zhang and Kang Zhou from Xiaomi talked about how Hadoop is being used in Xiaomi. They talked about improvements of HDFS’s performance and scalability; Problems/Solutions when trying to platformize YARN.
For HDFS side, Chen talked about their improvements of HDFS federation, such as lower the business impact when upgrading single NameNode to federated NameNode. They have also improved NameNode Performance, which now allows supporting 600 millions of objects (files + blocks) in a single NameNode.
In YARN, Kang talked about usability improvements in YARN. Such as RMStateStore/History Server, etc. Also, he talked about multi-cluster management tools such as a unified client/RM-UI for multiple clusters. Kang also talked about improvements they have done for scheduling optimization like cache Resource Usage, improvements of utilization and preemption, etc.
7. Key Customizations of YARN @ ByteDance
Yakun Li from ByteDance talked customizations of their YARN cluster to handle extra large scale, multi-clusters environment, Including: utilization improvements, stabilization, optimizations for streaming/model-training environment, and multi datacenter issues, etc.
For scheduling, Yakun also talked about how they implement Gang Scheduling in YARN, which do scheduling for application instead of node. And it can achieve low-latency, hard/soft constraints. He also talked about implementation of multi-thread version FairScheduler which can push number of container allocation per second up to 3k.
In mixed-workloads (Batch, Streaming, ML) deployment part, Yakun talked about they have adopted Docker on YARN support to isolate dependencies. Support CPUSET/NUMA, temporarily skip nodes which have too high physical utilizations, etc. All these efforts can help mixed workload runs well in same cluster.
8. YuniKorn: A New Unified Scheduler for Both YARN and K8s
Weiwei Yang and Wangda Tan from Cloudera talked about their works about a new scheduler named YuniKorn (https://github.com/cloudera/yunikorn-core) and how it can benefit both YARN and K8s community.
Weiwei Yang (Right) and Wangda Tan (Left) from Cloudera
Scheduler of a container orchestration system, such as YARN and Kubernetes, is a critical component that users rely on to plan resources and manage applications. They have different characters to support different workloads.
YARN schedulers are optimized for high-throughput, multi-tenant batch workloads. It can scale up to 50k nodes per cluster, and schedule 20k containers per second; On the other side, Kubernetes schedulers are optimized for long-running services, but many features like hierarchical queues, fairness resource sharing, and preemption etc, are either missing or not mature enough at this point of time.
However, underneath they are responsible for one same job: the decision maker for resource allocations. They mentioned the need to run services on YARN as well as run jobs on Kubernetes. This motivates them to create a universal scheduler which can work for both YARN and Kubernetes and configured in the same way.
In this talk, Weiwei and Wangda talked about their efforts of design and implement the universal scheduler. They have integrated it with to Kubernetes already and YARN integration is working-in-progress. This scheduler brings long-wanted features such as hierarchical queues, fairness between users/jobs/queues, preemption to Kubernetes; and it brings service scheduling enhancements to YARN. Most importantly, it provides the opportunity to let YARN and Kubernetes share the same user experience on scheduling big data workloads. And any improvements of this universal scheduler can benefit both Kubernetes and YARN community.
9. HDFS cluster improvements and optimization practices in Meituan Dianping
Xiaoqiao He from Meituan Dianping talked about Hadoop cluster scalabilities now. Their Hadoop cluster keep growing since 2015. By far, there’re more than 30k nodes in the Hadoop clusters.
He shared many details and practice about the infrastructure of physical deployments, especially on solution for cluster across multiple regions. In the last part, Xiaoqiao shows some practices for optimizing HDFS cluster, such as: improve the Namenode restart process and rebalance for Namenode workload, etc.
10. Evolution of YARN in JD.com
Wanqiang Ji from JD.com talked about how YARN evolves to support JD.com’s business needs.
In the last 3 years, maximum number of nodes in a single YARN cluster scales from 3k, 5k, 10k to 16k. Internally there’re works to balance resources between YARN/K8s cluster. Also there are improvements of container eviction policies to make sure nodes won’t crash or restart when machine’s physical utilization grows above a certain level.
11. Lessons learned from large scale YARN cluster operation @ Tencent
Jun Gong and Dongdong Chen from Tencent talked about their works to support large scale YARN cluster inside tencent.
Gong Jun @ Tencent
Dongdong Chen @ Tencent
Jun and Dongdong shared inside Tencent, they widely used SLS to figure out bottleneck of scheduler, many of the scheduler improvements have contributed back to the community. After optimization, in their production cluster, they have 2k+ queues, 8K+ nodes, 5k+ concurrent jobs. And they can achieve 3k+ container allocations per second, and more than 100 millions container allocations per day.
Also, Jun and Dongdong shared how they uses YARN CGroups parameters to fine-tune CPU/Memory/Network shares for launched YARN containers in a multi-tenant cluster.
12. Run Spark and Hadoop on ARM
Rui Chen and Sheng Liu from Huawei shared their works to run Spark and Hadoop on ARM.
Rui and Sheng shared the motivation of running hadoop and spark on ARM platform which is for high performance and power efficiency. After that, they went ahead to share status of ARM support for hadoop and spark and details of building release Tarball on ARM platform include parameters, and issues. In the last part, they introduced how hadoop/spark release work can make sure proper testing for arm platform and they were building a community called OpenLab to make sure the process more smoothly.
Thanks everyone for contributing this successful event in one way or another, such as following speakers:
Sammi Chen, Jun Gong and Dongdong Chen from Tencent,
Weiwei Yang, Zhankun Tang from Cloudera,
Wanqiang Ji from Jingdong,
Tao Yang from Alibaba,
Chen Zhang and Kang Zhou from Xiaomi,
Hui Fei from Didi,
Rui Chen and Sheng Liu from Huawei,
Xiaoqiao He from Meituan Dianping ,
Yakun Li from ByteDance,
and Xun Liu from Netease.
And especially thanks Chunyu Wang, Summer Xia, Katty Ma for organizing the meetup!