Kafka

Wednesday November 07, 2018

Apache Kafka Supports 200K Partitions Per Cluster

In Kafka, a topic can have multiple partitions to which records are distributed. Partitions are the unit of parallelism. In general, more partitions leads to higher throughput. However, there are some factors that one should consider when having more partitions in a Kafka cluster. I am happy to report that the recent Apache Kafka 1.1.0 release has significantly increased the number of partitions that a single Kafka cluster can support from the deployment and the availability perspective.

To understand the improvement, it’s useful to first revisit some of the basics about partition leaders and the controller. First, each partition can have multiple replicas for higher availability and durability. One of the replicas is designated as the leader and all the client requests are served from this lead replica. Second, one of the brokers in the cluster acts as the controller that manages the whole cluster. If a broker fails, the controller is responsible for selecting new leaders for partitions on the failed broker.

The Kafka broker does a controlled shutdown by default to minimize service disruption to clients. A controlled shutdown has the following steps. (1) A SIG_TERM signal is sent to the broker to be shut down. (2) The broker sends a request to the controller to indicate that it’s about to shut down. (3) The controller then changes the partition leaders on this broker to other brokers and persists that information in ZooKeeper. (4) The controller sends the new leader to other brokers. (5) The controller sends a successful reply to the shutting down broker, which finally terminates its process. At this point, there is no impact to the clients since their traffic has already been moved to other brokers. This process is depicted in Figure 1 below. Note that step (4) and (5) can happen in parallel.


Figure 1. (1) shutdown initiated on broker 1; (2) broker 1 sends controlled shutdown request to controller on broker 0; (3) controller writes new leaders in ZooKeeper; (4) controller sends new leaders to broker 2; (5) controller sends successful reply to broker 1.

Before Kafka 1.1.0, during the controlled shutdown, the controller moves the leaders one partition at a time. For each partition, the controller selects a new leader, writes it to ZooKeeper synchronously and communicates the new leader to other brokers through a remote request. This process has a couple of inefficiencies. First, the synchronous writes to ZooKeeper have higher latency, which slows down the controlled shutdown process. Second, communicating the new leader one partition at a time adds many small remote requests to every broker, which can cause the processing of the new leaders to be delayed.

In Kafka 1.1.0, we made significant improvements in the controller to speed up the controlled shutdown. The first improvement is using the asynchronous API when writing to ZooKeeper. Instead of writing the leader for one partition, waiting for it to complete and then writing another one, the controller submits the leader for multiple partitions to ZooKeeper asynchronously and then waits for them to complete at the end. This allows for request pipelining between the Kafka broker and the ZooKeeper server and reduces the overall latency. The second improvement is that the communication of the new leaders is batched. Instead of one remote request per partition, the controller sends a single remote request with the leader from all affected partitions.

We also made significant improvement in the controller failover time. If the controller goes down, the Kafka cluster automatically elects another broker as the new controller. Before being able to elect the partition leaders, the newly-elected controller has to first reload the state of all partitions in the cluster from ZooKeeper. If the controller has a hard failure, the window in which a partition is unavailable can be as long as the ZooKeeper session expiration time plus the controller state reloading time. So reducing the state reloading time improves the availability in this rare event. Prior to Kafka 1.1.0, the reloading uses the synchronous ZooKeeper API. In Kafka 1.1.0, this is changed to also use the asynchronous API for better latency.

We executed tests to evaluate the performance improvement of the controlled shutdown time and the controller reloading time. For both tests, we set up a 5 node ZooKeeper ensemble on different server racks.

In the first test, we set up a Kafka cluster with 5 brokers on different racks. In that cluster, we created 25,000 topics, each with a single partition and 2 replicas, for a total of 50,000 partitions. So, each broker has 10,000 partitions. We then measured the time to do a controlled shutdown of a broker. The results are shown in the table below.

kafka 1.0.0 kafka 1.1.0
controlled shutdown time 6.5 minutes 3 seconds

A big part of the improvement comes from fixing a logging overhead, which unnecessarily logs all partitions in the cluster every time the leader of a single partition changes. By just fixing the logging overhead, the controlled shutdown time was reduced from 6.5 minutes to 30 seconds. The asynchronous ZooKeeper API change reduced this time further to 3 seconds. These improvements significantly reduce the time to restart a Kafka cluster.

In the second test, we set up another Kafka cluster with 5 brokers and created 2,000 topics, each with 50 partitions and 1 replica. This makes a total of 100,000 partitions in the whole cluster. We then measured the state reloading time of the controller and observed a 100% improvement (the reloading time dropped from 28 seconds in Kafka 1.0.0 to 14 seconds in Kafka 1.1.0).

With those improvements, how many partitions can one expect to support in Kafka? The exact number depends on factors such as the tolerable unavailability window, ZooKeeper latency, broker storage type, etc. As a rule of thumb, we recommend each broker to have up to 4,000 partitions and each cluster to have up to 200,000 partitions. The main reason for the latter cluster-wide limit is to accommodate for the rare event of a hard failure of the controller as we explained earlier. Note that other considerations related to partitions still apply and one may need some additional configuration tuning with more partitions.

The improvement that we made in 1.1.0 is just one step towards our ultimate goal of making Kafka infinitely scalable. We have also made improvements on latency with more partitions in 1.1.0 and will discuss that in a separate blog. In the near future, we plan to make further improvements to support millions of partitions in a Kafka cluster.

The controller improvement work in Kafka 1.1.0 is a true community effort. Over a period of 9 months, people from 6 different organizations helped out and made this happen. Onur Karaman led the original design, did the core implementation and conducted the performance evaluation. Manikumar Reddy, Prasanna Gautam, Ismael Juma, Mickael Maison, Sandor Murakozi, Rajini Sivaram and Ted Yu each contributed some additional implementation or bug fixes. More details can be found in KAFKA-5642 and KAFKA-5027. A big thank you to all the contributors!

Comments:

very informative blog i really like your website it your posts blogs are technical and helpful.

Posted by webdevelopmentcompany on December 28, 2018 at 05:32 PM UTC #

Figure 1. (1) shutdown initiated on broker 1; (2) broker 1 sends controlled shutdown request to controller on broker 0; (3) controller writes new leaders in ZooKeeper; (4) controller sends new leaders to broker 2; (5) controller sends successful reply to broker 1.

Posted by film izle on June 29, 2019 at 08:44 PM UTC #

very informative tool

Posted by nayra on July 11, 2019 at 03:27 PM UTC #

reat neat! nice work wow, so quirky design. Love the colour, and compositions.

Posted by protection de la peau on July 20, 2019 at 04:29 PM UTC #

Apache Kafka has improved tremendously in the number of partitions a single Kafka cluster and can support while preserving high availability and durability.

Posted by https://online-application.org/social-security-administration-office/oklahoma/ on August 02, 2019 at 01:58 AM UTC #

Really impressive. We have used Kafka for functionalities like Continue Watching data processing from players. I think we can get a lot more other things to do with Apache kafka.

Posted by Cider on August 02, 2019 at 12:03 PM UTC #

Thanks for the great post. Really informative for using the Kafka brokers in the list.

Posted by MPL on August 02, 2019 at 06:41 PM UTC #

Figure 1. (1) shutdown initiated on broker 1; (2) broker 1 sends controlled shutdown request to controller on broker 0; (3) controller writes new leaders in ZooKeeper; (4) controller sends new leaders to broker 2; (5) controller sends successful reply to broker 1.

Posted by Apps on August 02, 2019 at 06:42 PM UTC #

With this post you just saved me, the whole point is that I am currently working on my project http://apksnake.com/ and it was precisely this information that I did not have, thanks for the help!

Posted by Apksnake on August 08, 2019 at 08:41 PM UTC #

Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies, and with that, (https://www.junkyarddogonline.com/) wants to learn and have it!

Posted by Brielle Luna on August 14, 2019 at 03:49 PM UTC #

I also agree with this one and this is the best thing to do actually. https://www.teetherpop.com/

Posted by Horea Kaii on August 20, 2019 at 02:20 PM UTC #

Kafka already made a very significant improvement in terms of number of partitions a single Kafka can support. https://www.watersoftenergurus.com/fleck-water-softener/

Posted by Earl Bennett on August 27, 2019 at 07:38 AM UTC #

There is a very significant difference in the controlled shutdown time that's why it operates faster. https://www.assistedonlinefilings.com/

Posted by Roy on September 03, 2019 at 09:55 AM UTC #

This is going to be epic for helping me out. Appreciate the write up.

Posted by ryan on November 06, 2019 at 06:29 PM UTC #

Thank you for the write-up. <a href="https://www.lafavellc.com/">https://www.lafavellc.com/</a>

Posted by James on November 06, 2019 at 06:34 PM UTC #

https://www.sanjosejunkremoval.net/ Thank you for the write up. We use same thing.

Posted by Juan on November 09, 2019 at 11:23 PM UTC #

I couldn't agree more. Very good explanation. https://www.gallerycompany.com/

Posted by Erica on November 10, 2019 at 08:01 PM UTC #

Figure 1. (1) shutdown initiated on broker 1; (2) broker 1 sends controlled shutdown request to the controller on broker 0; (3) controller writes new leaders in ZooKeeper; (4) controller sends new leaders to broker 2; (5) controller sends a successful reply to broker 1.? https://www.bcslocksmiths.com/

Posted by Brian on November 10, 2019 at 08:05 PM UTC #

I like the way these tests were executed https://www.onlinemarketingshark.com/

Posted by Michael on November 11, 2019 at 05:31 PM UTC #

I appreciate for allowing to comment on this subject. http://www.lawncareportorange.com/

Posted by Greg on November 12, 2019 at 08:47 PM UTC #

I think this is an insanely power dialogue. Thank you for sharing. http://www.mysmartinspection.com/

Posted by Shelly on November 13, 2019 at 10:11 PM UTC #

Excellent explanation!! https://www.topazcleaning.com/

Posted by Earnest on November 15, 2019 at 05:04 PM UTC #

I like the split of the partitions. That explains a lot. How does the broker shut down when requested? https://www.glendaleconcretesolutions.com/

Posted by Simon Ellis on November 19, 2019 at 06:05 PM UTC #

I thought that since Kafka was a system optimized for writing using a writer’s name would make sense. I had taken a lot of lit classes in colleague and liked Franz Kafka. Plus the name sounded cool for an OS project. James from https://www.realestateinbudapest.com

Posted by James on November 26, 2019 at 12:09 AM UTC #

Apache Kafka's popularity is exploding. How can I learn more about it? https://www.smarthouselondon.co.uk/smart-lock.html

Posted by Steve Mendalson on November 26, 2019 at 12:12 AM UTC #

What are the most important things you learned from building Apache Kafka? https://www.treeleeds.co.uk

Posted by Jay S on November 26, 2019 at 12:14 AM UTC #

请问一下怎样计算 Kafka 集群最大支持的 partition 数量呢?谢谢 目前我们有一个 kafka 集群(5台,内存128MB,cpu 16 core), 当partition number (每个 topic 有 2个或者3个分区,每条 messege 大约 500KB)达到 36000 个以后,占用的 virt memory 达到 370GB,这不正常。请问有解决的方法吗?扩容?谢谢

Posted by Jeff Yang on November 29, 2019 at 06:00 AM UTC #

I thought that since Kafka was a system optimized for writing using a writer’s name would make sense. I had taken a lot of lit classes in colleague and liked Franz Kafka. Plus the name sounded cool for an OS project.

Posted by Shubham on December 04, 2019 at 06:15 AM UTC #

Post a Comment:
  • HTML Syntax: NOT allowed

Calendar

Search

Hot Blogs (today's hits)

Tag Cloud

Categories

Feeds

Links

Navigation