Announcing the release of Apache Samza 0.11.0
We are excited to announce that Apache Samza 0.11.0 has been released.
Samza is a stable and mature stream processing framework that has been powering real-time applications in production at various companies for a few years now. Samza has industry-leading support for stateful stream processing, with cutting-edge features such as:
- Support for RocksDB based local state.
- Incremental state checkpointing: This feature is unique compared to existing stream processing frameworks and allows Samza to support applications with large state very elegantly.
- Minimal impact during application upgrades by minimizing state movement.
The 0.11.0 release packs in several large improvements in runtime performance, operational stability, and ease of use. Some of the key highlights include:
- Asynchronous API and processing (SAMZA-863, doc): Prior to this release, Samza only supported a synchronous, single-threaded processing model. Increasing parallelism meant increasing the number of containers (processes), which required a lot more memory. This inefficiency was most obvious for applications that make remote calls to external services or databases. With this new feature, an application can increase parallelism very efficiently within a single container (process). In addition to a parallel processing model, we now also support a purely asynchronous processing model, which makes remote I/O a lot more efficient. Without this support, Samza applications that wanted to process messages asynchronously would also have had to handle the additional complexity of managing checkpointing themselves (by disabling auto-checkpointing in Samza). With the new support for async processing, Samza makes sure that checkpointing is automatically performed only after the async calls have completed.
- Separate Samza framework deployment from user jobs (SAMZA-849, doc): Typically in a large organization the team that manages the Samza cluster is not the same as the teams that are running applications on top of Samza. This feature allows upgrading the Samza framework without forcing developers to explicitly upgrade their running applications. With simple config changes, it supports canary, upgrade and rollback scenarios commonly required in organizations that run tens or hundreds of jobs.
- Samza REST API (SAMZA-865, doc): The REST API provides a rich set of operations for users to interact with their running jobs. It allows you to start, stop, and list jobs, and also to run periodic monitoring scripts. The API can be integrated with deployment tooling and job dashboards for better job management.
- Disk monitoring (SAMZA-924): A Samza YARN cluster is used to run several stream processing applications on a shared set of physical machines. In such a multi-tenant environment it is critical to have some limits on the amount of disk space used by each job, especially to store application state. This feature introduces the measurement of the disk usage for selected job directories. The disk space usage will be gathered periodically and reported to Samza metrics. In the next release this feature will be extended to also enforce the disk quotas.
- New metrics to troubleshoot and monitor performance issues: SAMZA-972 added holistic monitoring of memory in Samza applications. With SAMZA-963 we added the ability to troubleshoot performance issues better by isolating the time spent in the application from the time spent in accessing state.
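To illustrate the checkpointing guarantee behind the asynchronous processing model described in the highlights, here is a minimal sketch in plain Java. It is not Samza's API: the class `AsyncCheckpointSketch` and its methods `dispatch` and `safeCheckpointOffset` are hypothetical names, and the remote call is modelled with a `CompletableFuture`. The idea it shows is that the offset considered safe to checkpoint never moves past a message whose async call is still outstanding.

```java
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;

// Illustrative only: a hypothetical tracker, not part of Samza.
class AsyncCheckpointSketch {
    // Offsets whose async call has started but has not yet succeeded.
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private long highestDispatched = -1L;

    // Registers a message offset together with the future of its remote call.
    synchronized void dispatch(long offset, CompletableFuture<?> remoteCall) {
        inFlight.add(offset);
        highestDispatched = Math.max(highestDispatched, offset);
        // thenRun fires only on successful completion, so a failed call
        // keeps its offset in flight and holds the checkpoint back.
        remoteCall.thenRun(() -> markDone(offset));
    }

    private synchronized void markDone(long offset) {
        inFlight.remove(offset);
    }

    // Every dispatched offset at or below the returned value has completed.
    synchronized long safeCheckpointOffset() {
        return inFlight.isEmpty() ? highestDispatched : inFlight.first() - 1;
    }

    public static void main(String[] args) {
        AsyncCheckpointSketch tracker = new AsyncCheckpointSketch();
        CompletableFuture<String> slow = new CompletableFuture<>();
        CompletableFuture<String> fast = new CompletableFuture<>();
        tracker.dispatch(0L, slow);
        tracker.dispatch(1L, fast);

        fast.complete("ok"); // message 1 finishes first
        System.out.println(tracker.safeCheckpointOffset()); // -1: message 0 is still outstanding

        slow.complete("ok");
        System.out.println(tracker.safeCheckpointOffset()); // 1: both calls are done
    }
}
```

Tracking the smallest in-flight offset is one simple way to get this behaviour when calls complete out of order; the framework's actual bookkeeping may differ.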
A total of 62 contributors have contributed to the Samza Project so far. In this release 21,473 lines of code were added/changed.
With this release we also add 3 new committers to the Apache Samza community.
Recent Community Activities
There has been a lot of activity from the community during this release time frame. Here are links to some of it.
- Stream processing Meetup @ LinkedIn
- Detailed list of links to other presentations can be found here
There are many more exciting features to expect in our future releases. Some of them are:
- Samza operators API (SAMZA-914)
- HDFS system consumer (SAMZA-967)
- Support for standalone Samza jobs (SAMZA-516)
- Disk quotas enforcement (SAMZA-956)
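Quota enforcement builds on the disk usage measurement that this release introduces. As a rough illustration of that measurement, the sketch below periodically sums the file sizes under a directory and stores the result in a gauge. It is plain Java rather than Samza code: `DiskUsageSketch`, `measure`, and the `AtomicLong` gauge are hypothetical stand-ins for the framework's own monitor and metrics, and the polling interval is an arbitrary choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Illustrative only: a hypothetical monitor, not part of Samza.
class DiskUsageSketch {
    // Stand-in for a metrics gauge holding the latest measurement.
    static final AtomicLong diskUsageBytes = new AtomicLong();

    // Sums the sizes of all regular files under dir, including subdirectories.
    static long measure(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                    .mapToLong(DiskUsageSketch::sizeOf)
                    .sum();
        }
    }

    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // A temporary directory plays the role of a job's state directory.
        Path storeDir = Files.createTempDirectory("store");
        Files.write(storeDir.resolve("state.bin"), new byte[1024]);

        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(() -> {
            try {
                diskUsageBytes.set(measure(storeDir));
            } catch (IOException | UncheckedIOException e) {
                // Keep the previous value if a poll fails.
            }
        }, 0, 200, TimeUnit.MILLISECONDS);

        Thread.sleep(500); // let a few polls run
        poller.shutdownNow();
        System.out.println("disk usage bytes: " + diskUsageBytes.get());
    }
}
```

A quota check would then compare the gauge against a configured limit; how Samza will act on a violation is left to SAMZA-956.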
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
Announcing the release of Apache Samza 0.10.1
I am excited to announce that Apache Samza 0.10.1 has been released. This is our fourth release as an Apache Top-level Project!
Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It was originally created at LinkedIn, where it continues to be used in production. The project is under active development with contributions from a diverse group of contributors and committers. Samza is also used in production by many companies in the industry, such as Netflix, Uber, and TripAdvisor (see PoweredBy).
Overall, 72 JIRAs were resolved in this release. This is a minor release consisting of bug fixes and robustness improvements to features such as the coordinator stream and host affinity. Samza continues to require Java 1.7+ and YARN 2.6.1+.
A few notable enhancements are:
- Support static partition assignment in ProcessJobFactory (SAMZA-41)
- Slow start of Samza jobs with large number of containers (SAMZA-843)
- Change log not working properly with In memory Store (SAMZA-889)
- Refactor and fix Container allocation logic (SAMZA-866)
- Detect partition count changes in input streams (SAMZA-882)
- Host Affinity - State restore doesn't work if the previous shutdown was uncontrolled (continuous offset) (SAMZA-905)
- Broadcast stream is not added properly in the prioritized tiers in the DefaultChooser (SAMZA-944)
- Improve the performance of the continuous OFFSET checkpointing for logged stores (SAMZA-964)
- Host Affinity - Minimize task reassignment when container count changes (SAMZA-906)
- Improve event loop timing metrics (SAMZA-951)
- Avoid unnecessary flushes in CachedStore (SAMZA-873)
- Incompatible change in Kafka producer that does not honor custom partitioners (SAMZA-839)
We've also made a lot of community progress during this release:
- We had 2 successful meetups - one in February and the other in June. The upcoming meetup is scheduled for August 23.
- Apache Samza was presented at the Apache Big Data (North America) conference in May 2016 and at the Hadoop Summit in June 2016. Check out the content here.
- Samza papers were also accepted at notable academic conferences:
  - SamzaSQL: Scalable Fast Data Management with Streaming SQL, presented at the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) in May 2016
  - Effective Multi-stream Joining in Apache Samza Framework, at the 5th IEEE International Congress on Big Data, June 27 - July 2, 2016, San Francisco, USA
- 380 emails sent to the developer mailing list in the past 3 months
Here are some of the features planned for upcoming releases:
- Support multi-threading in samza tasks (SAMZA-863)
- Disk Quotas: Add throttler and disk quota enforcement (SAMZA-956)
- REST API for starting and stopping Samza jobs (SAMZA-865)
- Samza standalone mode (SAMZA-516)
- High-level language for Samza (SAMZA-390)
It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs. Also, don’t miss out on the upcoming meetup on August 23. Sign up now!
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.