Apache Samza

Wednesday February 22, 2017

Announcing the release of Apache Samza 0.12.0

We are excited to announce that Apache Samza 0.12.0 has been released.

Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for a few years now. Samza provides leading support for large-scale stateful stream processing with features such as:

  • First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single SSD based machine.
  • Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
  • Minimal impact during application maintenance.
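To make the difference between incremental checkpointing and full snapshots concrete, here is a minimal Python sketch. It is not Samza's implementation or API: `ChangelogStore`, `incremental_checkpoint`, `full_snapshot` and `restore` are hypothetical names chosen for illustration, and an in-memory list stands in for a durable changelog. The point is only that each checkpoint persists the writes made since the previous one, so its cost tracks the amount of change rather than the total size of the state.

```python
class ChangelogStore:
    """Toy key-value store that appends every write to a changelog (illustrative only)."""

    def __init__(self):
        self.data = {}
        self.changelog = []   # stands in for a durable, append-only log
        self.flushed = 0      # number of changelog entries already made durable

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))

    def incremental_checkpoint(self):
        """Return only the entries written since the previous checkpoint."""
        pending = self.changelog[self.flushed:]
        self.flushed = len(self.changelog)
        return pending

    def full_snapshot(self):
        """Return a copy of the entire state, however large it is."""
        return dict(self.data)


def restore(entries):
    """Rebuild the state by replaying changelog entries in order."""
    state = {}
    for key, value in entries:
        state[key] = value
    return state
```

With 1,000 keys in the store and a single update since the last checkpoint, `incremental_checkpoint()` returns one entry while `full_snapshot()` still copies all 1,000.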
In addition to general stream processing capabilities, Samza also supports:
  • A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and outputs (HDFS, Kafka, ElastiCache etc.). This allows applications to directly process data from various event sources without mandating that the data should be moved into Kafka.
  • A fully async programming model. This allows applications that make remote calls to increase parallelism very efficiently.
  • Features like canaries, upgrades and rollbacks that support extremely large deployments.
This 0.12.0 release adds several features to Samza to improve stability, performance and ease of use. Here are some highlights of this release.

Convergence of Batch and Real-time processing in Samza:
End of Stream support: Samza has always supported streaming input sources like Kafka. Such sources are assumed to deliver an infinite stream of data. Samza now has an ‘end-of-stream’ notion to support consuming from input sources that are finite (for example, on-disk files). This enables a Samza job to shut down gracefully once it has finished consuming all of its data.
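The end-of-stream idea can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not Samza's API: the `END_OF_STREAM` sentinel and the `read_partition` and `run_job` functions are hypothetical. Each finite input emits a marker after its last record, and the run loop returns once every input has produced that marker.

```python
END_OF_STREAM = object()  # hypothetical sentinel marking the end of a finite input


def read_partition(records):
    """Yield every record of a finite input, then the end-of-stream marker."""
    yield from records
    yield END_OF_STREAM


def run_job(partitions, process):
    """Round-robin over the inputs and return once all of them have ended."""
    active = {name: read_partition(recs) for name, recs in partitions.items()}
    processed = 0
    while active:
        for name in list(active):
            message = next(active[name])
            if message is END_OF_STREAM:
                del active[name]   # this input is exhausted
            else:
                process(name, message)
                processed += 1
    return processed
```

A streaming source would simply never emit the marker, so the same loop covers both the batch and the real-time case.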

HDFS Consumer: Samza now provides first-class support for consuming data from HDFS files. This enables developers to define their processing logic once, and run it in both batch and streaming environments. This feature also allows for rapid experimentation with ETL’d HDFS data using Samza without the need to write a separate Hadoop job. (SAMZA-967)

Checkpoint Notifications:
Samza can now notify the SystemConsumer when performing a checkpoint. This enables Samza to support consumers such as Amazon Kinesis, Amazon SQS, Azure Service Bus Queues/Topics, Google Cloud Pub/Sub, ActiveMQ, etc., each of which manages checkpointing on its own. This also enables consumers to implement smart retention policies (such as deleting data once it has been consumed). (SAMZA-1042)
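As a rough illustration of why a checkpoint notification is useful, the Python sketch below models a source that keeps each message until it is told the message has been safely processed. The `QueueConsumer` class and its `on_checkpoint` callback are hypothetical and are not the interface introduced by SAMZA-1042; they only show the retention pattern described above.

```python
class QueueConsumer:
    """Toy consumer for a source that retains messages until told otherwise."""

    def __init__(self, messages):
        # offset -> message still retained by the source
        self.pending = dict(enumerate(messages))
        self.next_offset = 0

    def poll(self):
        """Return (offset, message) for the next message, or None when drained."""
        if self.next_offset not in self.pending:
            return None
        offset = self.next_offset
        self.next_offset += 1
        return offset, self.pending[offset]

    def on_checkpoint(self, offset):
        """Hypothetical callback: everything up to `offset` has been checkpointed."""
        for done in [o for o in self.pending if o <= offset]:
            del self.pending[done]   # safe to delete or acknowledge at the source
```

Messages that were delivered but not yet covered by a checkpoint stay in `pending`, so they can be redelivered after a failure.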

Support for YARN Node Labels:
Samza YARN clusters often have machines that are not homogeneous. For example, nodes could have different memory hardware, CPUs, spinning disks or SSDs. With this feature, users can assign “labels” to nodes in their YARN cluster and use them to specify where their Samza job should run. This feature allows flexibility in scheduling jobs based on trade-offs in resource requirements, performance and hardware costs. For example, stateful jobs can be configured to run on nodes with SSDs while stateless jobs can be configured to run on nodes with spinning disks. (SAMZA-1013)
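The label-matching idea can be shown with a small Python sketch. It does not use YARN or Samza configuration; the `eligible_nodes` function, the node names and the label names are all made up for illustration.

```python
def eligible_nodes(nodes, required_label):
    """Return the names of nodes that carry the requested label, sorted."""
    return sorted(name for name, labels in nodes.items() if required_label in labels)


# Hypothetical cluster: node name -> set of labels assigned by the operator.
cluster = {
    "node-1": {"ssd"},
    "node-2": {"hdd"},
    "node-3": {"ssd", "highmem"},
}

stateful_candidates = eligible_nodes(cluster, "ssd")    # nodes suited to stateful jobs
stateless_candidates = eligible_nodes(cluster, "hdd")   # nodes suited to stateless jobs
```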

Bug fixes:
This release also includes several critical bug-fixes and improvements for operational stability. Some notable ones include:
  • HttpFileSystem timeout for blocking reads when localizing containers (SAMZA-1079).
  • SamzaContainer should catch all Throwables instead of only exceptions (SAMZA-1077).
  • Deadlock between KafkaSystemProducer and KafkaProducer from kafka-clients lib (SAMZA-1069).
  • Change the commit order to support at least once processing when deduping with local store (SAMZA-1065).
Upgrades:
  • Upgraded Kafka version to 0.10. This enables us to take advantage of the critical fixes and improvements in Kafka.
  • Upgraded to Jetty 9 from Jetty 8.
  • Full support for Scala 2.11. All Samza jars will now have the scala version as 2.11 as a part of their file name. For example, samza-yarn_2.11-0.12.jar.
  • Samza is now source compatible with JDK 8 and above. Older JDKs are no longer supported.
Community Developments:
We made great community progress since the last release. We had two successful meetups where we presented Samza’s roadmap, and how Optimizely uses Samza. Several Samza use-cases in Uber and LinkedIn were featured in QCon 2016.

Future:
There are a lot of exciting features to expect in our future releases. Here are some highlights:
  • Support for Disk quota enforcement and throttling (SAMZA-956)
  • Support for high-level programming API for stream processing (SAMZA-1073)
  • Support for running Samza in stand-alone mode (SAMZA-516)
It’s a great time to get involved. You can start by reviewing the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs.

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.

Monday October 24, 2016

Announcing the release of Apache Samza 0.11.0

We are excited to announce that Apache Samza 0.11.0 has been released.

Samza is a stable and mature stream processing framework that has been powering real-time applications in production at various companies for a few years now. Samza has industry-leading support for stateful stream processing, with cutting-edge features such as:

  • Support for RocksDB based local state.
  • Incremental state checkpointing: This feature is unique compared to existing stream processing frameworks and allows Samza to support applications with large state very elegantly.
  • Minimal impact during application upgrades by minimizing state movement.
Deep support for local state allows a stateful application to scale up to 1.1 Million events/sec on a single SSD based machine.

The 0.11.0 release packs in several large improvements in runtime performance, operational stability and ease of use. Some of the key highlights include:

  • Asynchronous API and processing (SAMZA-863, doc): Prior to this release, Samza only supported a synchronous, single-threaded processing model. Increasing parallelism meant increasing the number of containers (processes), which required a lot more memory. This inefficiency was most obvious for applications that make remote calls to external services or databases. With this new feature, an application can increase parallelism very efficiently within a single container (process). In addition to a parallel processing model, we now also support a purely asynchronous processing model, which makes remote I/O much more efficient. Without this support, Samza applications that wanted to process messages asynchronously also had to handle the additional complexity of managing checkpointing themselves (by disabling auto-checkpointing in Samza). With the new support for async processing, Samza makes sure that checkpointing is automatically performed only after the async calls have completed.
  • Separate Samza framework deployment from user jobs (SAMZA-849, doc): Typically in a large organization the team that manages the Samza cluster is not the same as the teams that are running applications on top of Samza. This feature allows upgrading the Samza framework without forcing developers to explicitly upgrade their running applications. With simple config changes, it supports canary, upgrade and rollback scenarios commonly required in organizations that run tens or hundreds of jobs.
  • Samza REST API (SAMZA-865, doc): The REST API provides a rich set of operations for users to interact with their running jobs. It allows you to start, stop and list jobs, and also to run periodic monitoring scripts. This API can be integrated with deployment tooling and job dashboards for better job management.
  • Disk monitoring (SAMZA-924): A Samza YARN cluster is used to run several stream processing applications on a shared set of physical machines. In such a multi-tenant environment it is critical to have some limits on the amount of disk space used by each job, especially to store application state. This feature introduces the measurement of the disk usage for selected job directories. The disk space usage will be gathered periodically and reported to Samza metrics. In the next release this feature will be extended to also enforce the disk quotas.
  • New metrics to troubleshoot and monitor performance issues: SAMZA-972 added holistic monitoring of memory in Samza applications. With SAMZA-963 we added the ability to troubleshoot performance issues better by isolating the time spent in the application from the time spent in accessing state.
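The checkpointing guarantee described for the asynchronous processing model can be illustrated with a short Python sketch. It uses Python's `asyncio` rather than Samza's Java API, and `remote_call`, `run` and the `batch_size` parameter are hypothetical names. The property it demonstrates is the one stated above: an offset is recorded only after every asynchronous call issued up to that offset has completed.

```python
import asyncio


async def remote_call(message):
    """Stand-in for a call to an external service or database."""
    await asyncio.sleep(0.01)
    return message.upper()


async def run(messages, batch_size=4):
    """Process messages concurrently; checkpoint only after in-flight calls finish."""
    results, checkpoints = [], []

    async def handle(message):
        results.append(await remote_call(message))

    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        # Issue the calls for this batch concurrently and wait for all of them.
        await asyncio.gather(*(handle(m) for m in batch))
        # Only now is it safe to record the offset of the last message in the batch.
        checkpoints.append(start + len(batch) - 1)
    return results, checkpoints


results, checkpoints = asyncio.run(run(list("abcdefghij")))
```

Because the calls in a batch overlap, the time per batch is roughly that of one remote call rather than the sum of all of them, without adding more processes.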
Overall, 37 JIRAs were resolved in this release.

A source download of the 0.11.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

Project Status
A total of 62 contributors have contributed to the Samza Project so far. In this release 21,473 lines of code were added/changed.

With this release we also add 3 new committers to the Apache Samza community.

Recent Community Activities
There has been a lot of activity from the community during this release time frame. Here are links to some of them.

Contribute!
There are a lot more exciting features to expect in our future release. Some of them are:

It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs. Also, don’t miss the upcoming meetup on November 2. Sign up now!

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.

Wednesday August 10, 2016

Announcing the release of Apache Samza 0.10.1

I am excited to announce that Apache Samza 0.10.1 has been released. This is our fourth release as an Apache Top-level Project!

Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It was originally created at LinkedIn, where it continues to be used in production. The project is under active development with contributions from a diverse group of contributors and committers. Samza is also used in production by many other companies (such as Netflix, Uber, and TripAdvisor; see PoweredBy).

A source download of the 0.10.1 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

Overall, 72 JIRAs were resolved in this release. This is a minor release consisting of bug fixes and robustness improvements to features like coordinator stream, host-affinity etc. Samza continues to require Java 1.7+ and YARN 2.6.1+.

A few notable enhancements are:

  • Support static partition assignment in ProcessJobFactory (SAMZA-41)
  • Slow start of Samza jobs with large number of containers (SAMZA-843)
  • Change log not working properly with In memory Store (SAMZA-889)
  • Refactor and fix Container allocation logic (SAMZA-866)
  • Detect partition count changes in input streams (SAMZA-882)
  • Host Affinity - State restore doesn't work if the previous shutdown was uncontrolled (continuous offset) (SAMZA-905)
  • Broadcast stream is not added properly in the prioritized tiers in the DefaultChooser (SAMZA-944)
Some notable performance improvements are:
  • Improve the performance of the continuous OFFSET checkpointing for logged stores (SAMZA-964)
  • Host Affinity - Minimize task reassignment when container count changes (SAMZA-906)
  • Improve event loop timing metrics (SAMZA-951)
  • Avoid unnecessary flushes in CachedStore (SAMZA-873)
Known issues in this release:
  • Incompatible change in Kafka producer that does not honor custom partitioners (SAMZA-839)

We've also made a lot of community progress during this release:

There are a lot more exciting features to expect in our future release. Some of them are:
  • Support multi-threading in samza tasks (SAMZA-863)
  • Disk Quotas: Add throttler and disk quota enforcement (SAMZA-956)
  • REST API for starting and stopping Samza jobs (SAMZA-865)
  • Samza standalone mode (SAMZA-516)
  • High-level language for Samza (SAMZA-390)

It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs. Also, don’t miss the upcoming meetup on August 23. Sign up now!

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.

Tuesday December 22, 2015

Announcing the release of Apache Samza 0.10.0

I am very excited to announce that the much-awaited Apache Samza 0.10.0 has been released. This is our third release as an Apache Top-level Project. Samza is a distributed stream processing framework that uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. The project graduated from Apache Incubator in January of this year. It was originally created at LinkedIn, where it continues to be used in production. The project is currently under active development with contributions from a diverse group of contributors and committers. Since the last release in July 2015, there has been a significant increase in the adoption of Samza in the industry (e.g. Samza is in production at Uber and Netflix; see PoweredBy).

A source download of the 0.10.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

Overall, 130 JIRAs were resolved in this release. A few highlights:

  • Introduced Coordinator Stream to support large and dynamic configuration in a Samza job (SAMZA-348), along with a command-line tool to write to the Coordinator Stream (SAMZA-704)
  • Added support for Broadcast Stream (SAMZA-676)
  • Implemented host-affinity feature in Yarn for more robust recovery of stateful jobs (SAMZA-617)
  • Upgraded RocksDB JNI version to 3.13.1 (SAMZA-747), along with support for TTL (SAMZA-537)
  • Introduced an HDFS producer (SAMZA-693) and an ElasticSearch producer (SAMZA-654), to allow writing directly from Samza to HDFS stores and ElasticSearch respectively
  • Implemented tools to better support troubleshooting of RocksDB stores in the job (SAMZA-598)
  • Fixed some performance and stability issues that got introduced (SAMZA-798, SAMZA-754, SAMZA-723)

Known issues in this release:

  • Negative RocksDB TTL is not handled properly (SAMZA-838)
  • Slow start of Samza jobs with large number of containers (SAMZA-843)
  • Incompatible change in Kafka producer that does not honor custom partitioners (SAMZA-839)

We've also made a lot of community progress during this release:

  • Added 3 more companies in the powered by page (Uber, State.com, Netflix)
  • 2 Successful meetups were held - one in July and the other in October
  • Accepted patches from 37 distinct contributors
  • 917 emails sent to the developer mailing list in past 3 months

There are a lot of exciting features to expect in our future release. Some of them are:

Starting with the 0.10.0 release, Samza requires Java 1.7+ and YARN 2.6.1+.

It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs.

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.

Monday July 13, 2015

Announcing the release of Apache Samza 0.9.1

I am very excited to announce that Apache Samza 0.9.1 has been released. It's our second release as an Apache Top-level Project. Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. The project was originally created at LinkedIn, where it's in production use. It entered Apache Incubator in 2013 and graduated in January 2015. The project is currently under active development by a diverse group of contributors and committers.

A source download of the 0.9.1 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

This is a bug-fix release. In all, 7 JIRAs were resolved. A few highlights:

  • Iterator.remove breaks caching layer (SAMZA-658)
  • Shutdown hook does not wait for container to finish (SAMZA-616)
  • Deserialization error causes SystemConsumers to hang (SAMZA-608)
  • Samza auto-creates changelog stream without sufficient partitions when container number > 1 (SAMZA-662)
  • Bootstrap hangs (SAMZA-720)
  • Fix warnings in samza-api Javadocs (SAMZA-712)

We've also made some community progress during this release:

There are a lot of exciting features to expect in our future release. Some of them are:

The 0.9.1 release still supports Java 1.6 to maintain backward compatibility with 0.9.0. We will require Java 1.7+ starting with the 0.10.0 release.

Now is a good time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs.

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.

Friday April 03, 2015

Announcing the release of Apache Samza 0.9.0

I am very excited to announce that Apache Samza 0.9.0 has been released. It's our first release as an Apache Top-level Project. Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. The project was originally created at LinkedIn, where it's in production use. It entered Apache Incubator in 2013 and graduated in January 2015. The project is currently under active development by a diverse group of contributors and committers.

A source download of the 0.9.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

In all, 95 JIRAs were resolved in this release. A few highlights:

We've also made a lot of community progress during this release:

There are a lot of exciting features to expect in our future release. Some of them are:

The 0.9.0 release will be our last release to support Java 1.6. We will require Java 1.7+ starting with the 0.10.0 release.

Now is a good time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs.

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.

Tuesday December 09, 2014

Announcing the release of Apache Incubator Samza 0.8.0

I am very excited to announce that Apache Incubator Samza 0.8.0 has been released. Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. The project entered Apache Incubator in 2013 and was originally created at LinkedIn, where it's in production use. The project is currently under active development from a diverse group of committers. This release builds off of our past 0.7.0 release, and is likely to be our last release as an incubating Apache project before we graduate to a top level project.

A source download of the 0.8.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

In all, 136 JIRAs were resolved in this release. Notable work done includes:

  • Made major performance improvements. A single SamzaContainer can process over 1,000,000 messages/sec now. (SAMZA-245)
  • Added RocksDB state management support. (SAMZA-236)
  • Added support for pluggable partition-container assignment strategies. (SAMZA-123)
  • Added support for Java 8 and Gradle 2.0, and dropped support for Scala 2.8 and 2.9. (SAMZA-202)
  • Upgraded YARN support to 2.4.0. (SAMZA-186, SAMZA-58)
  • Several metrics improvements, including adding a new timer metric. (SAMZA-349, SAMZA-407, SAMZA-408)
  • Made Samza's checkpoint topics smaller by taking advantage of Kafka's log compaction feature. (SAMZA-388)
  • Added an in-memory key-value store that can be used in place of RocksDB/LevelDB for small state. (SAMZA-256)
  • Completely overhauled Samza's YARN AM UI to make it much cleaner and more functional. (SAMZA-32)
  • Fixed several usability issues to make configuring JVM properties easier. (SAMZA-276, SAMZA-20, SAMZA-377, SAMZA-109)
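To show why log compaction shrinks a checkpoint topic, here is a small Python sketch of the general idea: for each key, only the most recent entry needs to be retained. It is a conceptual model, not Kafka's or Samza's code; the `compact` function is hypothetical, and the choice of a task name as the key and an offset as the value is an assumption made for illustration.

```python
def compact(log):
    """Keep only the most recent entry for each key, preserving log order."""
    latest = {}
    for position, (key, _value) in enumerate(log):
        latest[key] = position   # later positions overwrite earlier ones
    return [entry for position, entry in enumerate(log) if latest[entry[0]] == position]


# Hypothetical checkpoint log: (task name, last processed offset).
checkpoint_log = [
    ("task-0", 5),
    ("task-1", 3),
    ("task-0", 9),
    ("task-0", 12),
]

compacted = compact(checkpoint_log)
```

Since only the latest checkpoint per task matters on restart, the compacted log grows with the number of tasks rather than with the number of checkpoints ever written.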

We've also made a lot of community progress during this release:

Even after all this work, there's still a lot to be done. In our next release (0.9.0), we're planning to work on:
  • Configuring Samza jobs through a stream. (SAMZA-348)
  • Supporting Scala 2.11. (SAMZA-469)
  • Upgrading Samza's Kafka producer API. (SAMZA-227)
  • Publish container logs to a stream to integrate with ELK. (SAMZA-310)

Now is a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs.

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.

Friday July 11, 2014

Announcing the release of Apache Incubator Samza 0.7.0

I am very excited to announce that Apache Incubator Samza 0.7.0 has been released. Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. The project entered Apache Incubator in 2013 and was originally created at LinkedIn, where it's in production use. The project is currently under active development from a diverse group of committers.

A source download of the 0.7.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.

In all, 156 JIRAs were resolved in this release. Notable work done includes:

We've also made a lot of community progress during this release:

Even after all this work, there's still a lot to be done. In our next release (0.8.0), we're planning to focus on performance. This work includes:

Now is a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs.

I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
