Apache Samza
Announcing the release of Apache Samza 1.8.0
We are thrilled to announce the release of Apache Samza 1.8.0.
Key Features:
Below is a list of key features included in this release:
- [SEP-31] : Pipeline Drain - Support the ability to drain pipelines to allow incompatible intermediate schema changes
- [SAMZA-2757] : Make Samza Compatible with Java 11
The full list of Jira tickets addressed in this release can be found [here].
Upgrade Instructions
For applications that are already on Samza 1.7.0, updating your dependencies to use Samza 1.8.0 should be sufficient to upgrade.
For applications that are on version 1.6 or below, please see the upgrade instructions for 1.7.0.
Sources downloads
A source download of Samza 1.8.0 is available here, and is also available in Apache’s Maven repository.
See Samza’s download page for details and Samza’s feature preview for new features.
Posted at 01:09AM Jan 19, 2023
by xinyu in General |
Announcing the release of Apache Samza 1.7.0
We are thrilled to announce the release of Apache Samza 1.7.0.
Key Features:
Below is a list of key features included in this release:
- [SEP-28] : Samza State Backend Interface and Checkpointing Improvements
- [SEP-29] : Blob Store as backend for Samza State backup and restore
- [SEP-30] : Adding partial update api to Table API
The full list of Jira tickets addressed in this release can be found [here].
Upgrade Instructions:
For applications that are already on Samza 1.6.0, updating your dependencies to use Samza 1.7.0 should be sufficient to upgrade.
For applications that are on version 1.5 or below, please see the upgrade instructions for 1.6.0.
Sources downloads
A source download of Samza 1.7.0 is available [here], and is also available in Apache’s Maven repository.
See Samza’s download [page] for details and Samza’s feature preview for new features.
Posted at 01:07AM Jan 19, 2023
by xinyu in General |
Announcing the release of Apache Samza 1.5.0
IMPORTANT NOTE: As noted in the last release, this release contains backward-incompatible changes to Samza job submission. Details can be found in SEP-23: Simplify Job Runner.
We are thrilled to announce the release of Apache Samza 1.5.0.
Today, Samza forms the backbone of hundreds of real-time production applications across a multitude of companies, such as LinkedIn, Slack, and Redfin, among many others. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Beam Samza Runner that marries Beam’s best in class support for EventTime based windowed processing and sophisticated triggering with Samza’s stable and scalable stateful processing model.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A Table API that provides a common abstraction for accessing remote or local databases and allows developers to “join” an input event stream with such a Table.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
New Features, Upgrades and Bug Fixes:
This release brings the following features, upgrades, and capabilities (highlights):
Samza Container Placement
The Container Placements API gives you the ability to move or restart one or more containers (either active or standby) of your cluster-based applications from one host to another without restarting your application. You can use these APIs to build maintenance, balancing, and remediation tools.
Simplify Job Runner & Configs
Job Runner now simply submits the Samza job to the YARN ResourceManager without executing any user code; job planning happens on ClusterBasedJobCoordinator instead. This simplified workflow addresses security requirements where job submissions need to be isolated in order to execute user code, as well as operational pain points where deployment failures could happen at multiple places.
Full list of the jiras addressed in this release can be found here.
Upgrading your application to Apache Samza 1.5.0
ConfigFactory is deprecated because Job Runner no longer loads the full job config. Instead, ConfigLoaderFactory is introduced; it is executed on ClusterBasedJobCoordinator to fetch the full job config. If you are using the default PropertiesConfigFactory, simply switching to the default PropertiesConfigLoaderFactory will work. If you are using a custom ConfigFactory, please create its counterpart by implementing ConfigLoaderFactory.
Configs related to job submission must be explicitly provided to Job Runner, as it no longer loads the full job config.
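To make the two-step flow above more concrete, here is a minimal sketch in plain Java. It does not use Samza's classes: SketchConfigLoader, SketchConfigLoaderFactory, PropertiesTextLoaderFactory, and the sketch.loader.properties.text key are hypothetical stand-ins invented for this illustration, and the rule that submission configs override loaded values is an assumption of the sketch. The point is only the shape of the change: the runner holds a small submission config, and the full config is loaded later, on the coordinator side.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-ins for illustration only; these are NOT Samza's real interfaces.
public class ConfigLoaderSketch {

    /** Loads the full job config; in this sketch it runs on the coordinator side. */
    interface SketchConfigLoader {
        Map<String, String> load();
    }

    /** Builds a loader from the small set of configs the job runner was given. */
    interface SketchConfigLoaderFactory {
        SketchConfigLoader getLoader(Map<String, String> submissionConfig);
    }

    /** Parses properties text; a real loader would more likely read a file path. */
    static class PropertiesTextLoaderFactory implements SketchConfigLoaderFactory {
        @Override
        public SketchConfigLoader getLoader(Map<String, String> submissionConfig) {
            String text = submissionConfig.getOrDefault("sketch.loader.properties.text", "");
            return () -> {
                Properties props = new Properties();
                try {
                    props.load(new StringReader(text));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                Map<String, String> full = new HashMap<>();
                for (String name : props.stringPropertyNames()) {
                    full.put(name, props.getProperty(name));
                }
                return full;
            };
        }
    }

    /** Coordinator-side step: load the full config, then let submission configs win (sketch assumption). */
    static Map<String, String> loadFullConfig(SketchConfigLoaderFactory factory,
                                              Map<String, String> submissionConfig) {
        Map<String, String> full = new HashMap<>(factory.getLoader(submissionConfig).load());
        full.putAll(submissionConfig);
        return full;
    }

    public static void main(String[] args) {
        // The runner only knows the explicitly provided submission configs.
        Map<String, String> submission = new HashMap<>();
        submission.put("job.name", "demo-job");
        submission.put("sketch.loader.properties.text", "task.class=com.example.DemoTask\n");

        Map<String, String> full = loadFullConfig(new PropertiesTextLoaderFactory(), submission);
        System.out.println(full.get("job.name") + " " + full.get("task.class"));
        // prints: demo-job com.example.DemoTask
    }
}
```

A custom ConfigFactory would be migrated in the same spirit: move the loading logic behind the loader, and pass whatever the loader needs (a path, a URL) as explicit submission configs.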
Simplify Job Runner
SAMZA-2488 Add JobCoordinatorLaunchUtil to handle common logic when launching job coordinator
SAMZA-2471 Simplify CommandLine
SAMZA-2458 Update ProcessJobFactory and ThreadJobFactory to load full job config
SAMZA-2453 Update ClusterBasedJobCoordinator to support Beam jobs
SAMZA-2441 Update ApplicationRunnerMain#ApplicationRunnerCommandLine not to load local file
SAMZA-2420 Update CommandLine to use config loader for local config file
Container Placement API
SAMZA-2402 Tie Container placement service and Container placement handler and validate placement requests
SAMZA-2379 Support Container Placements for job running in degraded state
SAMZA-2378 Container Placements support for Standby containers enabled jobs
Bug Fixes
SAMZA-2515 Thread safety for Kafka consumer in KafkaConsumerProxy
SAMZA-2511 Handle container-stop-fail in case of standby container failover
SAMZA-2510 Incorrect shutdown status due to race between runloop thread and process callback thread
SAMZA-2506 Inconsistent end of stream semantics in SystemStreamPartitionMetadata
SAMZA-2464 Container shuts down when task fails to remove old state checkpoint dirs
SAMZA-2468 Standby container needs to respond to shutdown request
Other Improvements
SAMZA-2519 Support duplicate timer registration
SAMZA-2508 Use cytodynamics classloader to launch job container
SAMZA-2478 Add new metrics to track key and value size of records written to RocksDb
SAMZA-2462 Adding metric for container thread pool size
Sources downloads
A source download of Samza 1.5.0 is available here, and is also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features.
Posted at 12:28AM Jul 02, 2020
by Bharath in General |
Announcing the release of Samza 1.4
NOTE: We may introduce backward-incompatible changes to Samza job submission in the future 1.5 release. Details can be found in SEP-23: Simplify Job Runner.
We are thrilled to announce the release of Apache Samza 1.4.0.
Today, Samza forms the backbone of hundreds of real-time production applications across a multitude of companies, such as LinkedIn, Slack, and Redfin, among many others. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Beam Samza Runner that marries Beam’s best in class support for EventTime based windowed processing and sophisticated triggering with Samza’s stable and scalable stateful processing model.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A Table API that provides a common abstraction for accessing remote or local databases and allows developers to “join” an input event stream with such a Table.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
New Features, Upgrades and Bug Fixes:
This release brings the following features, upgrades, and capabilities (highlights):
- Improvements regarding management and monitoring of local state
- Improvements to the Samza SQL API
- New system producer for Azure blob storage
- Bug fixes
Full list of the jiras addressed in this release can be found here.
Upgrading your application to Apache Samza 1.4.0
If an application is being upgraded to Samza 1.4, please note the following usage changes.
- The samza-autoscaling module is no longer supported, and the module has been removed.
State
- SAMZA-2386 Get store names should return correct store names in the presence of side inputs
- SAMZA-2324 Adding KV store metrics for rocksdb
- SAMZA-2416 Adding null-check before incrementing metrics for bytesSerialized
- SAMZA-2397 Samza rocksdb metrics do not emit values after Samza version >= 1.1
- SAMZA-2447 Checkpoint dir removal should only search in valid store dirs
SQL
- SAMZA-2362 Include the ScalarUDF implementations with the configured package prefix in ReflectionBasedUdfResolver.
- SAMZA-2375 Samza-sql: Store udf original name for display purposes
- SAMZA-2376 Samza-sql: Samza sql should handle sql statements with trailing semi-colon (;)
- SAMZA-2396 Support dynamic addition of jars in ReflectionUdfResolver.
- SAMZA-2415 Samza-Sql: Fix AvroRelConverter to only consider cached schema while populating SamzaSqlRelRecord for all the nested records.
- SAMZA-2425 Samza-sql: support subquery in joins
- SAMZA-2455 Validate the argument types in SamzaSQL UDF on execution planning phase
Azure Blob Storage system producer
- SAMZA-2421 Add SystemProducer for Azure Blob Storage
Job coordinator dependency isolation (experimental)
- SAMZA-2332 [AM isolation] YarnJob should pass new command and additional environment variables for AM deployment
- SAMZA-2333 [AM isolation] Use cytodynamics classloader to launch job coordinator
Bug fixes
- SAMZA-2334 ProxyGrouper selection based on Host Affinity not whether job is stateful
- SAMZA-2372 Null pointer exception in LocalApplicationRunner
- SAMZA-2443 Upgrade Jetty version to prevent AM file descriptor leak
- SAMZA-2446 Invoke onCheckpoint only for registered SSPs
- SAMZA-2463 Duplicate firings of processing timers
- SAMZA-2461 Fix Concurrent Modification Exception in InMemorySystem
Other improvements
- SAMZA-2364 Include the localized resource lib directory in the classpath of SamzaContainer
- Clean up unused org.apache.samza.autoscaling module
- SAMZA-2444 JobModel save in CoordinatorStreamStore resulting flush for each message
- SAMZA-2452 Adding internal autosizing related configs
Sources downloads
A source download of Samza 1.4.0 is available here, and is also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features.
Posted at 12:19AM Mar 19, 2020
by pmaheshwari in General |
Announcing the release of Samza 1.3.1
We have identified some issues with the previous release, Apache Samza 1.3.0. To address those problems, we have released Apache Samza 1.3.1 with the specific bug fixes listed below:
- SAMZA-2447 Checkpoint dir removal should only search in valid store dirs (#1261)
- SAMZA-2446 Invoke onCheckpoint only for registered SSPs (#1260)
- SAMZA-2431 Fix the checkpoint and changelog topic auto-creation. (#1251)
- SAMZA-2434 Fix the coordinator stream creation workflow
- SAMZA-2423 Heartbeat failure causes incorrect container shutdown (#1240)
- SAMZA-2305 Stream processor should ensure previous container is stopped during a rebalance (#1213)
Sources downloads
A source download of Samza 1.3.1 is available here, and is also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features.
Posted at 04:51PM Feb 20, 2020
by Hai Lu in General |
Announcing the release of Samza 1.3
We are thrilled to announce the release of Apache Samza 1.3.0
Today Samza forms the backbone of hundreds of real-time production applications across a multitude of companies, such as LinkedIn, VMWare, Slack, Redfin among many others. This release of Samza adds a variety of features and capabilities to Samza’s existing arsenal, coupled with improved documentation, code snippets, examples. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Beam Samza Runner that marries Beam’s best in class support for EventTime based windowed processing and sophisticated triggering with Samza’s stable and scalable stateful processing model.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A Table API that provides a common abstraction for accessing remote or local databases and allows developers to "join" an input event stream with such a Table.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
New Features, Upgrades and Bug Fixes
This release brings the following features, upgrades, and capabilities (highlights):
- Startpoint support improvement
- Samza SQL improvement
- Table API improvement
- Miscellaneous bug fixes
Startpoint support improvement
- SAMZA-2201 Startpoints - Integrate fan out with job coordinators
- SAMZA-2215 StartpointManager fix for previous CoordinatorStreamStore refactor
- SAMZA-2220 Startpoints - Fully encapsulate resolution of starting offsets in OffsetManager
Samza SQL improvement
- SAMZA-2234 Samza SQL : Provide access to Samza context to the Samza SQL UDFs
- SAMZA-2313 Samza-sql: Add validation for Samza sql statements
- SAMZA-2354 Improve UDF discovery in samza-sql
Table API improvement
- SAMZA-2191 support batching for remote tables
- SAMZA-2200 Update table sendTo() and join() operation to accept additional arguments
- SAMZA-2219 Add a dummy table read function
- SAMZA-2309 Remote table descriptor requires read function
Miscellaneous bug fixes
- SAMZA-2198 Containers process always takes task.shutdown.ms to shut down
- SAMZA-2293 Propagate the watermark future to StreamOperatorTask correctly
Important Announcement
We may introduce backward-incompatible changes to Samza job submission in the future 1.4 release. Details can be found in SEP-23: Simplify Job Runner.
Sources downloads
A source download of Samza 1.3.0 is available here, and is also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features.
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs.
Posted at 09:16AM Dec 10, 2019
by Hai Lu in General |
Announcing the release of Samza 1.1
We are thrilled to announce the release of Apache Samza 1.1.0
Today Samza forms the backbone of hundreds of real-time production applications across a multitude of companies, such as LinkedIn, VMWare, Slack, Redfin among many others. This release of Samza adds a variety of features and capabilities to Samza’s existing arsenal, coupled with improved documentation, code snippets, examples. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Beam Samza Runner that marries Beam’s best in class support for EventTime based windowed processing and sophisticated triggering with Samza’s stable and scalable stateful processing model.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A Table API that provides a common abstraction for accessing remote or local databases and allows developers to "join" an input event stream with such a Table.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
New Features, Upgrades and Bug Fixes
This release brings the following features, upgrades, and capabilities:
- We have created a new Samza Stream Processing video series on YouTube
- New and improved documentation, code snippets, and examples for using the latest version of Samza with Apache Beam (code samples are here: https://github.com/apache/samza-beam-examples)
API enhancements and simplifications:
- SAMZA-1981 Consolidate table descriptors to samza-api.
- SAMZA-1998 Table API refactoring.
- SAMZA-1980 Rename LocalStoreBackedTable to LocalTable.
- SAMZA-2043 Consolidate ReadableTable and ReadWriteTable.
- SAMZA-2012 Add API for wiring an external context through to application processing.
- SAMZA-2026 Refactor remote table API to separate retry policy settings.
- SAMZA-2041 Add system descriptors for HDFS and Kinesis.
- SAMZA-2081 Samza SQL: Type system for Samza SQL.
- SAMZA-2106 Samza App & Job Config Refactor.
State Store Restoration:
- SAMZA-2018 State restore improvements using RocksDB writebatch API.
Standalone Improvements:
- SAMZA-1973 Unify the TaskNameGrouper interface for yarn and standalone.
- SAMZA-1952 StreamPartitionCountMonitor for standalone.
Other Upgrades and Bug-fixes:
- SAMZA-1638 Recreate SystemProducer on KafkaCheckpointManager.writeCheckpoint failure.
- SAMZA-1946 Problem with Race between TimerListener initialization and timers fired from init().
- SAMZA-2004 Add ability to disable table metrics.
- SAMZA-2013 Account for cycles in graph traversal within Execution Planner.
- SAMZA-2015 Refactor timer handling in tables to be consistent with stores.
- SAMZA-2072 Update guava to 23.0.
- SAMZA-2090 Fix flush behavior for remote and hybrid tables.
- SAMZA-2108 Check for host affinity config before resolving preferred host matching.
- SAMZA-2109 Reduce default-buffer sizes for per-partition queues.
- SAMZA-2118 Improve the shutdown sequence of AsyncRunLoop.
- SAMZA-2119 Upgrading yarn-client version to 2.7.1.
- SAMZA-2122 Fix the task caught-up logic which doesn't handle no incoming messages
API Updates
The following imports for Table API have been updated:
- Rename the import org.apache.samza.storage.kv.descriptors.BaseLocalStoreBackedTableDescriptor to org.apache.samza.storage.kv.descriptors.BaseLocalTableDescriptor
- Rename the import org.apache.samza.table.remote.descriptors.RemoteTableDescriptor to org.apache.samza.table.descriptors.RemoteTableDescriptor
- Rename the import org.apache.samza.table.caching.descriptors.CachingTableDescriptor to org.apache.samza.table.descriptors.CachingTableDescriptor
Configurations Updates
The job.name and job.id configs are now deprecated in favor of the app.name and app.id configs, respectively.
Sources downloads
A source download of Samza 1.1.0 is available here, and is also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features.
Community Developments
A Stream Processing with Apache Kafka & Apache Samza meetup/symposium was held on March 20th, with the following Samza presentation:
- Apache Samza 1.0: Recent Advances and our plans for future in Stream Processing
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 09:48PM Mar 12, 2019
by xinyu in General |
Announcing the release of Samza 1.0
We’re thrilled to announce the release of Apache Samza 1.0.
Today Samza forms the backbone of hundreds of real-time production applications across a multitude of companies, such as LinkedIn, VMWare, Slack, Redfin among many others. This release of Samza adds a variety of features and capabilities to Samza’s existing arsenal, coupled with new and improved documentation, code snippets, examples, and a brand-new website design! Here are a few selected highlights:
Stable high level APIs that allow creating complex processing pipelines with ease.
Beam Samza Runner now marries Beam’s best in class support for EventTime based windowed processing and sophisticated triggering with Samza’s stable and scalable stateful processing model.
Table API that provides a common abstraction for accessing remote or local databases. Developers are now able to “join” an input event stream with such a Table.
Integration Test Framework to enable effortless testing of Samza jobs without deploying a Kafka, Yarn, or Zookeeper cluster.
Support for Apache Log4j2 allowing improved logging performance, customization, and efficiency.
Upgraded Kafka client and consumer.
An interactive shell for Samza SQL for seamless formulation, development, and testing of SamzaSQL queries.
Side-input support that allows using log-compacted data sources to populate KV state for Samza applications.
An improved website with detailed documentation and lots of code samples!
In addition, Samza 1.0 brings numerous bug-fixes, upgrades, and improvements listed below.
New features
Samza 1.0 brings full-feature support for the following:
Improved Stable High Level APIs
Samza 1.0 brings Descriptor APIs that allow applications to specify their input and output systems and streams in code. Samza’s new Context APIs provide applications unified access to job-level, container-level, task-level, and application-level context and capabilities. This also simplifies Samza’s ApplicationRunner interface.
This API evolution requires a few simple modifications to application code, which we describe in detail in our upgrade steps.
Beam Runner Support
Samza’s Beam Runner enables executing Beam pipelines over Samza. This enables Samza applications to create complex processing pipelines that require event-time based processing, varying types of event-time based windowing, and more. This feature is supported in both the YARN and standalone deployment models.
Joining Streams and Tables
Samza’s Table API provides developers with unified access to local and remote data sources such as Key-Value stores or web services, while providing features such as rate-limiting, throttling, and caching capabilities. This provides first-class API primitives for building Stream-Table join jobs. Learn more about the use, semantics, and examples for Table API here.
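As a conceptual illustration of a stream-table join, the sketch below uses plain Java collections in place of a stream and a table. It is not written against Samza's Table API: the class name StreamTableJoinSketch, the join method, and the record layout are all made up for this example, and it assumes inner-join semantics (events without a matching table row are dropped).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: plain Java collections stand in for a stream and a table.
public class StreamTableJoinSketch {

    /** Inner-joins each (memberId, page) event with a profile looked up by memberId. */
    static List<String> join(List<String[]> pageViews, Map<String, String> profileTable) {
        List<String> joined = new ArrayList<>();
        for (String[] event : pageViews) {
            String memberId = event[0];
            String page = event[1];
            String profile = profileTable.get(memberId); // table lookup by join key
            if (profile != null) {                       // drop events with no matching row
                joined.add(memberId + "," + page + "," + profile);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> profiles = new HashMap<>();
        profiles.put("m1", "Alice");

        List<String[]> views = new ArrayList<>();
        views.add(new String[] {"m1", "/home"});
        views.add(new String[] {"m2", "/jobs"}); // no profile row, so it is dropped

        System.out.println(join(views, profiles)); // prints [m1,/home,Alice]
    }
}
```

In a real job the table would typically be a local store or a remote service rather than a HashMap, which is where features such as caching and rate-limiting become relevant.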
Test Samza without ZK, Yarn or Kafka
Samza 1.0 brings a test framework that allows testing Samza applications using in-memory input and output. Users can now set up tests and test pipelines for their applications without needing to set up any other services, such as Kafka, YARN, or Zookeeper.
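The general idea of testing with in-memory input and output can be sketched in plain Java as follows. This does not use the Samza test framework's classes; runInMemory and splitWords are hypothetical names chosen for the example, and the sketch assumes the processing logic can be expressed as a function from one message to zero or more output messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Conceptual sketch only: it does not use Samza's test framework classes.
public class InMemoryPipelineTestSketch {

    /** Feeds an in-memory input list through the logic and collects all outputs in order. */
    static <I, O> List<O> runInMemory(List<I> input, Function<I, List<O>> logic) {
        List<O> output = new ArrayList<>();
        for (I message : input) {
            output.addAll(logic.apply(message));
        }
        return output;
    }

    /** Example logic under test: split a line into lower-cased words. */
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> out = runInMemory(
                Arrays.asList("Hello Samza", "hello"),
                InMemoryPipelineTestSketch::splitWords);
        System.out.println(out); // prints [hello, samza, hello]
    }
}
```

Because both ends are ordinary lists, the expected output can be asserted directly in a unit test, with no broker or cluster to start.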
Log4J2 support
Samza now supports Apache Log4j 2 for system and application logging. Log4j 2 is an upgrade to Log4j that provides significant improvements over its predecessor, Log4j 1.x, such as better throughput and latency, custom log levels, and a pluggable logging architecture.
Kafka upgrade
This release upgrades Samza to use Kafka’s high-level consumer (Kafka v0.11.1.62). This brings latency and throughput benefits for Samza applications that consume from Kafka, in addition to bug-fixes. This also means Samza applications can now better their utilization of the underlying Kafka cluster.
SamzaSQL Shell
SamzaSQL now provides a shell for users to type in their SQL queries, while Samza does the heavy lifting of wiring the inputs and outputs and sizing the application in the background. This is great for testing and experimenting with queries while formulating your application logic, and is especially suited for data scientists and tinkerers.
Side-inputs
Samza 1.0 brings the ability to leverage existing log-compacted data sources (e.g., Kafka topics) to populate KV state for Samza applications. If your data processing pipeline involves Hadoop-to-Kafka push, this feature alleviates the need for your Samza job to create separate Kafka-topics to back KV state.
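As a rough illustration of the idea, the sketch below replays a keyed, log-compacted stream into an in-memory KV store. It uses plain Java collections rather than Samza or Kafka classes; the class name SideInputStateSketch and the bootstrap method are hypothetical, and treating a null value as a delete is an assumption borrowed from Kafka's tombstone convention for compacted topics.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: a list of keyed records stands in for a log-compacted topic.
public class SideInputStateSketch {

    /** Replays records in log order: the latest value per key wins, and a null value deletes the key. */
    static Map<String, String> bootstrap(List<Map.Entry<String, String>> log) {
        Map<String, String> store = new HashMap<>();
        for (Map.Entry<String, String> record : log) {
            if (record.getValue() == null) {
                store.remove(record.getKey());   // tombstone (assumed delete marker)
            } else {
                store.put(record.getKey(), record.getValue());
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> log = new ArrayList<>();
        log.add(new SimpleEntry<String, String>("a", "1"));
        log.add(new SimpleEntry<String, String>("b", "2"));
        log.add(new SimpleEntry<String, String>("a", "3"));   // overwrites a=1
        log.add(new SimpleEntry<String, String>("b", null));  // deletes b

        System.out.println(bootstrap(log)); // prints {a=3}
    }
}
```

Since the source stream already carries the latest value per key, the store can be rebuilt from it directly, which is why no separate backing topic is needed for this state.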
Improved website, documentation, and samples
We’ve re-designed the Samza website making it easier to find details on key Samza concepts and patterns. All documentation has been revised and rewritten, keeping in mind the feedback we got from our customers. We’ve revised and added sample application code to showcase Samza 1.0 and the use of its new APIs.
Enhancements and Upgrades
This release brings the following enhancements, upgrades, and capabilities:
API enhancements and simplifications
SAMZA-1789: unify ApplicationDescriptor and ApplicationRunner for high- and low-level APIs in YARN and standalone environment
SAMZA-1804: System and stream descriptors
SAMZA-1858: Public APIs for shared context
SAMZA-1763: Add async methods to Table API
SAMZA-1786: Introduce the metadata store abstraction
SAMZA-1859: Zookeeper implementation of MetadataStore
SAMZA-1788: Add the LocationIdProvider abstraction
Upgrades and Bug-fixes
SAMZA-1768: Handle corrupted OFFSET file
SAMZA-1817: Long classpath support for non-split deployments
SAMZA-1719: Add caching support to table-API
SAMZA-1783: Add Log4j2 functionality in Samza
SAMZA-1868: Refactor KafkaSystemAdmin from using SimpleConsumer
SAMZA-1776: Refactor KafkaSystemConsumer to remove the usage of deprecated SimpleConsumer client
SAMZA-1730: Adding state validation in StreamProcessor before any lifecycle operation and group coordination
SAMZA-1695: Clear events in ScheduleAfterDebounceTime on session expiration
SAMZA-1647: Fix race conditions in StreamProcessor
SAMZA-1371: Some Samza Containers get stuck at “Starting BrokerProxy”
SAMZA-1648: Integration Test Framework & Collection Stream Impl
SAMZA-1748: Failure tests in the standalone deployment
A source download of Samza 1.0 is available here, and in Apache’s Maven repository.
Community Developments
Symposiums on Stream Processing with Apache Samza and Apache Kafka were held on July 19th and on October 23rd. Both were attended by more than 350 participants from across the industry. They featured in-depth talks on Samza’s Beam integration, its use at LinkedIn for real-time notifications, Kafka replication at Uber, and Kafka Cruise Control, among many others. Samza was also the focus of a talk at Strange Loop '18, which covered its scalability, performance, extensibility, and programmability in depth.
Posted at 09:26AM Nov 27, 2018
by jagadish in General |
Announcing the release of Apache Samza 0.14.1
We are very excited to announce the release of Apache Samza 0.14.1
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber, Slack, Redfin, TripAdvisor, etc) for years now. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
Enhancements, Upgrades and Bug Fixes
This is a minor release which contains improvements over multiple areas. In particular:
- SQL
- SAMZA-1681 Add support for handling older record schema versions in AvroRelConverter
- SAMZA-1671 Add insert into table support
- SAMZA-1651 Implement GROUP BY SQL operator
- Standalone
- SAMZA-1689 Add validations before state transitions in ZkBarrierForVersionUpgrade
- SAMZA-1686 Set finite operation timeout when creating zkClient
- SAMZA-1667 Skip storing configuration as a part of JobModel in zookeeper data nodes
- SAMZA-1647 Fix NPE in JobModelExpired event handler
- Eventhub
- SAMZA-1688 Use per partition eventhubs client
- SAMZA-1676 Miscellaneous fix and improvement for eventhubs system
- SAMZA-1656 EventHubSystemAdmin does not fetch metadata for valid streams
- Host-affinity
- SAMZA-1687 Prioritize preferred host requests over ANY-HOST requests
- SAMZA-1649 Improve host-aware allocation to account for strict locality
Overall, 51 JIRAs were resolved in this release. A source download of the 0.14.1 release is available here. The release JARs are also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features. Users on Scala 2.12 need a JDK version newer than 1.8.0_111 to run the 0.14.1 release.
Community Developments
On March 21st, we held a meetup on Stream Processing with Apache Kafka & Apache Samza, which featured the following Samza presentations:
- Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza (Slides)
- Building Venice with Apache Kafka & Samza
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 11:03PM May 25, 2018
by xinyu in General |
Announcing the release of Apache Samza 0.14.0
We are very excited to announce the release of Apache Samza 0.14.0
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber, Slack, Redfin, TripAdvisor, etc) for years now. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Flexible deployment model for running applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
New Features, Upgrades and Bug Fixes
The 0.14.0 release contains the following highly anticipated features:
- Samza SQL
- Azure EventHubs producer, consumer and checkpoint provider
- AWS Kinesis consumer
Overall, 65 JIRAs were resolved in this release. For more details about this release, please check out the release notes.
Community Developments
We’ve made great community progress since the last release (0.13.1). We presented unified data processing with Samza at the Big Data Spain 2017 conference and the Dataworks Summit in Sydney, and gave a demo at the @scale conference in San Jose. Here are the details of these conferences:
- Nov 17, 2017 - Unified Stream Processing at Scale with Apache Samza (BigDataSpain 2017) (Slides)
- Sept 21, 2017 - Unified Batch & Stream Processing with Apache Samza (Dataworks Summit Sydney 2017) (Slides)
- Aug 31, 2017 - Demo of Stream Processing@LinkedIn (@scale conference 2017) (Slides)
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs.
I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 05:38PM Jan 05, 2018
by xinyu in General |
Announcing the release of Apache Samza 0.13.1
We are very excited to announce the release of Apache Samza 0.13.1
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for years now. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Flexible deployment model for running applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
Enhancements, Upgrades and Bug Fixes
This is a stability release to make Samza as an embedded library production ready. Samza as a library is part of Samza’s Flexible Deployment model. This release fixes a number of outstanding bugs and includes the following enhancements to existing features:
- SAMZA-1165 Cleanup data created by ZkStandalone in ZK
- SAMZA-1324 Add a metrics reporter lifecycle for JobCoordinator component of StreamProcessor
- SAMZA-1336 Standalone session expiration propagation
- SAMZA-1337 LocalApplicationRunner supports StreamTask
- SAMZA-1339 Add standalone integration tests
- SAMZA-1282 Fix killed leader process issue when spinning up more containers than the number of tasks
- SAMZA-1340 StreamProcessor does not propagate container failures from StreamTask
- SAMZA-1346 GroupByContainerCount.balance() should guard against null LocalityManager
- SAMZA-1347 GroupByContainerIds NPE if containerIds list is null
- SAMZA-1358 task.class empty string should be ignored when app.class is configured
- SAMZA-1361 OperatorImplGraph used wrong keys to store/retrieve OperatorImpl in the map
- SAMZA-1366 ScriptRunner should allow callers to control the child process environment
- SAMZA-1384 Race condition with async commit affects checkpoint correctness
- SAMZA-1385 Fix coordination issues during stream creation in LocalApplicationRunner
A source download of the 0.13.1 release is available here. The release JARs are also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features. Users running the 0.13.1 release with Scala 2.12 need a JDK version newer than 1.8.0_111.
Community Developments
We’ve made great community progress since the last release (0.13.0). We presented Samza high level API features at the Cloud+Data NEXT Conference 2017 held in Silicon Valley, USA, gave a talk on the key features (Secret Kung Fu) of Samza at ArchSummit 2017 in Shenzhen, China, and presented a detailed study of stateful stream processing at VLDB 2017. Here are the details of these conferences:
- July 15, 2017 - Unified Processing with the Samza High-level API (Cloud+Data NEXT Conference, Silicon Valley) (slides)
- July 7, 2017 - Secret Kung Fu of Massive Scale Stream Processing with Apache Samza - Xinyu Liu (ArchSummit, Shenzhen, 2017)
- Aug 28, 2017 - Samza: Stateful Scalable Stream Processing at LinkedIn - Kartik Paramasivam (ACM VLDB, Munich, 2017)
Going forward, we will continue to improve the new High Level API and the flexible deployment features. Here is the list of tasks for upcoming features and improvements.
Upcoming Samza Meetup
Don’t miss the upcoming meetup on September 12, 2017. Sign up now!
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs.
I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 10:54PM Aug 25, 2017
by navina in General |
Announcing the release of Apache Samza 0.13.0
We are very excited to announce the release of Apache Samza 0.13.0.
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for years now. Samza provides leading support for large-scale stateful stream processing with:
• First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
• Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
• A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
• A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
• Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
New features
The 0.13.0 release contains previews for the following highly anticipated features:
High Level API
With the new high level API you can express complex stream processing pipelines concisely in a few lines of code and accomplish what previously required multiple jobs. This new API facilitates common operations like re-partitioning, windowing, and joining streams. Check out some examples to see the high level API in action here.
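To give a feel for what a windowing operation computes, here is a minimal plain-Java sketch of a tumbling (fixed, non-overlapping) window count. It does not use the Samza high level API; the class `TumblingWindowSketch`, its `Event` type, and the `key@windowStart` result keys are hypothetical names chosen for illustration, and timestamps are assumed to be non-negative.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: counts events per key in fixed, non-overlapping time windows. */
final class TumblingWindowSketch {

    /** Hypothetical event type: a key plus an event timestamp in milliseconds. */
    static final class Event {
        final String key;
        final long timestampMillis;

        Event(String key, long timestampMillis) {
            this.key = key;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Returns counts keyed by "key@windowStart", where windowStart is aligned to windowMillis. */
    static Map<String, Long> countPerWindow(List<Event> events, long windowMillis) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event e : events) {
            // Align the timestamp down to the start of its window (assumes timestamp >= 0).
            long windowStart = (e.timestampMillis / windowMillis) * windowMillis;
            counts.merge(e.key + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }
}
```

A streaming framework performs the same grouping incrementally as messages arrive and emits a result when each window closes, rather than over a finished list as in this sketch.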
Flexible Deployment Model
Samza now provides flexibility for running your application in any hosting environment and with cluster managers other than YARN. Samza can now also be run as a lightweight stream processing library embedded inside your application. Your processes can coordinate task distribution amongst themselves using ZooKeeper or static partition assignments out of the box.
See more details and code examples here.
Enhancements, Upgrades and Bug Fixes
This release also includes the following enhancements to existing features:
- SAMZA-871 adds a heart-beat mechanism between JobCoordinator and all running containers to prevent orphaned containers.
- SAMZA-1140 enables non-blocking commit in the AsyncRunloop.
- SAMZA-1143 adds configurations for localizing general resources in YARN.
- SAMZA-1145 provides the ability to configure the default number of changelog replicas.
- SAMZA-1154 adds a tasks endpoint to samza-rest to get information about all tasks in a job.
- SAMZA-1158 adds a samza-rest monitor to clean up stale local stores from completed containers.
This release also includes several bug-fixes and improvements for operational stability. Some notable ones are:
- SAMZA-1083 prevents loading task stores that are older than delete tombstones during container startup.
- SAMZA-1100 fixes an exception when using an empty stream as both bootstrap and broadcast.
- SAMZA-1112 fixes BrokerProxy to log fatal errors.
- SAMZA-1121 fixes StreamAppender so that it doesn't propagate exceptions to the caller.
- SAMZA-1157 fixes logging for serialization/deserialization errors.
We've also upgraded the following dependency versions:
- Samza now supports Scala 2.12.
- Kafka version to 0.10.1.1.
- Elasticsearch version to 2.2.0
Community Developments
We've made great community progress since the previous release. We showcased how Samza is powering stream processing at LinkedIn in Kafka Summit 2017 and O’Reilly Strata 2017. We also presented Samza use cases and case studies from several large companies in ApacheCon Big Data, 2017. In addition, the Samza talk in LinkedIn's Stream Processing Meetup in Sunnyvale was well-received with over 200 attendees. Here are links to some of these events:
- March 15, 2017 - Processing millions of events per second without breaking the bank - Kartik Paramasivam (Video)
- May 8, 2017 - Data Processing at LinkedIn with Apache Kafka and Apache Samza (Kafka Summit NYC 2017) (Slides)
- May 16, 2017 - What it takes to process a trillion events a day? Case studies in scaling stream processing at LinkedIn - Jagadish Venkatraman (ApacheCon Big Data '17) (Slides)
- May 16, 2017 - The continuing story of Batching to Streaming analytics at Optimizely, Michael Borsuk (ApacheCon Big Data’17) (Slides)
- May 24, 2017 - Managed or stand alone, streaming or batch; Unified processing with the Samza Fluent API - Yi Pan (LinkedIn Stream Processing Meetup) (Slides)
- May 25, 2017 - How companies are using Apache Samza - Jagadish Venkatraman (Apache Con podcast)
Future:
We'll continue improving the new High Level API and flexible deployment features with your feedback.
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 06:09PM Jun 09, 2017
by nickpan47 in General |
Announcing the release of Apache Samza 0.12.0
We are excited to announce that Apache Samza 0.12.0 has been released.
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for a few years now. Samza provides leading support for large-scale stateful stream processing with features such as:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single SSD based machine.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- Minimal impact during application maintenance.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and outputs (HDFS, Kafka, ElastiCache etc.). This allows applications to directly process data from various event sources without mandating that the data should be moved into Kafka.
- A fully async programming model. This allows applications that make remote calls to increase parallelism very efficiently.
- Features like canaries, upgrades and rollbacks that support extremely large deployments.
Convergence of Batch and Real-time processing in Samza:
End of Stream support: Samza has always supported streaming input sources like Kafka. In such sources, it is assumed that the incoming stream of data is infinite. Samza will now have an ‘end-of-stream’ notion to support consuming from input sources that are finite (for example, on-disk files). This enables the Samza job to shut-down gracefully when it has finished consuming all data.
HDFS Consumer: Samza now provides first-class support for consuming data from HDFS files. This enables developers to define their processing logic once, and run it in both batch and streaming environments. This feature also allows for rapid experimentation with ETL’d HDFS data using Samza without the need to write a separate Hadoop job. (SAMZA-967)
Checkpoint Notifications:
Samza can now notify the SystemConsumer when performing a checkpoint. This can enable Samza to support consumers such as: Amazon Kinesis, Amazon SQS, Azure ServiceBus Queues/Topics, Google Cloud Pub-Sub, ActiveMQ, etc., which each manage checkpointing on their own. This also enables consumers to implement smart retention policies (such as deleting data once it has been consumed). (SAMZA-1042)
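As a rough illustration of the retention idea described above, the sketch below shows a toy consumer that buffers messages per partition and discards them once it is told they have been checkpointed. The `CheckpointObserver` interface, the `RetainingConsumer` class, and the use of `String` partition names with `long` offsets are all assumptions made for this example; they are not Samza's actual types or signatures.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical callback: invoked with the offsets that were just checkpointed. */
interface CheckpointObserver {
    void onCheckpoint(Map<String, Long> offsetsByPartition);
}

/** Toy consumer that keeps messages only until they have been checkpointed. */
final class RetainingConsumer implements CheckpointObserver {
    private final Map<String, NavigableMap<Long, String>> buffered = new TreeMap<>();

    /** Buffers a message under its partition and offset. */
    void receive(String partition, long offset, String message) {
        buffered.computeIfAbsent(partition, p -> new TreeMap<>()).put(offset, message);
    }

    @Override
    public void onCheckpoint(Map<String, Long> offsetsByPartition) {
        offsetsByPartition.forEach((partition, offset) -> {
            NavigableMap<Long, String> messages = buffered.get(partition);
            if (messages != null) {
                // Everything at or below the checkpointed offset is safe to delete.
                messages.headMap(offset, true).clear();
            }
        });
    }

    /** Number of messages still retained for a partition. */
    int retained(String partition) {
        NavigableMap<Long, String> messages = buffered.get(partition);
        return messages == null ? 0 : messages.size();
    }
}
```

A consumer for a system that manages its own checkpoints could use the same hook to acknowledge messages to that system instead of deleting them locally.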
Support for Yarn Node Labels:
Often Samza YARN clusters have machines that are not homogeneous. For example, nodes could have different memory hardware, CPUs, spinning disks or SSDs. With this feature, users can assign “labels” to nodes in their YARN cluster and use them to specify where their Samza job should run. This feature allows flexibility in scheduling jobs based on trade-offs in resource requirements, performance and hardware costs. For example, stateful jobs can be configured to run on nodes with SSDs while stateless jobs can be configured to run on nodes with spinning disks. (SAMZA-1013)
Bug fixes:
This release also includes several critical bug-fixes and improvements for operational stability. Some notable ones include:
- HttpFileSystem timeout for blocking reads when localizing containers (SAMZA-1079).
- SamzaContainer should catch all Throwables instead of only exceptions (SAMZA-1077).
- Deadlock between KafkaSystemProducer and KafkaProducer from kafka-clients lib (SAMZA-1069).
- Change the commit order to support at least once processing when deduping with local store (SAMZA-1065).
- Upgraded Kafka version to 0.10. This enables us to take advantage of the critical fixes and improvements in Kafka.
- Upgraded to Jetty 9 from Jetty 8.
- Full support for Scala 2.11. All Samza jars will now have the scala version as 2.11 as a part of their file name. For example, samza-yarn_2.11-0.12.jar.
- Samza is now source compatible with JDK 8 and above. Older JDKs are no longer supported.
We made great community progress since the last release. We had two successful meetups where we presented Samza’s roadmap, and how Optimizely uses Samza. Several Samza use-cases in Uber and LinkedIn were featured in QCon 2016.
- Conferences and talks:
- QCon November 2016 : Scaling up Near real-time Analytics
- Samza meetup Nov 2016: Apache Samza: Past, Present, and Future
- Samza meetup Feb 2017: Batch to Streaming analytics at Optimizely
- Samza meetup Feb 2017: Async processing and multi-threading in Samza
- The entire list of links to other presentations can be found here
- Blogs:
There are a lot of exciting features to expect in our future release. Here are some highlights:
- Support for Disk quota enforcement and throttling (SAMZA-956)
- Support for high-level programming API for stream processing (SAMZA-1073)
- Support for running Samza in stand-alone mode (SAMZA-516)
Posted at 09:29PM Feb 22, 2017
by jagadish in General |
Announcing the release of Apache Samza 0.11.0
We are excited to announce that Apache Samza 0.11.0 has been released.
Samza is a stable and mature stream processing framework that has been powering real-time applications in production at various companies for a few years now. Samza has industry-leading support for stateful stream processing, with cutting-edge features like:
- Support for RocksDB based local state.
- Incremental state checkpointing: This feature is unique compared to existing stream processing frameworks and allows Samza to support applications with large state very elegantly.
- Minimal impact during application upgrades by minimizing state movement.
The 0.11.0 release packs in several large improvements in runtime performance, operational stability and ease of use. Some of the key highlights include:
- Asynchronous API and processing (SAMZA-863, doc): Prior to this release, Samza only supported a synchronous single-threaded processing model. Increasing the number of containers (processes) to increase parallelism required a lot more memory resources. This inefficiency was more obvious for applications that make remote calls to external services/databases. With this new feature an application can increase parallelism very efficiently within a single container (process). In addition to a parallel processing model we now also support a purely asynchronous processing model which makes it a lot more efficient to perform remote I/O. Without this support for an async processing model, Samza applications that wanted to process messages asynchronously also had to handle the additional complexity of managing checkpointing themselves (by disabling auto-checkpointing in Samza). With the new support for async processing, Samza will make sure that checkpointing is automatically performed only after the async calls have completed.
- Separate Samza framework deployment from user jobs (SAMZA-849, doc): Typically in a large organization the team that manages the Samza cluster is not the same as the teams that are running applications on top of Samza. This feature allows upgrading the Samza framework without forcing developers to explicitly upgrade their running applications. With simple config changes, it supports canary, upgrade and rollback scenarios commonly required in organizations that run tens or hundreds of jobs.
- Samza Rest API (SAMZA-865, doc): The REST API provides a rich set of operations for the users to interact with their running jobs. Samza REST API allows you to start, stop and list jobs, and also run periodic monitoring scripts. This API can be integrated with deployment tooling and job dashboard for better job management.
- Disk monitoring (SAMZA-924): A Samza YARN cluster is used to run several stream processing applications on a shared set of physical machines. In such a multi-tenant environment it is critical to have some limits on the amount of disk space used by each job, especially to store application state. This feature introduces the measurement of the disk usage for selected job directories. The disk space usage will be gathered periodically and reported to Samza metrics. In the next release this feature will be extended to also enforce the disk quotas.
- New metrics to troubleshoot and monitor performance issues: SAMZA-972 added holistic monitoring of memory in Samza applications. With SAMZA-963 we added the ability to troubleshoot performance issues better by isolating the time spent in the application from the time spent in accessing state.
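The asynchronous processing highlight above notes that checkpointing happens only after async calls have completed. A minimal sketch of that bookkeeping, written in plain Java and not taken from Samza's implementation, might look like the following. The class `AsyncCommitSketch` and its methods are hypothetical, and the sketch assumes offsets are consecutive `long` values dispatched in increasing order.

```java
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;

/** Illustration only: tracks in-flight async work and reports the offset that is safe to checkpoint. */
final class AsyncCommitSketch {
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private long highestDispatched = -1L;

    /** Marks the offset as in flight; it is cleared when the async call completes successfully. */
    synchronized CompletableFuture<Void> dispatch(long offset, CompletableFuture<?> remoteCall) {
        inFlight.add(offset);
        highestDispatched = Math.max(highestDispatched, offset);
        // A call that fails never clears its offset, so the checkpoint cannot move past it.
        return remoteCall.thenRun(() -> complete(offset));
    }

    private synchronized void complete(long offset) {
        inFlight.remove(offset);
    }

    /** Highest offset such that no message at or below it is still in flight; -1 if none. */
    synchronized long safeCheckpointOffset() {
        return inFlight.isEmpty() ? highestDispatched : inFlight.first() - 1;
    }
}
```

The point of the sketch is that a later message finishing first does not advance the checkpoint: the safe offset stays behind the oldest call that is still outstanding, which is the kind of bookkeeping applications previously had to do by hand.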
A source download of the 0.11.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.
Project Status
A total of 62 contributors have contributed to the Samza Project so far. In this release 21,473 lines of code were added/changed.
With this release we also add 3 new committers to the Apache Samza community.
Recent Community Activities
There has been a lot of activity in the community during this release time frame. Here are links to some of it.
- Conferences:
- Stream processing Meetup @ LinkedIn
- Detailed list of links to other presentations can be found here
- Blogs:
Contribute!
There are a lot more exciting features to expect in our future release. Some of them are:
- Samza operators API (SAMZA-914)
- HDFS system consumer (SAMZA-967)
- Support for standalone Samza jobs (SAMZA-516)
- Disk quotas enforcement (SAMZA-956)
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 08:20PM Oct 24, 2016
by xinyu in General |
Announcing the release of Apache Samza 0.10.1
I am excited to announce that Apache Samza 0.10.1 has been released. This is our fourth release as an Apache Top-level Project!
Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It was originally created at LinkedIn and continues to be used in production. The project is under active development with contributions from a diverse group of contributors and committers. Samza is also used in production by many companies in the industry (such as Netflix, Uber, TripAdvisor, etc.; see PoweredBy).
A source download of the 0.10.1 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.
Overall, 72 JIRAs were resolved in this release. This is a minor release consisting of bug fixes and robustness improvements to features like coordinator stream, host-affinity, etc. Samza continues to require Java 1.7+ and Yarn 2.6.1+.
A few notable enhancements are:
- Support static partition assignment in ProcessJobFactory (SAMZA-41)
- Slow start of Samza jobs with large number of containers (SAMZA-843)
- Change log not working properly with In memory Store (SAMZA-889)
- Refactor and fix Container allocation logic (SAMZA-866)
- Detect partition count changes in input streams (SAMZA-882)
- Host Affinity - State restore doesn't work if the previous shutdown was uncontrolled (continuous offset) (SAMZA-905)
- Broadcast stream is not added properly in the prioritized tiers in the DefaultChooser (SAMZA-944)
- Improve the performance of the continuous OFFSET checkpointing for logged stores (SAMZA-964)
- Host Affinity - Minimize task reassignment when container count changes (SAMZA-906)
- Improve event loop timing metrics (SAMZA-951)
- Avoid unnecessary flushes in CachedStore (SAMZA-873)
- Incompatible change in Kafka producer that does not honor custom partitioners (SAMZA-839)
We've also made a lot of community progress during this release:
- We had 2 successful meetups - one in February and the other in June. The upcoming meetup is scheduled for August 23.
- Apache Samza was presented at the Apache Big Data (North America) conference in May 2016 and at the Hadoop Summit in June 2016. Check out the content here.
- Samza paper/workshop was also accepted at notable academic conferences:
- SamzaSQL: Scalable Fast Data Management with Streaming SQL presented at IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) in May 2016
- Effective Multi-stream Joining in Apache Samza Framework in 5th IEEE International Congress on Big Data, June 27 - July 2, 2016, San Francisco, USA
- 380 emails sent to the developer mailing list in past 3 months
- Support multi-threading in samza tasks (SAMZA-863)
- Disk Quotas: Add throttler and disk quota enforcement (SAMZA-956)
- REST API for starting and stopping Samza jobs (SAMZA-865)
- Samza standalone mode (SAMZA-516)
- High-level language for Samza (SAMZA-390)
It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs. Also, don’t miss the upcoming meetup on August 23. Sign up now!
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 12:30AM Aug 10, 2016
by navina in General |