Entries tagged [samza]
Announcing the release of Apache Samza 0.14.1
We are very excited to announce the release of Apache Samza 0.14.1.
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber, Slack, Redfin, TripAdvisor, etc) for years now. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
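To illustrate the difference between incremental checkpointing and full snapshots described in the list above, here is a minimal conceptual sketch in plain Python. It is not Samza code: the `ChangeloggedStore` class and the `restore` function are hypothetical names, and the sketch assumes a design in which each checkpoint appends only the writes made since the previous checkpoint to a changelog, so the checkpoint cost depends on the number of recent changes rather than on the total state size.

```python
from typing import Dict, List, Optional, Tuple

# (key, new value); a value of None marks a delete. Hypothetical record shape.
Change = Tuple[str, Optional[int]]


class ChangeloggedStore:
    """Hypothetical key-value store that records every write for a changelog."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.pending: List[Change] = []  # writes made since the last checkpoint

    def put(self, key: str, value: int) -> None:
        self.data[key] = value
        self.pending.append((key, value))

    def delete(self, key: str) -> None:
        self.data.pop(key, None)
        self.pending.append((key, None))

    def checkpoint(self, changelog: List[Change]) -> int:
        """Append only the pending changes and return how many were written."""
        written = len(self.pending)
        changelog.extend(self.pending)
        self.pending.clear()
        return written


def restore(changelog: List[Change]) -> Dict[str, int]:
    """Rebuild the store contents by replaying the changelog in order."""
    data: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            data.pop(key, None)
        else:
            data[key] = value
    return data
```

In this sketch a store holding millions of keys still writes only a handful of records per checkpoint if only a handful of keys changed, which is the property that lets the approach scale to very large state.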
Enhancements, Upgrades and Bug Fixes
This is a minor release which contains improvements across multiple areas. In particular:
- SQL
- SAMZA-1681 Add support for handling older record schema versions in AvroRelConverter
- SAMZA-1671 Add insert into table support
- SAMZA-1651 Implement GROUP BY SQL operator
- Standalone
- SAMZA-1689 Add validations before state transitions in ZkBarrierForVersionUpgrade
- SAMZA-1686 Set finite operation timeout when creating zkClient
- SAMZA-1667 Skip storing configuration as a part of JobModel in zookeeper data nodes
- SAMZA-1647 Fix NPE in JobModelExpired event handler
- Eventhub
- SAMZA-1688 Use per partition eventhubs client
- SAMZA-1676 Miscellaneous fix and improvement for eventhubs system
- SAMZA-1656 EventHubSystemAdmin does not fetch metadata for valid streams
- Host-affinity
- SAMZA-1687 Prioritize preferred host requests over ANY-HOST requests
- SAMZA-1649 Improve host-aware allocation to account for strict locality
Overall, 51 JIRAs were resolved in this release. A source download of the 0.14.1 release is available here. The release JARs are also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features. Users running the 0.14.1 release with Scala 2.12 need a JDK version newer than 1.8.0_111.
Community Developments
On March 21st, we held the Stream Processing with Apache Kafka & Apache Samza meetup, which included the following Samza presentations:
- Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza (Slides)
- Building Venice with Apache Kafka & Samza
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 11:03PM May 25, 2018
by xinyu in General
Announcing the release of Apache Samza 0.14.0
We are very excited to announce the release of Apache Samza 0.14.0.
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber, Slack, Redfin, TripAdvisor, etc) for years now. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
New Features, Upgrades and Bug Fixes
The 0.14.0 release contains the following highly anticipated features:
- Samza SQL
- Azure EventHubs producer, consumer and checkpoint provider
- AWS Kinesis consumer
Overall, 65 JIRAs were resolved in this release. For more details about this release, please check out the release notes.
Community Developments
We’ve made great community progress since the last release (0.13.1). We presented unified data processing with Samza at the 2017 Big Data conference held in Spain and at the Dataworks Summit in Sydney, and gave a demo at the @scale conference in San Jose. Here are the details of these conferences:
- Nov 17, 2017 - Unified Stream Processing at Scale with Apache Samza (BigDataSpain 2017) (Slides)
- Sept 21, 2017 - Unified Batch & Stream Processing with Apache Samza (Dataworks Summit Sydney 2017) (Slides)
- Aug 31, 2017 - Demo of Stream Processing@LinkedIn (@scale conference 2017) (Slides)
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 05:38PM Jan 05, 2018
by xinyu in General
Announcing the release of Apache Samza 0.13.1
We are very excited to announce the release of Apache Samza 0.13.1.
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for years now. Samza provides leading support for large-scale stateful stream processing with:
- First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
- Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
- A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
- A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
- High level API for expressing complex stream processing pipelines in a few lines of code.
- Flexible deployment model for running the applications in any hosting environment and with cluster managers other than YARN.
- Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
Enhancements, Upgrades and Bug Fixes
This is a stability release to make Samza as an embedded library production ready. Samza as a library is part of Samza’s Flexible Deployment model. The release fixes a number of outstanding bugs and includes the following enhancements to existing features:
- SAMZA-1165 Cleanup data created by ZkStandalone in ZK
- SAMZA-1324 Add a metrics reporter lifecycle for JobCoordinator component of StreamProcessor
- SAMZA-1336 Standalone session expiration propagation
- SAMZA-1337 LocalApplicationRunner supports StreamTask
- SAMZA-1339 Add standalone integration tests
- SAMZA-1282 Fix killed leader process issue when spinning up more containers than the number of tasks kills leader
- SAMZA-1340 StreamProcessor does not propagate container failures from StreamTask
- SAMZA-1346 GroupByContainerCount.balance() should guard against null LocalityManager
- SAMZA-1347 GroupByContainerIds NPE if containerIds list is null
- SAMZA-1358 task.class empty string should be ignored when app.class is configured
- SAMZA-1361 OperatorImplGraph used wrong keys to store/retrieve OperatorImpl in the map
- SAMZA-1366 ScriptRunner should allow callers to control the child process environment
- SAMZA-1384 Race condition with async commit affects checkpoint correctness
- SAMZA-1385 Fix coordination issues during stream creation in LocalApplicationRunner
A source download of the 0.13.1 release is available here. The release JARs are also available in Apache’s Maven repository. See Samza’s download page for details and Samza’s feature preview for new features. Users running the 0.13.1 release with Scala 2.12 need a JDK version newer than 1.8.0_111.
Community Developments
We’ve made great community progress since the last release (0.13.0). We presented Samza high level API features at the Cloud+Data NEXT Conference 2017 held in Silicon Valley, USA, and gave a talk on the key features (Secret Kung Fu) of Samza at ArchSummit 2017 in Shenzhen, China, along with a detailed study of stateful stream processing at VLDB 2017. Here are the details of these conferences:
- July 15, 2017 - Unified Processing with the Samza High-level API (Cloud+Data NEXT Conference, Silicon Valley) (slides)
- July 7, 2017 - Secret Kung Fu of Massive Scale Stream Processing with Apache Samza - Xinyu Liu (ArchSummit, Shenzhen, 2017)
- Aug 28, 2017 - Samza: Stateful Scalable Stream Processing at LinkedIn - Kartik Paramasivam (ACM VLDB, Munich, 2017)
For future development, we are continuing to work on improving the new High Level API and flexible deployment features. Here is the list of the tasks for upcoming features and improvements.
Upcoming Samza Meetup
Don’t miss the upcoming meetup on September 12, 2017. Sign up now!
Contribute
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I’d like to close by thanking everyone who’s been involved in the project. It’s been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 10:54PM Aug 25, 2017
by navina in General
Announcing the release of Apache Samza 0.13.0
We are very excited to announce the release of Apache Samza 0.13.0.
Samza has been powering real-time applications in production across several large companies (including LinkedIn, Netflix, Uber) for years now. Samza provides leading support for large-scale stateful stream processing with:
• First class support for local state (with RocksDB store). This allows a stateful application to scale up to 1.1 Million events/sec on a single machine with SSD.
• Support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
• A fully pluggable model for input sources (e.g. Kafka, Kinesis, DynamoDB streams etc.) and output systems (HDFS, Kafka, ElastiCache etc.).
• A fully asynchronous programming model that makes parallelizing remote calls efficient and effortless.
• Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.
New features
The 0.13.0 release contains previews for the following highly anticipated features:
High Level API
With the new high level API you can express complex stream processing pipelines concisely in a few lines of code and accomplish what previously required multiple jobs. This new API facilitates common operations like re-partitioning, windowing, and joining streams. Check out some examples to see the high level API in action here.
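To make two of the operations named above more concrete, here is a minimal conceptual sketch of re-partitioning and tumbling-window counting in plain Python. This is not the Samza high level API: `partition_for` and `tumbling_window_counts` are hypothetical helper names, events are assumed to be `(timestamp_ms, key)` pairs, and CRC32 is used only as a stand-in for whatever stable hash a real system applies.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str]  # (timestamp in milliseconds, key); assumed shape


def partition_for(key: str, num_partitions: int) -> int:
    """Re-partitioning: route a key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def tumbling_window_counts(
    events: Iterable[Event], window_ms: int
) -> Dict[Tuple[int, str], int]:
    """Windowing: count events per key in fixed, non-overlapping windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp_ms, key in events:
        # Every timestamp falls into exactly one window, identified by its start.
        window_start = (timestamp_ms // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)
```

Re-partitioning by key first means that all events for a given key land in the same partition, so a per-key window count like the one above can be computed locally by whichever task owns that partition.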
Flexible Deployment Model
Samza now provides flexibility for running your application in any hosting environment and with cluster managers other than YARN. Samza can now also be run as a lightweight stream processing library embedded inside your application. Your processes can coordinate task distribution amongst themselves using ZooKeeper or static partition assignments out of the box.
See more details and code examples here.
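As a rough picture of what a static partition assignment could look like, the sketch below spreads partitions across a fixed set of processors in round-robin order. It is an illustration only, under the assumption that every process knows the full list of processor IDs up front; `assign_partitions` is a hypothetical helper, and Samza's actual grouping strategy may differ.

```python
from typing import Dict, List


def assign_partitions(
    processor_ids: List[str], num_partitions: int
) -> Dict[str, List[int]]:
    """Deterministically spread partitions over processors, round-robin."""
    if not processor_ids:
        raise ValueError("at least one processor is required")
    # Sorting makes the result independent of the order the IDs were supplied,
    # so every process computes the same assignment without coordination.
    ordered = sorted(processor_ids)
    assignment: Dict[str, List[int]] = {pid: [] for pid in ordered}
    for partition in range(num_partitions):
        owner = ordered[partition % len(ordered)]
        assignment[owner].append(partition)
    return assignment
```

Because the function is deterministic, each embedded process can run it locally and arrive at the same answer, which is what makes a static scheme workable without a coordination service. The trade-off is that adding or removing a processor requires every process to be given the new list.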
Enhancements, Upgrades and Bug Fixes
This release also includes the following enhancements to existing features:
- SAMZA-871 adds a heart-beat mechanism between JobCoordinator and all running containers to prevent orphaned containers.
- SAMZA-1140 enables non-blocking commit in the AsyncRunloop.
- SAMZA-1143 adds configurations for localizing general resources in YARN.
- SAMZA-1145 provides the ability to configure the default number of changelog replicas.
- SAMZA-1154 adds a tasks endpoint to samza-rest to get information about all tasks in a job.
- SAMZA-1158 adds a samza-rest monitor to clean up stale local stores from completed containers.
This release also includes several bug-fixes and improvements for operational stability. Some notable ones are:
- SAMZA-1083 prevents loading task stores that are older than delete tombstones during container startup.
- SAMZA-1100 fixes an exception when using an empty stream as both bootstrap and broadcast.
- SAMZA-1112 fixes BrokerProxy to log fatal errors.
- SAMZA-1121 fixes StreamAppender so that it doesn't propagate exceptions to the caller.
- SAMZA-1157 fixes logging for serialization/deserialization errors.
We've also upgraded the following dependency versions:
- Samza now supports Scala 2.12.
- Kafka version to 0.10.1.1.
- Elasticsearch version to 2.2.0
Community Developments
We've made great community progress since the previous release. We showcased how Samza is powering stream processing at LinkedIn at Kafka Summit 2017 and O’Reilly Strata 2017. We also presented Samza use cases and case studies from several large companies at ApacheCon Big Data 2017. In addition, the Samza talk at LinkedIn's Stream Processing Meetup in Sunnyvale was well received, with over 200 attendees. Here are links to some of these events:
- March 15, 2017 - Processing millions of events per second without breaking the bank - Kartik Paramasivam (Video)
- May 8, 2017 - Data Processing at LinkedIn with Apache Kafka and Apache Samza (Kafka Summit NYC 2017) (Slides)
- May 16, 2017 - What it takes to process a trillion events a day? Case studies in scaling stream processing at LinkedIn - Jagadish Venkatraman (ApacheCon Big Data '17) (Slides)
- May 16, 2017 - The continuing story of Batching to Streaming analytics at Optimizely, Michael Borsuk (ApacheCon Big Data’17) (Slides)
- May 24, 2017 - Managed or stand alone, streaming or batch; Unified processing with the Samza Fluent API - Yi Pan (LinkedIn Stream Processing Meetup) (Slides)
- May 25, 2017 - How companies are using Apache Samza - Jagadish Venkatraman (Apache Con podcast)
Future:
We'll continue improving the new High Level API and flexible deployment features with your feedback.
It’s a great time to get involved. You can start by reviewing the tutorials, signing up for the mailing list, and grabbing some newbie JIRAs. I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 06:09PM Jun 09, 2017
by nickpan47 in General
Announcing the release of Apache Samza 0.11.0
We are excited to announce that the Apache Samza 0.11.0 has been released.
Samza is a stable and mature stream processing framework that has been powering real-time applications in production across various companies for a few years now. Samza has industry-leading support for stateful stream processing with cutting-edge features like:
- Support for RocksDB based local state.
- Incremental state checkpointing: This feature is unique compared to existing stream processing frameworks and allows Samza to support applications with large state very elegantly.
- Minimal impact during application upgrades by minimizing state movement.
The 0.11.0 release packs in several large improvements in runtime performance, operational stability and ease of use. Some of the key highlights include:
- Asynchronous API and processing (SAMZA-863, doc): Prior to this release, Samza only supported a synchronous single-threaded processing model. Increasing the number of containers (processes) to increase parallelism required a lot more memory resources. This inefficiency was more obvious for applications that make remote calls to external services/databases. With this new feature an application can increase parallelism very efficiently within a single container (process). In addition to a parallel processing model, we now also support a purely asynchronous processing model, which makes it a lot more efficient to perform remote I/O. Without this support for an async processing model, Samza applications that wanted to process messages asynchronously also had to handle the additional complexity of managing checkpointing themselves (by disabling auto-checkpointing in Samza). With the new support for async processing, Samza will make sure that checkpointing is automatically performed only after the async calls have completed.
- Separate Samza framework deployment from user jobs (SAMZA-849, doc): Typically in a large organization the team that manages the Samza cluster is not the same as the teams that are running applications on top of Samza. This feature allows upgrading the Samza framework without forcing developers to explicitly upgrade their running applications. With simple config changes, it supports canary, upgrade and rollback scenarios commonly required in organizations that run tens or hundreds of jobs.
- Samza Rest API (SAMZA-865, doc): The REST API provides a rich set of operations for the users to interact with their running jobs. Samza REST API allows you to start, stop and list jobs, and also run periodic monitoring scripts. This API can be integrated with deployment tooling and job dashboard for better job management.
- Disk monitoring (SAMZA-924): A Samza YARN cluster is used to run several stream processing applications on a shared set of physical machines. In such a multi-tenant environment it is critical to have some limits on the amount of disk space used by each job, especially to store application state. This feature introduces the measurement of the disk usage for selected job directories. The disk space usage will be gathered periodically and reported to Samza metrics. In the next release this feature will be extended to also enforce the disk quotas.
- New metrics to troubleshoot and monitor performance issues: SAMZA-972 added holistic monitoring of memory in Samza applications. With SAMZA-963 we added the ability to troubleshoot performance issues better by isolating the time spent in the application from the time spent in accessing state.
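The checkpointing guarantee described in the asynchronous processing item above can be sketched in a few lines of plain Python with `asyncio`. This is a conceptual illustration, not Samza's implementation or API: `process_batch`, the `handler` callback and the `commit` callback are hypothetical names, and messages are assumed to be `(offset, payload)` pairs.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Message = Tuple[int, str]  # (offset, payload); assumed shape


async def process_batch(
    messages: List[Message],
    handler: Callable[[Message], Awaitable[None]],
    commit: Callable[[int], None],
) -> None:
    """Run the handler for all messages concurrently, then checkpoint once."""
    if not messages:
        return
    # Start every remote call without waiting for the previous one to finish.
    await asyncio.gather(*(handler(message) for message in messages))
    # Reached only after every in-flight call has completed without raising,
    # so the committed offset never runs ahead of unfinished work.
    commit(max(offset for offset, _ in messages))
```

If any handler raises, `asyncio.gather` propagates the exception and the commit is skipped, so a restart would reprocess the batch rather than lose it. That is the bookkeeping the paragraph says applications previously had to manage themselves.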
A source download of the 0.11.0 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.
Project Status
A total of 62 contributors have contributed to the Samza Project so far. In this release 21,473 lines of code were added/changed.
With this release we also add 3 new committers to the Apache Samza community.
Recent Community Activities
There has been a lot of activity from the community during this release time frame. Here are links to some of it.
- Conferences:
- Stream processing Meetup @ LinkedIn
- Detailed list of links to other presentations can be found here
- Blogs:
Contribute!
There are a lot more exciting features to expect in our future releases. Some of them are:
- Samza operators API (SAMZA-914)
- HDFS system consumer (SAMZA-967)
- Support for standalone Samza jobs (SAMZA-516)
- Disk quotas enforcement (SAMZA-956)
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 08:20PM Oct 24, 2016
by xinyu in General
Announcing the release of Apache Samza 0.10.1
I am excited to announce that the Apache Samza 0.10.1 has been released. This is our fourth release as an Apache Top-level Project!
Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. It was originally created at LinkedIn and continues to be used in production there. The project is under active development with contributions from a diverse group of contributors and committers. Samza is also used in production by many other companies in the industry (such as Netflix, Uber, and TripAdvisor; see PoweredBy).
A source download of the 0.10.1 release is available here. The release JARs are also available in Apache's Maven repository. See Samza's download page for details.
Overall, 72 JIRAs were resolved in this release. This is a minor release consisting of bug fixes and robustness improvements to features like coordinator stream, host-affinity etc. Samza continues to require Java 1.7+ and YARN 2.6.1+.
A few notable enhancements are:
- Support static partition assignment in ProcessJobFactory (SAMZA-41)
- Slow start of Samza jobs with large number of containers (SAMZA-843)
- Change log not working properly with In memory Store (SAMZA-889)
- Refactor and fix Container allocation logic (SAMZA-866)
- Detect partition count changes in input streams (SAMZA-882)
- Host Affinity - State restore doesn't work if the previous shutdown was uncontrolled (continuous offset) (SAMZA-905)
- Broadcast stream is not added properly in the prioritized tiers in the DefaultChooser (SAMZA-944)
- Improve the performance of the continuous OFFSET checkpointing for logged stores (SAMZA-964)
- Host Affinity - Minimize task reassignment when container count changes (SAMZA-906)
- Improve event loop timing metrics (SAMZA-951)
- Avoid unnecessary flushes in CachedStore (SAMZA-873)
- Incompatible change in Kafka producer that does not honor custom partitioners (SAMZA-839)
We've also made a lot of community progress during this release:
- We had 2 successful meetups - one in February and the other in June. The upcoming meetup is scheduled for August 23.
- Apache Samza was presented at the Apache Big Data (North America) conference in May 2016 and at the Hadoop Summit in June 2016. Check out the content here.
- Samza papers were also accepted at notable academic conferences:
- SamzaSQL: Scalable Fast Data Management with Streaming SQL presented at IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) in May 2016
- Effective Multi-stream Joining in Apache Samza Framework in 5th IEEE International Congress on Big Data, June 27 - July 2, 2016, San Francisco, USA
- 380 emails sent to the developer mailing list in past 3 months
Features in progress for upcoming releases include:
- Support multi-threading in samza tasks (SAMZA-863)
- Disk Quotas: Add throttler and disk quota enforcement (SAMZA-956)
- REST API for starting and stopping Samza jobs (SAMZA-865)
- Samza standalone mode (SAMZA-516)
- High-level language for Samza (SAMZA-390)
It’s a great time to get involved. You can start by running through the hello-samza tutorial, signing up for the mailing list, and grabbing some newbie JIRAs. Also, don’t miss the upcoming meetup on August 23. Sign up now!
I'd like to close by thanking everyone who's been involved in the project. It's been a great experience to be involved in this community, and I look forward to its continued growth.
Posted at 12:30AM Aug 10, 2016
by navina in General