Apex

Thursday December 08, 2016

Announcing Apache Apex Malhar 3.6.0 Release

The Apache Apex community is pleased to announce release 3.6.0 of the Malhar library. The release resolved 70 JIRAs.

The release adds first iteration of SQL support via Apache Calcite. Supported features include SELECT, INSERT, INNER JOIN with non-empty equi join condition, WHERE clause, SCALAR functions that are implemented in Calcite, custom scalar functions. Endpoint can be file, Kafka or internal streaming port for both input and output. CSV format is implemented for both input and output. See examples for usage of the new API.

The windowed state management has been improved (WindowedOperator). There is now an option to use spillable data structures for the state storage. This enables the operator to store large states and perform efficient checkpointing.

We also did benchmarking on WindowedOperator with the spillable data structures. From the result of our findings, we improved greatly how objects are serialized and reduced garbage collection considerably in the Managed State layer. Work is still in progress for purging state that is not needed any more and further improving the performance of Managed State that the spillable data structures depend on. More information about the windowing support can be found here.

This release also adds a new, alternative Apache Cassandra output operator (non-transactional, upsert based) and support for fixed length file format to the enrichment operator.

The user documentation has been expanded to cover more operators. See https://s.apache.org/9b0t for other enhancements and fixes in this release.

Apache Apex is an enterprise grade native YARN big data-in-motion platform that unifies stream and batch processing. Apex was built for scalability and low-latency processing, high availability and operability.

Apex provides unique features that similar platforms currently don't offer, such as fine grained, incremental recovery to only reset the portion of a topology that is affected by a failure, support for elastic scaling based on the ability to acquire (and release) resources as needed as well as the ability to alter topology and operator properties on running applications.

Apex has been developed since 2012 and became ASF top level project earlier this year, following 8 months of incubation. Apex early on brought the combination of high throughput, low latency and fault tolerance with strong processing guarantees to the stream data processing space and gained maturity through important production use cases at several organizations. See the powered by page and resources on the project web site for more information.

The Apex engine is supplemented by Malhar, the library of pre-built operators, including adapters that integrate with many existing technologies as sources and destinations, like message buses, databases, files or social media feeds.

An easy way to get started with Apex is to pick one of the examples as starting point. They cover many common and recurring tasks, such as data consumption from different sources, output to various sinks, partitioning and fault tolerance.

Apex Malhar and Core (the engine) are separate repositories and releases. We expect more frequent releases of Malhar to roll out new connectors and other operators based on a stable engine API. This release 3.6.0 works on existing Apex Core 3.4 installations. Users only need to upgrade the Maven dependency in their project.

The source release can be found at:

http://apex.apache.org/downloads.html

We welcome your help and feedback. For more information on the project and how to get involved, visit our website at:

http://apex.apache.org/

Tuesday September 06, 2016

Announcing Apache Apex Malhar 3.5.0 Release

The Apache Apex community is pleased to announce release 3.5.0 of the Malhar library.

The release resolved 63 JIRAs and comes with exciting new features and enhancements, including:

  • Windowed Operator that supports the windowing semantics outlined by Apache Beam and Google Cloud DataFlow, including the concepts of event time windows, session windows, watermarks, allowed lateness, and triggering. As of this release, the Windowed Operator manages its state in memory that is checkpointed to DFS, and work is underway to make the state management scalable.

  • High level Java stream API now uses the aforementioned Windowed Operator to support stateful transformation with Apache Beam style windowing semantics. The demo package has examples for usage of the API and more info in this presentation.

  • Introduction of Spillable Data Structures that make use of Managed State. This is an abstraction layer over Managed State that provides ease of use to state management components in custom operators and incremental state saving. Work is underway to take this further in the next release, along with benchmarking for key cardinality and throughput, and to be used by the Windowed Operator for managing large state.

  • Deduper solves a frequent task in processing stream data, to decide whether a given record is a duplicate or not. The documentation explains this in detail.

  • JDBC poll operator, another operator frequently required in Apex use cases. The operator can function as bounded or unbounded source, is idempotent for exactly-once processing and partitionable. An example can be found here.

  • Enricher that essentially joins a stream with a lookup source and can operate on any POJO object. The user can solve this through configuration and does not need to write code for the operator. See documentation which has an example linked.

There are more features, enhancements and fixes is this release, see https://s.apache.org/5vQi for full changes.

Apache Apex is an enterprise grade native YARN big data-in-motion platform that unifies stream and batch processing. Apex was built for scalability and low-latency processing, high availability and operability.

Apex provides unique features that similar platforms currently don't offer, such as fine grained, incremental recovery to only reset the portion of a topology that is affected by a failure, support for elastic scaling based on the ability to acquire (and release) resources as needed as well as the ability to alter topology and operator properties on running applications.

Apex has been developed since 2012 and became ASF top level project earlier this year, following 8 months of incubation. Apex early on brought the combination of high throughput, low latency and fault tolerance with strong processing guarantees to the stream data processing space and gained maturity through important production use cases at several organizations. See the powered by page and resources on the project web site for more information.

The Apex engine is supplemented by Malhar, the library of pre-built operators, including adapters that integrate with many existing technologies as sources and destinations, like message buses, databases, files or social media feeds.

An easy way to get started with Apex is to pick one of the examples as starting point. They cover many common and recurring tasks, such as data consumption from different sources, output to various sinks, partitioning and fault tolerance.

Apex Malhar and Core (the engine) are separate repositories and releases. We expect more frequent releases of Malhar to roll out new connectors and other operators based on a stable engine API. This release 3.5.0 works on existing Apex Core 3.4 installations. Users only need to upgrade the Maven dependency in their project.

The source release can be found at:

http://apex.apache.org/downloads.html

We welcome your help and feedback. For more information on the project and how to get involved, visit our website at:

http://apex.apache.org/

Calendar

Search

Hot Blogs (today's hits)

Tag Cloud

Categories

Feeds

Links

Navigation