Apache Falcon

Tuesday March 01, 2016

What's new in Falcon 0.9?

With new feature requirements flowing in constantly, the Falcon project is making more frequent releases so that features reach users sooner. The latest in this string of releases is Falcon 0.9, announced last week. The community worked on many new features and a whole lot of product improvements. Some of the features that stand out are:




  • Native time-based scheduling

  • Ability to import from and export to a database

  • Additional API support in Falcon Unit




Native time-based scheduling


Falcon has been using Oozie as its scheduling engine. While Oozie works reasonably well, there are scenarios where its scheduling model is proving to be a limiting factor, such as:




  • Simple periodic scheduling with no gating conditions.

  • Calendar-based time triggers, e.g., the last working day of every month.

  • Scheduling based on data availability for aperiodic datasets.

  • External triggers.

  • Data-based predicates, such as availability of a minimum subset of instances of data.


To overcome these limitations, a native scheduler is being built and will be rolled out over the next few releases of Falcon, giving users an opportunity to try it and provide early feedback. In the 0.9 release, only time-based scheduling without data dependency is supported.


Before natively scheduling processes on Falcon, you must make some changes to startup.properties; for details, see http://falcon.apache.org/0.9/FalconNativeScheduler.html. Once that is done, you can schedule a process using the native scheduler as follows:


falcon entity -type process -name <process name> -schedule -properties falcon.scheduler:native
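As an illustration, the startup.properties change amounts to pointing the workflow engine at the native implementation. The property values below follow the linked documentation, but verify them against your Falcon version before use:

workflow.engine.impl=org.apache.falcon.workflow.engine.FalconWorkflowEngine
dag.engine.impl=org.apache.falcon.workflow.engine.OozieDAGEngine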


All the entity and instance APIs work seamlessly irrespective of whether the entity is scheduled on Oozie or natively.


Data Import and Export


In this release, Falcon provides constructs to periodically bring raw data from external data sources (databases, drop boxes, etc.) onto Hadoop, and to push data computed on Hadoop out to external data sources. Currently, Falcon supports only relational databases (e.g., Oracle, MySQL) via JDBC as an external data source; future releases will add support for other external data sources.


To allow users to specify an external data source (or sink), Falcon has introduced a new entity type, datasource. Users provide the datasource connection and JDBC connector details declaratively in the datasource definition, then submit the entity as follows:


falcon entity -submit -type datasource -file mysql_datasource.xml
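For reference, a mysql_datasource.xml along these lines would declare the JDBC endpoint, credentials, and driver. This is an illustrative sketch: the exact element names should be checked against the datasource schema in the Falcon documentation, and the host, credentials, and jar path here are placeholders:

<datasource colo="west-coast" description="MySQL database" type="mysql" name="test-hsql-db" xmlns="uri:falcon:datasource:0.1">
    <interfaces>
        <interface type="readonly" endpoint="jdbc:mysql://dbhost:3306/customers"/>
        <credential type="password-text">
            <userName>dbuser</userName>
            <passwordText>dbpassword</passwordText>
        </credential>
    </interfaces>
    <driver>
        <clazz>com.mysql.jdbc.Driver</clazz>
        <jar>/path/to/mysql-connector-java.jar</jar>
    </driver>
</datasource>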


This datasource can then be referenced in the feed definition for import or export operations as shown below:


<feed description="Customer data" name="CustomerFeed" xmlns="uri:falcon:feed:0.1">
    <clusters>
        <cluster name="testCluster" type="source">
            <import>
                <source name="test-hsql-db" tableName="customer">
                    .....
                </source>
            </import>
        </cluster>
    </clusters>
    .....
</feed>


The above feed can then be submitted and scheduled for periodic import of data. For more details, see http://falcon.apache.org/0.9/ImportExport.html
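Concretely, submitting and scheduling the feed uses the standard entity commands (the file name customer_feed.xml is assumed here):

falcon entity -submit -type feed -file customer_feed.xml

falcon entity -schedule -type feed -name CustomerFeed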


Additional API support in Falcon Unit


Falcon Unit has been enhanced with support for the entity delete, update, and validate APIs, the instance management APIs, and the admin APIs. You can unit test your Falcon processes in your local environment (or on a cluster) and use all the Falcon APIs in your tests for validation. Testing a process with a data dependency is as simple as:


submitCluster();

submit(EntityType.FEED, "clicks.xml");

createData("HourlyClicks", "local", scheduleTime, "test-data", numInstances);

submit(EntityType.PROCESS, "daily_clicks.xml");

APIResult result = scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);

waitForStatus(EntityType.PROCESS.name(), "daily_clicks_agg", scheduleTime, InstancesResult.WorkflowStatus.SUCCEEDED);


For more usage examples, see the Falcon Unit documentation.


Other notable improvements include:


Ability to capture Hive DR replication metrics - The Hive DR feature introduced in Falcon 0.8 invokes a replication job to transfer Hive metadata and data from source to destination. Falcon 0.9 adds the ability to capture data transfer details, such as bytes transferred and files copied by the replication job, along with the ability to retrieve these details from the CLI.


Support for retries of timed-out instances - The ability to retry FAILED instances has now been extended to TIMED_OUT instances as well. This is useful in scenarios where feed delays are expected: users can now specify retries with periodic or exponential back-off instead of having to set a very large timeout value.
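As an illustrative sketch, a retry policy in a process definition that also covers timed-out instances might look like the following; the onTimeout attribute name is an assumption and should be verified against the 0.9 process schema:

<retry policy="exp-backoff" delay="minutes(10)" attempts="3" onTimeout="true"/>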


Go ahead, download Falcon 0.9 and try out the new features.




