Apache NiFi

Tuesday Jun 06, 2017

Real-Time SQL on Event Streams

Real-Time SQL On Event Streams

Mark Payne -  @dataflowmark


Apache NiFi has grown tremendously over the past 2 and a half years since it was open sourced. The community is continuously thinking of, implementing, and contributing amazing new features. The newly released version 1.2.0 of NiFi is no exception. One of the most exciting features of this new release is the QueryRecord Processor and the Record Reader and Record Writer components that go along with it. If you aren't familiar with those components, there's a blog post that explains how those work.

This new Processor, powered by Apache Calcite, allows users to write SQL SELECT statements to run over their data as it streams through the system. Each FlowFile in NiFi can be treated as if it were a database table named FLOWFILE. These SQL queries can be used to filter specific columns or fields from your data, rename those columns/fields, filter rows, perform calculations and aggregations on the data, route the data, or whatever else you may want to use SQL for. All from the comfy confines of the most widely known and used Domain Specific Language.

This is a big deal! Of course, there are already other platforms that allow you to run SQL over arbitrary data, though. So let's touch on how NiFi differs from other platforms that run SQL over arbitrary data outside of an RDBMS:

  • Queries are run locally. There is no need to push the data to some external service, such as S3, in order to run queries over your data. There's also no need to pay for that cloud storage or the bandwidth, or to create temporary "staging tables" in a database.
  • Queries are run inline. Since your data is already streaming through NiFi, it is very convenient to add a new QueryRecord Processor to your canvas. You are already streaming your data through NiFi, aren't you? If not, you may find Our Docs Page helpful to get started learning more about NiFi.
  • Query data in any format. Write results in any data format. One of the goals of this effort was to allow data to be queried regardless of the format. In order to accomplish this, the Processor was designed to be configured with a "Record Reader" Controller Service and a "Record Writer" Controller Service. Out of the box, there are readers for CSV, JSON, Avro, and even log data. The results of the query can be written out in CSV, JSON, Avro, or free-form text (for example, a log format) using the NiFi Expression Language. If your data is in another format, you are free to write your own implementation of the Record Reader and/or Record Writer Controller Service. A simple implementation can wrap an existing library to return Record objects from an InputStream. That's all that is needed to run SQL over your own custom data format! Writing the results in your own custom data format is similarly easy.
  • It's fast - very fast! Just how fast largely depends on the performance of your disks and the number of disks that you have available. The data must be read from disk, queried, and then the results written to disk. In most scenarios, reading of the data from disk is actually avoided due to operating system disk caching, though. We've tested the performance on a server with 32 cores and 64 GB RAM. We ran a continuous stream of JSON-formatted Provenance data through the Processor. To do this, we used the NiFi Provenance Reporting Task to send the Provenance data back to the NiFi instance. Because we wanted to stress the Processor, we then used a DuplicateFlowFile processor to create 200 copies of the data (this really just creates 200 pointers to the same piece of data; it doesn't copy the data itself). We used a SQL query to pull out any Provenance "SEND" event (a small percentage of the data, in this case). Using 12 concurrent tasks, we were able to see the query running at a consistent rate of about 1.2 GB per second on a single node - using less than half of the available CPU!
  • Data Provenance keeps a detailed trail of what happened. One of the biggest differentiators between NiFi and other dataflow platforms is the detailed Data Provenance that NiFi records. If the data that you end up with is not what you expect, the Data Provenance feature makes it easy to see exactly what the data looked like at each point in the flow and pinpoint exactly what when wrong - as well as understand where the data come from and where the data went. When the flow has been updated, the data can be replayed with the click of a button and the new results can be verified. If the results are still not right, update your flow and replay again.

How It Works

In order to get started, we need to add a QueryRecord Processor to the graph. Once we've added the Processor to the graph, we need to configure three things: the Controller Service to use for reading data, the service to use for writing the results, and one or more SQL queries. The first time that you set this all up, it may seem like there's a lot going on. But once you've done it a couple of times, it becomes pretty trivial. The rest of this post will be dedicated to setting everything up. First, we will configure the Record Reader Controller Service. We'll choose to create a new service:

We are given a handful of different types of services to choose from. For this example, we will use CSV data, so we will use a CSVReader:

We will come back and configure our new CSVReader service in a moment. For now, we will click the "Create" button and then choose to create a new Controller Service to write the results. We will write the results in JSON format:

We can again click the "Create" button to create the service. Now that we have created our Controller Services, we will click the "Go To" button on our reader service to jump to the Controller Service configuration page:

Configuring the Reader

We can click the pencil in the right-hand column to configure our CSV Reader:

The CSV Reader gives us plenty of options to customize the reader to our format, as can be seen in the above image. For this example, we will leave most of the defaults, but we will change the "Skip Header Line" Property from the default value of "false" to "true" because our data will contain a header line that we don't want to process as an actual record.

In order to run SQL over our data and make sense of the columns in our data, we need to also configure the reader with a schema for the data. We have several options for doing this. We can use an attribute on the FlowFile that includes an Avro-formatted schema. Or we can use a Schema Registry to store our schema and access it by name or by identifier and version. Let's go ahead and use a local Schema Registry and add our schema to that Registry. To do so, we will change the "Schema Access Strategy" to "Use 'Schema Name' Property." This means that we want to look up the Schema to use by name. The name of the schema is specified by the "Schema Name" Property. The default value for that property is "${schema.name}", which means that we will just use the "schema.name" attribute to identify which schema we want to use. Instead of using an attribute, we could just type the name of the schema here. Doing so would mean that we would need a different CSV Reader for each schema that we want to read, though. By using an attribute, we can have a single CSV Reader that works for all schemas.

Next, we need to specify the Schema Registry to use. We click the value of the "Schema Registry" property and choose to "Create new service..." We will use the AvroSchemaRegistry. (Note that our incoming data is in CSV format and the output will be in JSON. That's okay, Avro in this sense only refers to the format of the schema provided. So we will provide a schema in the same way that we would if we were using Avro.) We will click the "Create" button and then click the "Go To" arrow that appears in the right-hand column in order to jump to the Schema Registry service (and click 'Yes' to save changes to our CSV Reader service).

This will take us back to our Controller Services configuration screen. It is important to note that from this screen, each Controller Service has a "Usage" icon on the left-hand side (it looks like a little book). That icon will take you to the documentation on how to usage that specific Controller Service. The documentation is fairly extensive. Under the "Description" heading, each of the Record Readers and Writers has an "Additional Details..." link that provides much more detailed information about how to use service and provides examples.

We will click the Edit ("pencil") icon next to the newly created AvroSchemaRegistry and go to the Properties tab. Notice that the service has no properties, so we click the New Property ("+") icon in the top-right corner. The name of the property is the name that we will use to refer to the schema. Let's call it "hello-world". For the value, we can just type or paste in the schema that we want to use using the Avro Schema syntax. For this example, we will use the following schema:

{
  "name": "helloWorld",
  "namespace": "org.apache.nifi.blogs",
  "type": "record",
  "fields": [
      { "name": "purchase_no", "type": "long" },
      { "name": "customer_id", "type": "long" },
      { "name": "item_id", "type": ["null", "long"] },
      { "name": "item_name", "type": ["null", "string"] },
      { "name": "price", "type": ["null", "double"] },
      { "name": "quantity", "type": ["null", "int"] },
      { "name": "total_price", "type": ["null", "double"] }
  ]
}

Now we can click "OK" and apply our changes. Clicking Enable (the lightning bolt icon) enables the service. We can now also enable our CSV Reader.

Configuring the Writer

Similarly, we need to configure our writer with a schema so that NiFi knows how we expect our data to look. If we click the Pencil icon next to our JSONRecordSetWriter, in the Properties tab, we can configure whether we want our JSON data to be pretty-printed or not and how we want to write out date and time fields. We also need to specify how to access the schema and how to convey the schema to down-stream processing. For the "Schema Write Strategy," since we are using JSON, we will just set the "schema.name" attribute, so we will leave the default value. If we were writing in Avro, for example, we would probably want to include the schema in the data itself. For the "Schema Access Strategy," we will use the "Schema Name" property, and set the "Schema Name" property to "${schema.name}" just as we did with the CSV Reader. We then select the same AvroSchemaRegistry service for the "Schema Registry" property.

Again, we click "Apply" and then click the Lightning icon to enable our Controller Service and click the Enable button. We can then click the "X" to close out this dialog.

Write the SQL

Now comes the fun part! We can go back to configure our QueryRecord Processor. In the Properties tab, we can start writing our queries. For this example, let's take the following CSV data:


purchase_no, customer_id, item_id, item_name, price, quantity
10280, 40070, 1028, Box of pencils, 6.99, 2
10280, 40070, 4402, Stapler, 12.99, 1
12440, 28302, 1029, Box of ink pens, 8.99, 1
28340, 41028, 1028, Box of pencils, 6.99, 18
28340, 41028, 1029, Box of ink pens, 8.99, 18
28340, 41028, 2038, Printer paper, 14.99, 10
28340, 41028, 4018, Clear tape, 2.99, 10
28340, 41028, 3329, Tape dispenser, 14.99, 10
28340, 41028, 5192, Envelopes, 4.99, 45
28340, 41028, 3203, Laptop Computer, 978.88, 2
28340, 41028, 2937, 24\" Monitor, 329.98, 2
49102, 47208, 3204, Powerful Laptop Computer, 1680.99, 1

In our Properties tab, we can click the "Add Property" button to add a new property. Because we can add multiple SQL queries in a single Processor, we need a way to distinguish the results of each query and route the data appropriately. As such, the name of the property is the name of the Relationship that data matching the query should be routed to. We will create two queries. The first will be named "over.1000" and will include the purchase_no and customer_id fields of any purchase that cost more than $1,000.00 and will also include a new field named total_price that is the dollar amount for the entire purchase. Note that when entering a value for a property in NiFi, you can use Shift + Enter to insert a newline in your value:

SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price
FROM FLOWFILE
GROUP BY purchase_no, customer_id
HAVING SUM(price * quantity) > 1000

The second property will be named "largest.order" and will contain the purchase_no, customer_id, and total price of the most expensive single purchase (as defined by price times quantity) in the data:

SELECT purchase_no, customer_id, SUM(price * quantity) AS total_price
FROM FLOWFILE
GROUP BY purchase_no, customer_id
ORDER BY total_price DESC
LIMIT 1

Now we will wire our QueryRecord processor up in our graph so that we can use it. For this demo, we will simply use a GenerateFlowFile to feed it data. We will set the "Custom Text" property to the CSV that we have shown above. In the "Scheduling" tab, I'll configure the processor to run once every 10 seconds so that I don't flood the system with data. We need to add a "schema.name" attribute, so we will route the "success" relationship of GenerateFlowFile to an UpdateAttribute processor. To this processor, we will add a new Property named "schema.name" with a value of "hello-world" (to match the name of the schema that we added to the AvroSchemaRegistry service). We will route the "success" relationship to QueryRecord.

Next, we will create two UpdateAttribute Processors and connect the "over.1000" relationship to the first and the "largest.order" relationship to the other. This just gives us a simple place to hold the data so that we can view it. I will loop the "failure" relationship back to the QueryRecord processor so that if there is a problem evaluating the SQL, the data will remain in my flow. I'll also auto-terminate the "original" relationship because once I have evaluated the SQL, I don't care about the original data anymore. I'll also auto-terminate the "success" relationship of each terminal UpdateAttribute processor. When we start the QueryRecord, GenerateFlowFile, and the first UpdateAttribute processors, we end up with a FlowFile queued up before each UpdateAttribute processor:

If we right-click on the "over.1000" connection and click "List queue," we are able to see the FlowFiles in the queue:

Clicking the "information" icon on the left-hand-side of the FlowFile gives us the ability to view the FlowFile's attributes and download or view its contents. We can click the "View" button to view the content. Changing the "View as" drop-down at the top from "original" to "formatted" gives us a pretty-printed version of the JSON, and we quickly see the results that we want:

Note that we have a null value for the columns that are defined in our schema that were not part of our results. If we wanted to, We could certainly update our schema in order to avoid having these fields show up at all.

Viewing the FlowFiles of the "largest.order" connection shows us a single FlowFile also, with the content that we expect there as well:

Of course, if we have already run the data through the flow and want to go back and inspect that data and what happened to it, we can easily do that through NiFi's powerful Data Provenance feature.

Conclusion

Here, I have given but a small glimpse into the capabilities of the new QueryRecord Processor for NiFi. The possibilities that it opens up are huge. I'd love to hear thoughts and ideas on how to make it even better or questions that you may have! Feel free to reach out to the NiFi mailing list at users@nifi.apache.org or dev@nifi.apache.org.

 

Comments:

Mark, very interesting. Of course the aggregation is done on a flow file level. I wonder if you have a suggestion of how a workflow could look like if one wanted to aggregate across multiple flow files. Let's say one wants to aggregate data for a certain day. But then maybe that is not possible or too complicated and one would do it outside of nifi. But I wanted to ask anyway. Thanks, Uwe

Posted by uwe geercken on June 12, 2017 at 11:57 AM GMT #

Uwe, That is correct - aggregations are done on a FlowFile level, because each FlowFile is treated as an individual table. If you want to do aggregates across a time boundary such as a one-day boundary, you could either use the MergeContent processor (Will start working on a MergeRecord soon) to merge all of the records together for the same day. If you have logs that roll every day or every hour, for example, this may work well. Or, you could push to something like Spark, Storm, etc. to do windowed computations.

Posted by Mark Payne on June 12, 2017 at 12:16 PM GMT #

Post a Comment:
  • HTML Syntax: NOT allowed

Calendar

Search