Following my last blog about an Internet of Things project with the company Dr. A. Kuntze, I will now drill down into one component we used: Azure Data Factory, the workflow engine provided in the cloud.
We used Azure Data Factory to define our nightly batch load. A simple example of a workflow is shown in this Azure Data Factory diagram:
As you can see, we have two different inputs for our single pipeline and have defined two different outputs, which essentially correspond to steps of the Hive transformation running on the HDInsight (Hadoop) service on Azure.
OnpremiseSQLServerDataset: Here we pull the reference data, which are the registered sensors and some of their master data.
HiveInputBlobTable: Actually this is a reference to a folder where all of our sensor data ends up during the day, transmitted through the real-time load via Azure Stream Analytics. Usually this amounts to several large JSON files.
HivePipeline: Within this pipeline we have defined one activity, which is our Hive transformation script.
HiveOutputBlobTable: In a first step, the JSON files are transformed into a single Hive table.
HiveOutputBlobTableCSV: In a second step, we apply business rules like alarming and calculate derived columns. The resulting CSV file is in a format that can be consumed directly by Power BI in a daily refresh scenario.
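Under the hood, each of these objects is a JSON document. As a rough sketch of what our pipeline definition could look like (names are taken from the diagram above; the exact properties are illustrative and follow the Azure Data Factory JSON schema at the time of writing, so check the current documentation before copying):

```json
{
  "name": "HivePipeline",
  "properties": {
    "description": "Nightly batch load: transform raw sensor JSON into Hive/CSV",
    "activities": [
      {
        "name": "RunHiveTransformation",
        "type": "HDInsightHive",
        "inputs": [
          { "name": "OnpremiseSQLServerDataset" },
          { "name": "HiveInputBlobTable" }
        ],
        "outputs": [
          { "name": "HiveOutputBlobTable" },
          { "name": "HiveOutputBlobTableCSV" }
        ],
        "linkedServiceName": "HDInsightLinkedService",
        "typeProperties": {
          "scriptPath": "scripts/transformsensordata.hql",
          "scriptLinkedService": "StorageLinkedService"
        },
        "policy": {
          "timeout": "01:00:00",
          "retry": 2
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        }
      }
    ]
  }
}
```

Note that the inputs and outputs only reference datasets by name; the datasets themselves are defined in separate JSON documents.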
We defined our input and output references to point to a different storage container than the one where the HDInsight cluster is provisioned. That gives us the comfort of having our HDInsight cluster up and running only for the duration of the transformation and then dropping the cluster, which is a cost-efficient approach.
In the following short video you will see an overview of the current possibilities within Azure Data Factory.
When you add objects like linked services or datasets from the Azure Data Factory menu, you (still) simply get a JSON template which you have to configure. The same is true for the definition of the pipeline. So the diagram shown above in the simple example is the result of a correct JSON configuration. Nevertheless, setting up an Azure Data Factory workflow is not very complex, and the graphical overview and the logs of the last executions are essential for professional maintenance of workflows that connect on-premises and cloud data sources. Without them, debugging is quite time-consuming. See this example to get the idea:
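To give an impression of those JSON templates: a blob dataset like our HiveInputBlobTable, once configured, could look roughly like this (folder path and linked service name are illustrative assumptions, not our production values):

```json
{
  "name": "HiveInputBlobTable",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "StorageLinkedService",
    "typeProperties": {
      "folderPath": "sensordata/raw",
      "format": { "type": "JsonFormat" }
    },
    "external": true,
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```

The "external" flag marks the dataset as produced outside of Data Factory (in our case by Azure Stream Analytics), and the "availability" section tells the scheduler how often a new slice of data is expected.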
One handy thing that was not mentioned in the short video is the possibility to configure computation tasks, like provisioning your HDInsight cluster on demand (so that it is only available for the duration of the transformation) or running batch requests against Azure ML (Machine Learning).
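The on-demand cluster is configured as a linked service of its own. A sketch of such a definition might look like this (cluster size, time-to-live and version are illustrative values; the property names follow the Azure Data Factory schema for on-demand HDInsight at the time of writing):

```json
{
  "name": "HDInsightOnDemandLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "typeProperties": {
      "clusterSize": 4,
      "timeToLive": "00:30:00",
      "version": "3.2",
      "linkedServiceName": "StorageLinkedService"
    }
  }
}
```

With a definition like this, Data Factory provisions the cluster when an activity needs it and tears it down after the time-to-live expires, which matches the cost-efficient approach described above.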
The kind of processing you do with Azure Data Factory is quite different from traditional ETL (Extract, Transform, Load), where you load the transformed data into a fixed, predefined schema.
With Azure Data Factory you process data the Hadoop way: extract and load directly, which means the first load keeps the original format (ELT). That also means you don't lose any data. From then on you transform your data in several steps until you have the format and schema you need.
Of course, not all workflows are as simple as our demo example. What is nice about Azure Data Factory is that you can break down your overall workflow into handy parts and then connect those parts.
The result of a pipeline, which can include 1–n activities (transformations), is a dataset, and that result dataset can in turn serve as an input for the next pipeline in the DataHub.
So here is an overview of the key benefits of Azure Data Factory:
· Connect Cloud and On-Premises Data Sources
· Supports Hive, Pig & C# processing
· Automatic Hadoop (HDInsight) Cluster Management
· Retries for transient failures, configurable timeout policies & alerting
· Monitor data pipeline in one place
· Visually track data lineage
· Full historical accounting of job execution, system health and dependencies in a single monitoring dashboard