Thomas Telaak

Recently Microsoft released the new SQL Server Data Tools for Visual Studio 2015 (still in Preview), which include new SQL Server Integration Services (SSIS) tasks to extract and load data from and into an Apache Hadoop cluster. This is nice – but it is also possible with older versions of SSIS.

This blog post will show you how to use cURL and the Microsoft SQL Server Integration Services (SSIS) in conjunction with WebHCat, WebHDFS and the KNOX Gateway.

Let me first give you a short description of the Apache Hadoop services mentioned above. The KNOX Gateway is used to secure Hadoop clusters by providing a single point of authentication and authorization: users access Apache Hadoop services and their APIs, such as WebHDFS or WebHCat, through Knox's own REST API, which delegates the requests to the corresponding services. Knox ships with an integrated "Demo LDAP" for authentication, so we use the preconfigured guest user to work with the service. To use the LDAP, you have to start it via "Service Actions" (see screenshot).

Start LDAP Service
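If you prefer the command line, the Knox demo LDAP can also be started with the ldap.sh script that ships with Knox; the path below assumes an HDP installation and may differ in your environment:

/usr/hdp/current/knox-server/bin/ldap.sh start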

WebHDFS is a REST API that provides a complete file system interface for HDFS.
WebHCat is a REST API for HCatalog, Hadoop's table and storage management layer.
To communicate with these REST APIs from SSIS, we just need a tool that provides this functionality.
I used cURL for Windows in an "Execute Process Task" (see screenshot).

SSIS Execute Task

Here is an example of uploading a file to HDFS using the WebHDFS API. The file is compressed as GZIP – a format that Hive understands for table compression. That means the file can be uploaded directly into an external table's folder and queried immediately.

Executable:
Path to curl.exe
Arguments:
-iku guest:guest-password -L -T sample_table.gz -X PUT https://127.0.0.1:8443/gateway/default/webhdfs/v1/user/guest/sample_table/sample_table.gz?op=CREATE
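If you want to verify the upload from within the package, a second "Execute Process Task" can list the target directory via WebHDFS (just a sketch, reusing the Knox URL and guest credentials from above):

Executable:
Path to curl.exe
Arguments:
-iku guest:guest-password -L -X GET https://127.0.0.1:8443/gateway/default/webhdfs/v1/user/guest/sample_table/?op=LISTSTATUS

The response is a JSON document listing the directory contents; if sample_table.gz shows up there, the upload succeeded.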
 
After the file is uploaded, we can use another "Execute Process Task" with cURL to create a new external table. The location of the table will be the sample_table directory we just created. This time we use the HCatalog API – WebHCat:

Executable:
Path to curl.exe
Arguments:
-v -i -k -u guest:guest-password -X POST -d execute="CREATE+EXTERNAL+TABLE+sample_test+(id+string,firstname+string,lastname+string)+ROW+FORMAT+DELIMITED+FIELDS+TERMINATED+BY+'\073'+LOCATION+'/user/guest/sample_table/';" -d statusdir="tmp" "https://127.0.0.1:8443/gateway/default/templeton/v1/hive"

After execution, the new table will show up in the Database Explorer of the Ambari Hive view.

Hive View Database Explorer
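Note that WebHCat runs the statement as an asynchronous job: the POST returns a JSON response containing a job ID. If your SSIS package should wait for the table before moving on, you can poll the jobs resource from another "Execute Process Task" (just a sketch; <job_id> stands for the ID returned by the POST above):

Executable:
Path to curl.exe
Arguments:
-i -k -u guest:guest-password -X GET https://127.0.0.1:8443/gateway/default/templeton/v1/jobs/<job_id>

Because we passed statusdir="tmp", WebHCat also writes the job's stdout and stderr to that directory (relative to the guest user's HDFS home directory).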

Now you can simply query the table or do some more transformations.

Hive View Query Editor
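Queries can of course also be submitted from SSIS through WebHCat, in the same way as the CREATE TABLE statement above. For example (just a sketch, reusing the guest credentials; results and logs end up in the statusdir):

Executable:
Path to curl.exe
Arguments:
-i -k -u guest:guest-password -X POST -d execute="SELECT+COUNT(*)+FROM+sample_test;" -d statusdir="tmp" "https://127.0.0.1:8443/gateway/default/templeton/v1/hive"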

If you want WebHCat to run your Hive program from a script file stored in HDFS, you can replace "execute" with "file" and pass the file path instead of the Hive statement, as sketched below.
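A sketch of what that could look like, assuming a HiveQL script has already been uploaded to /user/guest/scripts/create_table.hql (a hypothetical path) via WebHDFS:

Executable:
Path to curl.exe
Arguments:
-v -i -k -u guest:guest-password -X POST -d file="/user/guest/scripts/create_table.hql" -d statusdir="tmp" "https://127.0.0.1:8443/gateway/default/templeton/v1/hive"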

You see – this is a very simple way to use the powerful SQL Server Integration Services (SSIS) to do ETL with Hadoop.