Recently Microsoft released the new SQL Server Data Tools for Visual Studio 2015 (still in preview), which include new SQL Server Integration Services (SSIS) tasks to extract data from and load data into an Apache Hadoop cluster. This is nice, but the same thing is possible with older versions of SSIS.
Let me first give you a short description of the Apache Hadoop services involved: The KNOX Gateway secures Hadoop clusters by providing a single point of authentication and authorization. Users access Apache Hadoop services and their APIs, such as WebHDFS or WebHCat, through KNOX's own REST API, which delegates the requests to the corresponding services. The KNOX service ships with an integrated "Demo LDAP" for authentication, so we use the preconfigured guest user to work with the service. To use the LDAP, you have to start it via "Service Actions".
Start LDAP Service
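Before wiring anything into SSIS, it can help to verify that KNOX and the Demo LDAP are actually reachable. Here is a minimal sketch, assuming the sandbox host, port, and guest credentials used throughout this article; the script only composes and prints the curl command so you can paste it into a shell or into an SSIS Execute Process Task:

```shell
# Sanity check: ask WebHDFS (through KNOX) for the guest user's home directory.
# Host, port, and guest:guest-password are the sandbox defaults from this
# article; adjust them for your cluster.
KNOX_BASE="https://127.0.0.1:8443/gateway/default/webhdfs/v1"
CHECK_URL="${KNOX_BASE}/?op=GETHOMEDIRECTORY"
# -k accepts the self-signed certificate, -u sends the Demo-LDAP credentials.
echo curl -iku guest:guest-password -X GET "${CHECK_URL}"
```

If KNOX and the LDAP are running, executing the printed command returns a small JSON body with the guest user's home directory path.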
SSIS Execute Process Task
This is an example of uploading a file to HDFS using the WebHDFS API. The file is compressed as GZIP, a format Hive understands for table compression. That means the file can be uploaded directly into an external table folder and queried immediately.
Path to curl.exe
-iku guest:guest-password -L -T sample_table.gz -X PUT https://127.0.0.1:8443/gateway/default/webhdfs/v1/user/guest/sample_table/sample_table.gz?op=CREATE
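The sample_table.gz file used in the upload above can be produced with standard tools. A small sketch, assuming a semicolon-delimited source file (matching the \073 field terminator of the table definition below); the file name and rows are examples only:

```shell
# Build a small semicolon-delimited sample file and compress it with GZIP.
# Hive can read gzipped files in an external table folder without extra steps.
printf '1;John;Doe\n2;Jane;Roe\n' > sample_table.csv   # example rows only
gzip -c sample_table.csv > sample_table.gz
# The integrity check confirms the archive is a valid GZIP stream.
gzip -t sample_table.gz && echo "sample_table.gz is ready"
```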
Path to curl.exe
-v -i -k -u guest:guest-password -X POST -d execute="CREATE+EXTERNAL+TABLE+sample_test+(id+string,firstname+string,lastname+string)+ROW+FORMAT+DELIMITED+FIELDS+TERMINATED+BY+'\073'+LOCATION+'/user/guest/sample_table/';" -d statusdir="tmp" "https://127.0.0.1:8443/gateway/default/templeton/v1/hive"
After execution, the new table shows up in the database explorer of the Ambari Hive view.
Hive View Database Explorer
Hive View Query Editor
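WebHCat runs the statement as an asynchronous job and answers the POST with a JSON body that contains a job id. You can poll that job through the templeton jobs resource. A sketch, assuming the same KNOX endpoint as above; JOB_ID is a made-up placeholder, not a real id returned by the call:

```shell
# Compose a status query for a WebHCat job. JOB_ID below is a hypothetical
# placeholder; use the id from the JSON response of the POST request.
TEMPLETON_BASE="https://127.0.0.1:8443/gateway/default/templeton/v1"
JOB_ID="job_0000000000000_0000"
STATUS_URL="${TEMPLETON_BASE}/jobs/${JOB_ID}"
echo curl -iku guest:guest-password -X GET "${STATUS_URL}"
```

The stdout and stderr of the finished job also land in the HDFS directory given by the statusdir parameter, so they can be fetched with WebHDFS as well.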
If you want WebHCat to run a Hive script file stored in HDFS instead, you can replace "execute" with "file" and the Hive statement with the path to the script file.
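A sketch of that file-based variant, assuming the Hive statement has already been saved as a script in HDFS; the script path here is a hypothetical example:

```shell
# Run a Hive script stored in HDFS instead of an inline statement:
# the 'file' parameter replaces 'execute' and points at the script's HDFS path.
# HQL_FILE below is a made-up example path.
HQL_FILE="/user/guest/scripts/create_sample_table.hql"
TEMPLETON_URL="https://127.0.0.1:8443/gateway/default/templeton/v1/hive"
echo curl -iku guest:guest-password -X POST -d file="${HQL_FILE}" -d statusdir="tmp" "${TEMPLETON_URL}"
```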
As you can see, this is a very simple way to use the powerful SQL Server Integration Services (SSIS) to do ETL with Hadoop.