Hilmar Buchta

SQL Server 2016

One of the exciting new features in SQL Server 2016 that is already available in the public preview version (CTP 2) is Polybase, the transparent access bridge from SQL to Hadoop. This functionality has been available in the Microsoft Analytics Platform System (APS) since version 2012 and has now made it’s way into SQL Server. You can find a lot more information in this blog post for example: http://blogs.technet.com/b/dataplatforminsider/archive/2014/06/02/polybase-in-aps-yet-another-sql-over-hadoop-solution.aspx

This blog post covers the first steps for using Polybase in SQL Server 2016 community preview 2 (CTP 2).

First of all, you need to install Polybase functionality during the installation process by checking the appropriate option:

image

This installs two services for Polybase as shown below:

image

Also, when opening a database in SQL Server Management Studio you’ll see some new objects:

image

External tables are your gateway to hadoop. Later in this post we will create an external table to a Hadoop file which can then be used mostly like a normal SQL Server table.

External data sources define the connection to the Hadoop cluster (more precisely to the Hadoop file system, HDFS), while external file formats are used to define the structure of the file on the Hadoop cluster.

Before we can start, we need to configure SQL Server for Hadoop connectivity. The following code is an example:

EXEC sp_configure ‚hadoop connectivity‘, 5;
GO
RECONFIGURE;
GO

You can use the following configuration values:

0 no Hadoop connectivity (default)
3 Enable connectivity to Cloudera CDH for Linux
4 Enable connectivity to Hortonworks Data Platform for Windows Server (HDP), HDInsight on Analytics Platform System, or HDInsight’s Microsoft Azure blob storage
5 Enable connectivity to Hortonworks Data Platform (HDP)for Linux

In my case, I’m using a Hortonworks Sandbox on Linux, so I’m using the configuration value 5.

Also, the user that is used to access Hadoop has to be configured in the Polybase configuration file which can be found in the Binn\Polybase\Settings sub directory of the instance, in my case

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

The default user is pdw_user which I changed to ‘hue’ here:

image

To make all these changes active, I had to restart SQL Server and the two Polybase services.

Now, let’s create an external data source. You can create a code snippet from the context menu of the external data sources folder. Here is the code I’m using for my sandbox:

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = ‚hdfs://sandbox.hortonworks.com:8020‘
)

Next, we need to create an external file format. I’d like to access the tab delimited file sample_07 from Hive which is shipped with the sandbox. Therefore the external file format looks like this:

CREATE EXTERNAL FILE FORMAT TSV
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ‚\t‘,
        DATE_FORMAT = ‚MM/dd/yyyy‘
    )
)

The date format is not needed for my case (there is no date column in this file) but just there to show the functionality.

The two new server objects now appear in the “External Resources” section of the database.

image

Next, we can create the external table itself using the defined data source and file format.

create external table sample_07
(
    code nvarchar(255),
    description nvarchar(255),
    total_emp int,
    salary nvarchar(255)
)
WITH
(
    LOCATION = ‚/apps/hive/warehouse/sample_07‘,
    DATA_SOURCE = HDP2,
    FILE_FORMAT = TSV,
    REJECT_TYPE = value,
    REJECT_VALUE=0
)

The new external table appears in the “External Tables” folder of the database.

image

We can now query the table like any other table in the database, for example:

select * from Sample_07

image

Here is another example

select left(code,2) code2, sum(total_emp) total_emp, count(*) Count
from Sample_07
group by left(code,2)

And of course you can also write queries with joins between Polybase external tables and local SQL Server tables which gives a transparant SQL query experience to Hadoop data.

 

Final notes

  • The functionality and syntax may change in the final product, so this only applies to the technical preview 2 (CTP2)
  • Currently, writing to a hadoop file is not supported (there should be something like the “create external table as select” syntax in the APS)
  • This post is just a brief overview. There are many more options (for example more file formats, compression etc.) that can be used. A good orientation can be found in the Microsoft APS support tools download which can be found here: http://www.microsoft.com/en-us/download/details.aspx?id=45294