Stephan Köppen

Since the PDW is a fairly new product, at some points you have to help the optimizer to build better execution plans. Within this article we will learn how to read the explain plan and give query hints to speed up the select Statement.

Requirement is a working copy of the AdventureWorksDW Database for PDW.

Preparing the Table by adding some additional data

Within a DWH it happens quite often that a Fact table references to the same dimension multiple times. This might occur when you buy a piece of hardware in a computer chain. There is always the initial store where you bought your computer and for the follow up, like repair stuff or modification, you can go to another store.

Therefore we add the SalesTerritoryKey again and fill it up with some random Data. We also modify our shipdatekey with some random information.

ALTER TABLE factinternetsales
  ADD salesterritorykey2 INT

UPDATE factinternetsales
SET    salesterritorykey2 = CASE
                              WHEN salesterritorykey IN ( 5, 8 ) THEN
                              salesterritorykey
                              WHEN salesterritorykey IN
                                   (SELECT Max(salesterritorykey)
                                    FROM   factinternetsales) THEN
                              salesterritorykey  1
                              ELSE salesterritorykey + 1
                            END

UPDATE factinternetsales
SET    shipdatekey = CASE
                       WHEN shipdatekey % 10 = 1 THEN 20130501
                       WHEN shipdatekey % 10 = 2 THEN 20130601
                       WHEN shipdatekey % 10 = 3 THEN 20130701
                       ELSE 20130801
                     END 

For the following analysis and optimization we update the statistics first, so we can make sure the query optimizer has the latest information.


UPDATE STATISTICS factinternetsales
UPDATE STATISTICS dimsalesterritory
UPDATE STATISTICS dimpromotion 

Querying the PDW

Now, that all of the preparation is done, lets start with a simple query we would find in our daily business.


SELECT Count(*),
       Sum(fis.salesamount),
       dst1.salesterritorycountry AS CountryFrom,
       dst2.salesterritorycountry AS CountryTo,
       dp.englishpromotiontype
FROM   factinternetsales fis
       INNER JOIN dimsalesterritory dst1
               ON fis.salesterritorykey = dst1.salesterritorykey
       INNER JOIN dimsalesterritory dst2
               ON fis.salesterritorykey2 = dst1.salesterritorykey
       INNER JOIN dimpromotion dp
               ON fis.promotionkey = dp.promotionkey
GROUP  BY dst1.salesterritorycountry,
          dst2.salesterritorycountry,
          dp.englishpromotiontype,
          fis.shipdatekey,
          fis.discountamount,
          fis.customerponumber,
          fis.customerkey 

As we can see, the statement includes a couple joins, aggregations and groupings.
The query itself executes in 3 seconds.

If we check the execution plan, we will find something very interesting.

Key1

In our scenario, the optimizer decides to use the DiscountAmount Column as Shuffle Column.

A simple count and group query shows us that the cardinality of this column is 1, which lets the whole shuffle operation take place on 1 node.


SELECT Count(*),
       discountamount
FROM   factinternetsales
GROUP  BY discountamount 

Key2

Now in this case it doesn’t really matter since it is just 3 seconds, but in other cases it could lead to minutes if not even hours.

Let’s use a different column with a better cardinality e.g. the CustomerKey column. A quick test shows that it is a great column for distribution.


SELECT Count(*),
       customerkey
FROM   factinternetsales
GROUP  BY customerkey 

Key3

PDW offers some great query hints to influence the execution of statements. Therefore we use the „DISTRIBUTED_AGG“ hint within the group by clause to tell the PDW what column to shuffle on.


SELECT Count(*),
       Sum(fis.salesamount),
       dst1.salesterritorycountry AS CountryFrom,
       dst2.salesterritorycountry AS CountryTo,
       dp.englishpromotiontype
FROM   factinternetsales fis
       INNER JOIN dimsalesterritory dst1
               ON fis.salesterritorykey = dst1.salesterritorykey
       INNER JOIN dimsalesterritory dst2
               ON fis.salesterritorykey2 = dst1.salesterritorykey
       INNER JOIN dimpromotion dp
               ON fis.promotionkey = dp.promotionkey
GROUP  BY dst1.salesterritorycountry,
          dst2.salesterritorycountry,
          dp.englishpromotiontype,
          fis.shipdatekey,
          fis.discountamount,
          fis.customerponumber,
          fis.customerkey with (distribution_agg) 

This is what the newly created execution plan looks like.

Key4

After doing so, the execution plan shows that it has changed the column and the execution doubled the performance.