Stephan Köppen

One of the most important services within the PDW is the Data Movement Service, called DMS. It runs on the control node as well as on every compute node. The name itself already gives a hint of what it does, but let's go into a little more detail.

As we know, PDW with its MPP-based approach splits a distributed table by a defined column across all the distributions of the compute nodes. For this to work, every inserted row is checked by the DMS, a hash value is created from the distribution column, and the row is moved to the correct compute node. The DMS also analyses and processes queries sent to the control node and moves data between the nodes (and back) when necessary. This includes optimization, so that as little data as possible is shuffled.

So what does it do?
  • Checks every row and builds the hash value used for distribution
  • Transfers data between SQL PDW nodes, i.e. between the compute nodes as well as the control node
  • Processes query operations that need data shuffling (an incompatible join, for example; see the example below)
  • Optimizes data movement so that queries move as little data as possible
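
As a hedged illustration of the shuffle case (the table and column names below are invented for this example, not taken from a real appliance), assume two tables that are both hash-distributed on CustomerKey. A join on the distribution column can be answered locally per distribution, while a join on any other column forces the DMS to move rows first:

    -- Assumed setup: dbo.FactSales and dbo.FactReturns are both created
    -- WITH (DISTRIBUTION = HASH(CustomerKey)).

    -- Distribution-compatible join: the join key is the distribution column
    -- of both tables, so matching rows already live in the same distribution
    -- and the DMS does not have to move any data.
    SELECT s.CustomerKey, COUNT(*) AS ReturnedSales
    FROM dbo.FactSales   AS s
    JOIN dbo.FactReturns AS r ON r.CustomerKey = s.CustomerKey
    GROUP BY s.CustomerKey;

    -- Incompatible join: ProductKey is not the distribution column of either
    -- table, so the DMS has to shuffle rows between the nodes before the
    -- distributions can execute the join.
    SELECT s.ProductKey, COUNT(*) AS ReturnedSales
    FROM dbo.FactSales   AS s
    JOIN dbo.FactReturns AS r ON r.ProductKey = s.ProductKey
    GROUP BY s.ProductKey;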

This article is meant to explain the loading process and why it is important to choose the datatypes as well as the database design carefully to get the best performance. So how exactly does a load work? Let's assume we are going to load a flat file via dwloader, and that in our example we have two compute nodes to hold the data.
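
To make the walkthrough concrete, here is a minimal sketch of what the target table could look like; the table and column names are invented for this article. On a two-node appliance its rows end up in 2 * 8 = 16 distributions:

    -- Hypothetical load target, hash-distributed on CustomerKey.
    CREATE TABLE dbo.FactSales
    (
        SaleKey     bigint        NOT NULL,
        CustomerKey int           NOT NULL,
        ProductKey  int           NOT NULL,
        SaleDate    date          NOT NULL,
        SaleComment nvarchar(100) NULL        -- always 200 bytes in the ODBC format (see step 2)
    )
    WITH (DISTRIBUTION = HASH(CustomerKey));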

1) dwloader sends 256 KB chunks round robin to the DMS receiver queues
2) The ODBC converter takes each row and converts it to the ODBC Native Client format. This means that a column defined as nvarchar(100) takes up 200 bytes, even though it only holds the text "hello"; the rest is padded with blanks.
3) A hash is generated based on the distribution column chosen for the table
4) For each distribution (number of compute nodes * 8 distributions per node, i.e. 16 in our example) a cache is created where the row is stored


5) The queue entries for each distribution are sent to the node that owns that distribution. In our case the rows for distributions 1-8 are sent from both nodes to the reader queue on the first node
6) To finally write the data to the database, the writer queue takes the data out of the reader queue and stores it


7) The same happens for distributions 9-16: the data is written from both compute nodes to the distributions on the second compute node
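
After the load it is worth checking how evenly the rows landed in the 16 distributions, because heavy skew usually points to a poorly chosen distribution column. One way to do this (again using the hypothetical table from above) is DBCC PDW_SHOWSPACEUSED, which reports the row count and space usage per distribution:

    -- Row count and space usage per distribution of the loaded table.
    DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");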


Now that the process of loading data into the PDW has been explained, what exactly do we have to keep in mind so that we don't run into any issues during the load process and still get the best query performance?
  • Make sure that the loaded table doesn't exceed the 32 KB row-width limit. Even though a table with 5x nvarchar(max) can be created, a load with dwloader is not possible, since the conversion blows the row up to ~40 KB (roughly 8 KB per nvarchar(max) column once it is expanded to its maximum). Even if you create the table and write a row into it with Data Tools, you run into another problem.
  • Let's assume you have created the scenario mentioned above. If you query the table and any kind of data movement is necessary, the conversion happens again and PDW cannot execute the statement. This is because the rows are blown up to their maximum width again so that the DMS can shuffle the data.
  • If you size your columns too big (e.g. 2x int and 4x nvarchar(max)), you shuffle way too much data, because each column is extended to its maximum width, as the sketch below illustrates.
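
As a hedged sketch of the last point (the table and column names are made up), compare an oversized design with a right-sized one. The byte figures in the comments follow the 200 bytes for nvarchar(100) from step 2 and the ~40 KB for five nvarchar(max) columns mentioned above:

    -- Oversized: every nvarchar(max) column is expanded to its maximum
    -- (~8 KB each) whenever the DMS converts or shuffles the rows, so the
    -- row gets huge even if the actual values are short.
    CREATE TABLE dbo.CustomerNotes_Oversized
    (
        CustomerKey int           NOT NULL,
        NoteKey     int           NOT NULL,
        Note1       nvarchar(max) NULL,
        Note2       nvarchar(max) NULL,
        Note3       nvarchar(max) NULL,
        Note4       nvarchar(max) NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerKey));

    -- Right-sized: nvarchar(200) expands to at most 400 bytes per column,
    -- so the converted row stays far below the 32 KB limit and far less
    -- data has to be shuffled.
    CREATE TABLE dbo.CustomerNotes
    (
        CustomerKey int           NOT NULL,
        NoteKey     int           NOT NULL,
        Note1       nvarchar(200) NULL,
        Note2       nvarchar(200) NULL,
        Note3       nvarchar(200) NULL,
        Note4       nvarchar(200) NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerKey));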