Hilmar Buchta

PDW 2012 | SQL Server 2005-2014

With database servers getting more and more powerful, some traditional concepts of Business Intelligence solution may be reconsidered. One of those concepts for which you can find a lot of lively debates recently is the question of whether to use ETL or ELT.

Here are just a few of the blog posts and discussions you can find on this topic:

In short, the main differences are shown in the table below:

  ETL ELT
Characteristics
  • ETL=Extract-Transform-Load
  • Transformation done in the ETL tool (data flow pipeline), only the finally prepared data is loaded to the data warehouse data base
  • ELT=Extract-Load-Transform
  • Original data is loaded into database, then SQL is used to transform the data into the dimensional model
Pros
  • Well established ETL tools available with debugging, build in logging, configuration, error handling and process reporting and statistics
  • comprehensive and easy to maintain data flow makes it easy to merge data from different sources, use of data quality tools and individually deal with error rows.
  • ELT fully benefits from the database power, query optimizer and so on.
  • Especially for MPP environments (like the PDW): Scaling the database means scaling up for ELT process perfromance as well
  • SQL code is easier to maintain in source control systems (like TFS) than ETL packages (complex XML).
Cons
  • ETL pipeline tools support multiple cores, but parallel IO has to be solved programmatically: you have to do something
    (for example use the balanced data distributor SSIS component or merge multiple inputs from the same source table)
  • ETL tools are build for row based operations. Operations that need to be performed on a set of rows (like sort and aggregate or calculations covering multiple rows) are harder to solve. I wrote some posts recently about ELT calculations that are relatively difficult to solve in ETL.
  • SQL is harder to read, to structure and to document compared to ETL packages
  • You need discipline as minor wrongness may lead to errors that are hard to track down (e.g. too many resulting lines from join operation if a key is missing in the join)

This comparison is by far not complete and if you read the links above (and many others that target this topic) you can find a lot more pros/cons and opinions. In fact, I don’t want to say one is better than the other. But here is what we recently found to work well in a large project using the Parallel Data Warehouse (PDW) for an initially 30TB (and growing) database. The following illustration which I recently used on SQL Conference 2014 in Germany shows the key concepts:

image

 

We’re using ETL (Microsoft SQL Server Integration Services) to

  • Orchestrate the load process
    • workflow management (make sure the right things happen in the right order)
    • dealing with technical challenges (e.g. temporary tables, partition switching on PDW)
    • implement configuration management (for example server and database names)
    • logging and process monitoring (reports)
  • Load dimensions (small amount of data)
    • Collecting master data from source systems, merge and prepare this data
    • Generation of surrogate keys for the dimensions
    • Keeping track of historical changes (modeled as SCD2 or intermediate bridge tables)
    • Building up all dimensions and transferring the data to the PDW (using a reload operation)
    • Early arriving facts (create missing dimension rows, distinct counts run on PDW)

Why?

  • Integration Services (SSIS) well suited for these tasks
  • SMP SQL Server offers good support for dimension processing tasks (identity column, T-SQL merge statement etc.)
  • Additional services like SQL Server Data Quality Service (DQS) and SQL Server Master Data Services (MDS) are currently not supported to run on the PDW
    This is also true for more sophisticated tasks and the use of web services for example to find duplicate customers,to correct misspelled street names, to guess the gender from the first name. Also if you need to use custom assemblies, for example to access special source systems or include specific calculations, ETL tools are the better choice.

 

Then, we’re using ELT (distributed SQL, DSQL) on the Parallel Data Warehouse to

  • process fact data (large amount of data) after it is bulk loaded with no modifications into a staging database on the PDW
    • Data preparation (for example removing duplicate rows)
    • Linking fact table data to dimensions
    • Performing calculations (using SQL window functions intensively)
  • Merge new data to archive
    • store the data in the persisted stage area (without creating duplicates if the data was there already)

Why?

  • Much better performance observed compared to SMP SQL Server/SSIS
    • in our case, usually about 10-20 times faster, depending on the source data and the transformations
    • In some cases (for example removing duplicate rows in the source data) even 100 times faster
  • Faster loads allow us to fully reload many TB in case this is needed (this gives more options for backup strategies and for the dimensional modeling)
  • Solution will directly benefit from future MPP scale up without any need of coding

 

CONCLUSION

ETL and ELT may work well together. In this scenario we did the dimension processing as well as the full workflow management using ETL tools (SSIS on an SMP SQL Server) and the processing of the large transactional tables using ELT (distributed SQL on PDW).