René Schwarz

Data Quality Guard: Data Loss

Transferring large amounts of data is usually quite time consuming. Since in many cases the transfer spans several environments (e.g. production environment, stage and cube), a complete run can easily take 10 hours or more. And because a successful data transfer depends on many factors, such as stable network connections, properly populated source tables and so on, the number of transferred records may deviate from the expected result: a complete set covering the total number of records in the productive environment. This is the first article of a blog series in which I will outline a possible solution for tracking such deviations.

In order to detect deviations in the number of transferred records early and to react to them as soon as possible (instead of waiting for a long load to complete first), it is helpful to have checks performed automatically after each transfer step. If a noteworthy deviation is found, the transfer can be stopped, the problem analysed and, finally, the transfer restarted. This can easily be achieved by defining queries that count the relevant records, or calculate the relevant values, in each environment and compare the results with each other.

One example might be: “Get the number of yesterday’s orders in the frontroom environment via SQL and the number of yesterday’s orders in the cube via MDX, and calculate the deviation between both values. If the deviation is less than 0.5%, the check is successful; if the deviation is between 0.5% and 1.5%, produce a warning; if the deviation exceeds 1.5%, produce a critical warning!”
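As a sketch of what the SQL side of such a check could look like (the table dbo.Orders and the column OrderDate are assumed names for this example; the MDX counterpart against the cube would return the same measure for the same day):

```sql
-- Hypothetical frontroom query: count yesterday's orders.
-- dbo.Orders and OrderDate are placeholder names, not part of the original solution.
SELECT COUNT(*) AS OrderCount
FROM dbo.Orders
WHERE OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE))
  AND OrderDate <  CAST(GETDATE() AS DATE);
```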

Other comparisons might be of interest as well: “Calculate the average number of daily sales over the last month and compare it to yesterday’s number of sales.” This comparison will most certainly require different limits than the day-to-day comparison described before.
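A possible T-SQL sketch for the “average daily sales” side of such a comparison; dbo.Sales and SaleDate are again assumed names:

```sql
-- Hypothetical query: average number of daily sales over the last 30 days.
-- dbo.Sales and SaleDate are placeholder names.
SELECT AVG(CAST(DailyCount AS FLOAT)) AS AvgDailySales
FROM (
    SELECT CAST(SaleDate AS DATE) AS SaleDay, COUNT(*) AS DailyCount
    FROM dbo.Sales
    WHERE SaleDate >= DATEADD(DAY, -30, CAST(GETDATE() AS DATE))
      AND SaleDate <  CAST(GETDATE() AS DATE)
    GROUP BY CAST(SaleDate AS DATE)
) AS Daily;
```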

In order to have these comparisons performed automatically, you might want to define a query for each value required for the desired comparisons and store these queries in a database. Additionally, it is helpful to define limits: which deviation is acceptable and which one is out of bounds and needs attention. Furthermore, you might want to use SSIS to load the queries from the database and execute them against the corresponding source environments:

[Figure: SSIS Execute SQL Task running a query loaded from a variable]
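A minimal sketch of how such check definitions and limits could be stored; all table and column names here are assumptions for illustration, not part of the original solution:

```sql
-- Hypothetical metadata table holding one row per check definition.
CREATE TABLE dbo.CheckDefinition (
    CheckId          INT IDENTITY(1, 1) PRIMARY KEY,
    CheckName        NVARCHAR(200) NOT NULL,  -- e.g. "Yesterday's orders: frontroom vs. cube"
    SourceQuery      NVARCHAR(MAX) NOT NULL,  -- SQL or MDX text producing the source value
    TargetQuery      NVARCHAR(MAX) NOT NULL,  -- SQL or MDX text producing the target value
    SourceConnection NVARCHAR(200) NOT NULL,  -- environment to run the source query against
    TargetConnection NVARCHAR(200) NOT NULL,  -- environment to run the target query against
    WarningLimitPct  DECIMAL(5, 2) NOT NULL,  -- e.g. 0.5
    CriticalLimitPct DECIMAL(5, 2) NOT NULL,  -- e.g. 1.5
    ScheduledWeekday TINYINT NULL             -- 1-7, NULL = run every day
);
```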

The package should store the results in the database and calculate the deviations. Once this is done, the freshly calculated deviations have to be compared against the defined limits in order to decide which severity is assigned to each deviation.
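The severity assignment could be expressed as a simple CASE over the calculated deviation, for instance as in the following sketch. It assumes a hypothetical dbo.CheckResult table in which the package has already stored the two values and their deviation:

```sql
-- Hypothetical severity assignment: compare the deviation of each result
-- against the warning and critical limits of its check definition.
SELECT r.CheckId,
       r.SourceValue,
       r.TargetValue,
       r.DeviationPct,
       CASE
           WHEN r.DeviationPct >  d.CriticalLimitPct THEN 'CRITICAL'
           WHEN r.DeviationPct >= d.WarningLimitPct  THEN 'WARNING'
           ELSE 'OK'
       END AS Severity
FROM dbo.CheckResult AS r
JOIN dbo.CheckDefinition AS d
    ON d.CheckId = r.CheckId;
```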

The next step is to decide at what time and on which day of the week this SSIS package has to be executed. You might have transfers that run on Sundays only, so there would be no need to perform any checks on other days of the week. By defining parameters for your SSIS package you can execute it for dedicated comparisons, with each execution steered by the scheduled day and time.
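If the scheduled weekday is kept in the metadata table sketched above, the package could load only the checks that are due on the current day, for example like this (again a hypothetical sketch):

```sql
-- Hypothetical selection of the checks that are due today, based on the
-- ScheduledWeekday column of the assumed dbo.CheckDefinition table.
-- Note: the value returned by DATEPART(WEEKDAY, ...) depends on the DATEFIRST
-- setting, so the mapping has to match the one used in the metadata.
DECLARE @Today TINYINT = DATEPART(WEEKDAY, GETDATE());

SELECT CheckId, CheckName, SourceQuery, TargetQuery
FROM dbo.CheckDefinition
WHERE ScheduledWeekday = @Today
   OR ScheduledWeekday IS NULL;  -- NULL = run every day
```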

Finally, in order to get a quick and easy-to-grasp overview of the results, you might want to define an SSRS report that displays the results grouped by severity, and maintain parameterized subscriptions for these reports so that you receive the results automatically via mail.
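The dataset behind such a report could simply group the stored results by severity, e.g. as in this sketch (assuming the hypothetical dbo.CheckResult table from above, extended by a CheckDate column):

```sql
-- Hypothetical report dataset: number of checks per severity for a given day.
DECLARE @ReportDate DATE = CAST(GETDATE() AS DATE);  -- in SSRS this would be a report parameter

SELECT r.Severity,
       COUNT(*) AS NumberOfChecks
FROM dbo.CheckResult AS r
WHERE r.CheckDate = @ReportDate
GROUP BY r.Severity
ORDER BY CASE r.Severity
             WHEN 'CRITICAL' THEN 1
             WHEN 'WARNING'  THEN 2
             ELSE 3
         END;
```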

In my last project we had to compare quite a number of sources with various restrictions and filters. Some queries were quite similar to others and only required parameterisation in order to be reused. Some environments required SQL queries, others MDX queries. In my next article I will describe how we accomplished this task.