Hilmar Buchta

SQL Server 2012

One of the improvements in SQL Server 2012 is the set of advanced window functions, which allow us, for example, to refer to a record before or after the current record in a result set. Since databases are getting more and more powerful, standard ETL processes may not be the fastest approach when it comes to complex transformations involving a lot of data. In these cases, SQL may execute faster, especially on a massively parallel processing platform like the Parallel Data Warehouse.

To illustrate this point, let’s assume you have some sensor data which looks like this:

[Screenshot: sample rows from the SensorData table]

Our sensors report information about temperature and throughput at an interval of approximately 15 minutes. The sensors also count errors and report them if they occurred.
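To make the examples below easy to follow, here is a minimal sketch of what the table could look like. The schema is a hypothetical reconstruction from the description above; the actual column names and types in the source system may differ:

```sql
-- Hypothetical schema for the sensor data (an assumption for illustration)
create table [dbo].[SensorData] (
    SensorID      int           not null
    , [DateTime]  datetime2     not null
    , Temperature decimal(5,2)  null
    , Throughput  decimal(9,2)  null
    , FailCount   int           not null default 0  -- errors reported at this reading
);

-- A few made-up sample rows, roughly 15 minutes apart
insert into [dbo].[SensorData] (SensorID, [DateTime], Temperature, Throughput, FailCount)
values
      (1, '2012-01-01 08:00', 20.5, 100.0, 0)
    , (1, '2012-01-01 08:15', 20.7,  98.5, 0)
    , (1, '2012-01-01 08:31', 21.0,  97.0, 1)
    , (1, '2012-01-01 08:46', 20.9,  99.1, 0);
```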

Now, imagine you are a data scientist, and your intuition tells you that a certain problem may depend on the time that has passed since the last failure. In order to prove this, you want to set up a mining model. But what you need is the time elapsed since the latest failure. I don’t want to concentrate on the data mining part here, but rather on the time calculation.

You could imagine creating an ETL process (for example a script task) to compute this, but in this post I will show an approach that uses T-SQL. One obvious solution would be cursors, but since they are slow and we have a large number of rows of sensor data, we need to find a different approach.

In the first step I’m adding a column that shows the timestamp of the previous measurement for each sensor. This can be done using the lag function. In the following T-SQL query, I also compute the number of seconds that have passed between the two measurements:

select
    sensorid
    , [datetime]
    , lag([DateTime],1) over (partition by sensorid order by [DateTime]) LastDateTime
    , datediff(
        second
        , lag([DateTime],1) over (partition by sensorid order by [DateTime])
        , [DateTime]
    ) SecondsPassed
from [dbo].[SensorData]

Here is what the result looks like:

[Screenshot: query result with the LastDateTime and SecondsPassed columns]

The partitioning part of the lag function is used here to make sure that we start all over again whenever a new sensor is reached. Since the first record of each sensor has no prior record, lag returns NULL there.
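If the NULL on the first row of each sensor is undesirable, lag accepts an optional third argument, a default value that is returned when there is no prior record. A small variation of the query above could fall back to the current timestamp, yielding 0 seconds instead of NULL:

```sql
select
    sensorid
    , [datetime]
    , datediff(
        second
        -- third argument: default used when no previous row exists
        , lag([DateTime], 1, [DateTime]) over (partition by sensorid order by [DateTime])
        , [DateTime]
    ) SecondsPassed   -- 0 instead of NULL on the first row per sensor
from [dbo].[SensorData]
```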

So you can easily create time deltas using this technique. But of course, this alone doesn’t solve our problem. What we need is a running total over the seconds. That is still not very complicated, but the running total has to go back to 0 whenever a failure happens. One approach to solving this is to first create a running total over the error count (the total number of errors per sensor so far). Since this number only changes when an error occurs, we can use it as the partition for a window function that calculates the running total of the seconds. Sounds confusing? Well, let’s do it step by step. The first step is to include the total number of errors per sensor. For simplicity I removed the other columns from my query, so this is how we could compute the total errors:

select
    sensorid
    , [datetime]       
    , datediff(
        second
        , lag([DateTime],1) over (partition by sensorid order by [DateTime])
        , [DateTime]
        ) SecondsPassed
    , [FailCount]
    , sum(FailCount) over (partition by sensorid order by [DateTime]) TotalFailCount
--into #Tmp
from [dbo].[SensorData]


[Screenshot: query result including the TotalFailCount running total]

This also shows the power of the window functions, as this is a very elegant way to create a running total. The partition clause (partition by sensorid) makes sure that the running total starts over at 0 for every new sensor. Now, in order to compute the number of seconds since the last error, we can use this running total as the partition (together with the sensorid). To show this, I’m going to write the result of the query above to a temporary table by removing the -- in front of “into #Tmp”.

I can now refer to the computed running total in the partition expression of the running total of the SecondsPassed column:

select
    sensorid
    , [datetime]
    , [FailCount]
    , TotalFailCount
    , SecondsPassed
    , sum(SecondsPassed) over (
          partition by sensorid, TotalFailCount order by [DateTime]
      ) SecondsUntilLastFailure
from #Tmp

And here is the result:

[Screenshot: result with SecondsUntilLastFailure restarting at each failure]

So, whenever an error occurred, the running total starts all over again. Isn’t this cool? Well, you might say it would be better if SecondsUntilLastFailure started with 0 whenever an error occurs. In order for this to happen, we have to subtract the first value of each partition. Sounds complicated? Well, we’re lucky, since there is a first_value function! So, here is the final code for our query:

select
    sensorid
    , [datetime]
    , [FailCount]
    , TotalFailCount
    , SecondsPassed
    , sum(SecondsPassed) over (
          partition by sensorid, TotalFailCount order by [DateTime]
      )
      - first_value(SecondsPassed) over (
          partition by sensorid, TotalFailCount order by [DateTime]
      ) SecondsUntilLastFailure
from #Tmp

And here is the result, now starting with a delta of zero seconds on an error:

[Screenshot: final result with SecondsUntilLastFailure starting at zero on each failure]

As you can see, the column SecondsUntilLastFailure is now computed correctly. Whenever a failure occurs, the counter drops to zero and then increments until the next error happens. Of course, you could also compute the next (not the prior) timestamp to avoid the need for the first_value function, but I wanted to show that function here too.
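One way to read that alternative, sketched here as a single query without the temporary table: with lead, the delta to the next measurement is attributed to the current row, so summing all deltas before the current row yields the seconds since the failure that opened the partition. Note the slightly different edge behavior, on the error row itself the frame is empty and the result is NULL rather than 0:

```sql
-- Sketch of the lead-based variation (one possible interpretation)
select
    sensorid
    , [datetime]
    , TotalFailCount
    , sum(SecondsToNext) over (
          partition by sensorid, TotalFailCount
          order by [DateTime]
          -- exclude the current row, so the error row starts the count
          rows between unbounded preceding and 1 preceding
      ) SecondsUntilLastFailure
from (
    select
        sensorid
        , [datetime]
        -- seconds from this measurement to the next one of the same sensor
        , datediff(
            second
            , [DateTime]
            , lead([DateTime], 1) over (partition by sensorid order by [DateTime])
        ) SecondsToNext
        , sum(FailCount) over (partition by sensorid order by [DateTime]) TotalFailCount
    from [dbo].[SensorData]
) x
```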

So, T-SQL window functions are a very powerful addition to SQL and can be used in quite a lot of scenarios. I’ll get back to another interesting scenario in my next post. And there is still a lot more to discover with functions like row_number, rank, dense_rank, ntile and the common aggregate functions like sum, max, min or avg.