Author Archive

Moving Average in DAX / BISM Tabular

Hilmar Buchta

SQL Server Denali | PowerPivot

Alberto Ferrari already wrote about calculating moving averages in DAX by using a calculated column. I’d like to present a different approach here by using a calculated measure. For the moving average I’m calculating a daily moving average (over the last 30 days) here.

For my example, I’m using the PowerPivot workbook which can be downloaded as part of the SSAS Tabular Model Projects from the Denali CTP 3 samples.

In this post, I’m developing the formula step by step. However, if you are in a hurry, you might directly want to jump to the final results below.

With calendar year 2003 on the filter, date on columns and sales amount (from table Internet Sales) in the details, the sample data looks like this:

t1

In each row’s context, the expression ‘Date’[Date] gives the current context, i.e. the date for this row. But from a calculated measure we cannot refer to this expression (as there is no current row for the Date table), instead we have to use an expression like LastDate(‘Date’[Date]).

So, in order to get the last thirty days we can use this expression

DatesInPeriod(‘Date’[Date],LastDate(‘Date’[Date]),-30,DAY)

We can now summarize our internet sales for each of those days by using the summarize function:

Summarize(
  DatesInPeriod(‘Date’[Date],LastDate(‘Date’[Date]),-30,DAY)
  ,’Date’[Date]
  , "SalesAmountSum"
  , Sum(‘Internet Sales’[Sales Amount])
)

And finally, we’re using the DAX function AverageX to compute the average of those 30 values:

Sales Amount (30d avg):=AverageX(
  Summarize(
    DatesInPeriod(‘Date’[Date],LastDate(‘Date’[Date]),-30,DAY)
    ,’Date’[Date]
    , "SalesAmountSum"
    , Sum(‘Internet Sales’[Sales Amount])
  )
  ,[SalesAmountSum]
)

This is the calculation that we are using in our Internet Sales table as shown in the screenshot below:

t2

When adding this calculation to the pivot table from above, the result looks like this:

image

Looking at the result it seems that we don’t have any data prior to January 1, 2003: The first value for the moving average is identical to the day value (there are no rows before that date). The second value for the moving average is actually the average of the first two days and so on. This is not quite correct but I’m getting back to this problem in a second. The screenshot shows the computation for the moving average of at January 31 as the average of the daily values from January 2 to 31.

Our calculated measure also works fine when filters are applied. In the following screenshot I used two product categories for the data series:

t3

How does our calculated measure work on higher aggregation levels? In order to find out, I’m using the Calendar hierarchy on the rows (instead of the date). For simplicity I removed the semester and quarter levels using Excel’s pivot table options (Show/Hide fields option).

t4

As you can see, the calculation still works fine. Here, the monthly aggregate is the moving average for the last day of the specific month. You can see this clearly for January (value of 14,215.01 also appears in the screenshot above as the value for January 31). If this was the business requirement (which sounds reasonable for a daily average), then the aggregation works fine on a monthly level (otherwise we will have to fine tune our calculation and this will be a topic of am upcoming post).

But although the aggregation makes sense on a monthly level, if we expand this view to the day level you’ll see that our calculated measure simply returns the sales amount for that day, not the average of the last 30 days anymore:

t5

How can this be. The problem results from the context in which we calculate our sum, as highlighted in the following code:

Sales Amount (30d avg):=AverageX(
  Summarize(
    datesinperiod(‘Date’[Date],LastDate(‘Date’[Date]),-30,DAY)
    ,’Date’[Date]
    , "SalesAmountSum"
    , Sum(‘Internet Sales’[Sales Amount])
  )
 
,[SalesAmountSum]
)

Since we evaluate this expression over the given dates period, the only context that is overwritten here, is ‘Date’[Date]. In our hierarchy we’re using different attributes from our dimension (Calendar Year, Month and Day Of Month). As this context is still present, the calculation is also filtered by those attributes. And this explains why we the current day’s context is still present for each line. To get things clear, as long as we evaluate this expression outside of a date context, everything is fine as the following DAX query shows when being executed by Management Studio on the Internet Sales perspective of our model (using the tabular database with the same data):

evaluate (
    Summarize(
        datesinperiod(‘Date’[Date],date(2003,1,1),-5,DAY)
        ,’Date’[Date]
        , "SalesAmountSum"
        , Sum(‘Internet Sales’[Sales Amount])
    )
)

Here, I reduced the time period to 5 days and also set a fixed date as LastDate(…) would result in the last date of my date dimension table for which no data is present in the sample data. Here is the result from the query:

t6

However, after setting a filter to 2003, no data rows outside of 2003 will be included in the sum. This explains the remark above: It looked like we only have data starting from January 1, 2003. And now, we know why: The year 2003 was on the filter (as you can see in the very first screen shot of this post) and therefore it was present when calculating the sum. Now, all we have to do is to get rid of those additional filters because we’re already filtering our results by Date. The easiest way to do so, is to use the Calculate function and apply ALL(…) for all attributes for which we want to remove the filter. As we have some of those attributes (Year, Month, Day, Weekday, …) and we want to remove the filter from all of them but the date attribute, the shortcut function ALLEXCEPT is very useful here.

If you do have an MDX background you will wonder why we don’t get a similar problem when using SSAS in OLAP mode (BISM Multidimensional). The reason is that our OLAP database has attribute relationships, so after setting the date (key) attribute, the other attributes are automatically changed too and we don’t have to take care about this (see my post here). But in the tabular model we don’t have attribute relationships (not even a true key attribute) and therefore we need to eliminate unwanted filters from our calculations.

So here we are with the …

 

Final results

Sales Amount (30d avg):=AverageX(
  Summarize(
    datesinperiod(‘Date’[Date],LastDate(‘Date’[Date]),-30,DAY)
    ,’Date’[Date]
    , "SalesAmountSum"
    , calculate(Sum(‘Internet Sales’[Sales Amount]), ALLEXCEPT(‘Date’,'Date’[Date]))
  )
,[SalesAmountSum]
)

And this is our final pivot table in Excel:

t7

To illustrate the moving average, here is the same extract of data in a chart view (Excel):

t8

Although we filtered our data on 2003 the moving average for the first 29 days of 2003 correctly takes the corresponding days of 2002 into account. You will recognize the values for January 30 and 31 from our first approach as these were the first days for which our first calculation had a sufficient amount of data (full 30 days).

  • Share/Bookmark

Semi additive measures in DAX / BISM Tabular

Hilmar Buchta

SQL Server Denali | PowerPivot

Semi additive measures, i.e. measures that have to be aggregated differently over different dimensions, are commonly used in BI solutions. One example could be stock levels. Of course we don’t want to sum them up over time, but only over product, location etc. For the time, a different aggregation is used, for example average or last value.

The following example shows how to implement some of the most commonly used semi additive measures in DAX.

In my example I’m using PowerPivot (Denali edition), but the same calculations can be used in a BISM Tabular model in Visual Studio.

In order to keep things simple, I’m using just a short table of test data:

p1

As you see, we only have two products with monthly stock levels in 2010 and 2011.

Although not needed for my semi additive measures, I created additional columns in my PowerPivot sheet for convenient reasons: Year, Month, Day (using the corresponding DAX-function with the same name). I also set the newly created columns, as well as the Stocklevel column to hidden (it makes no sense to sum up the stock level). Although the date information is kept in the same table as the data to keep things simple for this example, I encourage to build a separate date dimension table here (similar idea as with a date dimension in a multidimensional model).

Finally, I created a hierarchy named ‘Calendar’ on my newly created date columns:

p1

Now we’re ready for the semi additive measures.

Average (over time)

Let’s start with an easy one, the average over time. Since we can easily compute the distinct count of our date values, we can simply add up the stock level and divide it by the distinct count. In my example the formula looks like this:

Avg Stock:=Sum([Stocklevel])/DISTINCTCOUNT([Date])

 

Last value (over time)

In order to compute the last value, the DAX function LASTDATE comes in handy. Here is the formula:

Last Stock:=CALCULATE(SUM([Stocklevel]),LASTDATE(‘Stock’[Date]))

 

Min/Max value (over time)

For min/max we have to be a little bit more tricky. In the approach I’m showing here, I’m grouping the table by date by using the SUMMARIZE function and the SUM aggregation. Then I’m using the function MINX or MAXX to find the minimal or maximal value.

Here are the two formulas:

Max Stock:=MINX(SUMMARIZE(‘Stock’,'Stock’[Date],"SumByDate",SUM(‘Stock’[Stocklevel])),[SumByDate])

Max Stock:=MAXX(SUMMARIZE(‘Stock’,'Stock’[Date],"SumByDate",SUM(‘Stock’[Stocklevel])),[SumByDate])

To understand these formulas you can see the effect pretty well after restoring the PowerPivot workbook to a SSAS server in tabular mode. After doing so, we can create a query to show the result of the inner SUMMARIZE function using this DAX query:

evaluate(
SUMMARIZE(‘Stock’,'Stock’[Date],"SumByDate",SUM(‘Stock’[Stocklevel]))
)

Here’s the result:

t4

The MinX or MaxX function simply takes the lowest/highest value from this table.

Now let’s see, how this looks like in Excel. The following screenshot shows the calculations in my PowerPivot sheet:

t2

 

Here’s the result in Excel

t1

And of course, the aggregations also work correctly when filtering the data as shown below (single select on product and multi select on months):

t3

Another cool feature is that besides DAX we can still use standard MDX to query our SSAS tabular model, for example:

select
{[Measures].[Avg Stock],[Measures].[Last Stock],
[Measures].[Min Stock],[Measures].[Max Stock]} on 0,
[Stock].[Calendar].[Year] on 1
from [Model]

t5

A final word about empty (missing) rows: The above calculations need a value of zero as the information that there is no stock at that month. If the value is left blank (no source data row at all), the month itself is treated as missing (interpretation more like we didn’t have this product in our portfolio at all).

  • Share/Bookmark

Parallel hierarchies in a parent-child structure

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

This post is about a problem I faced some years ago. The source system was SAP with user defined hierarchies, in this case within the cost center and cost type tables. Parallel hierarchies are well supported in SQL Server BI but in this case, users in SAP could define multiple hierarchies on their own and they wanted these hierarchies to be also available in the OLAP cube. For example, costs associated with the cost center 1000 should be analyzed as shown below:

hierarchy

All costs that are booked on cost center 1000 have to appear in the hierarchy as shown in the sketch. And end-users may also be able to create new hierarchies (for example to analyze a certain project). Of course there may be better ways to model this but in this case we had basically two tables for the cost centers:

  • table CC (Cost Center)
  • table CCG (Cost Center Group)

Table CC contains all cost centers (for example the above cost center 1000) together with additional information (like name, responsible person etc.) while table CCG contains the hierarchy. in CCG we basically find two columns:

  • Node name
  • Parent node name
Node Parent Node
1000 Internal_HR_DE
Internal_HR_DE HR_DE
HR_DE Germany
1000 Marketing_DE
Marketing_DE Germany
Germany Corporate
Marketing_DE Marketing
Marketing Corporate

Facts (in this case actual or planned costs) are associated with the cost center number (for example 1000). Usually, parent-child hierarchies may be used in this case where we have a very dynamic structure and we do not know the number of levels. However, parent-child may only be used if each node has at most one parent. But here we find the cost center 1000 having two parents (Internal HR Costs DE and Marketing_DE). The same situation exists with the Marketing_DE node (having parents Marketing and Germany).

The solution I’m presenting here is to create additional nodes until each node only has one parent. This is possible as each node of a parent-child hierarchy in SSAS has a name and a key property. So, the name will be identical, while the key will be different. In order to show the process, let’s add internal keys to each of the hierarchy elements.

hierarchy1

What we have to do now is to create additional nodes for every node that has more than one parent. Let’s start with the ‘Marketing_DE’ node:

hierarchy2

The additional node gets a new (internal) key, in this example the number 8. But there is still a node with multiple parents: the cost center 1000. Let’s also transform this into separate nodes:

hierarchy3

After this step, each node has at most one parent and therefore the structure can be modeled as an SSAS parent-child hierarchy. 

But now, we have to think about the fact rows. Without the hierarchy, facts would have been associated to the cost center by using the internal key (surrogate key), so for example 1000 € that are booked on cost center 1000 would appear in the fact table like

DateKey CostCenterKey Amount
20110630 1 1000

But now, we have to associate this single fact row to three rows in the dimension table (as the cost center 1000 appears three times now). Therefore we have to use a many-to-many approach, so we add another table, a so called bridge table with the following rows:

CostCenterKey CostCenterDimKey
1 1
1 9
1 10

For technical reasons, our fact table has to be linked to a dimension (of flat cost centers), which is also used by the bridge table. This is shown in the following image:

hierarchy4

The most difficult part here is to “normalize” the parent-child structure. One way to do this is to use a stored procedure. Here is the code I used. Within this procedure, the following tables are used:

masterdata.Costcenter the flat table of cost centers (only leaf-level). The key field is the cost center number (for example 1000 for our cost center from above)
masterdata.CostcenterGroup the hierarchy structure as shown above
ods.CostcenterGroupExpanded Output table: the expanded tree containing the fields of the table masterdata.Costcenter plus the following additional fields:

CostcenterKey the new generated surrogate key
ParentKey the key of the parent node
Level a technical field used during iteration

Here is the code:

CREATE PROCEDURE [dbo].[ExpandCostcenterGroup]
AS
    SET NOCOUNT ON
    truncate table ods.costcenterGroupExpanded
   
    declare @level int
    declare @affectedrows int
    declare @totalrowcount int
       
    set @level=0
   
    insert into ods.costcenterGroupExpanded(costcentergroup,Parentgroup,Description1,Description2,Responsibility,AccountingArea)   
     Select distinct costcenterGroup, Parent,Description1,Description2,Responsibility,AccountingArea from masterdata.costcentergroup
       
    — Initialize all keys   
    update ods.costcenterGroupExpanded
    set ParentKey=(select min(costcenterkey) from ods.costcenterGroupExpanded where costcenterGroup=c.Parentgroup)   
    from ods.costcenterGroupExpanded as c
    where not c.ParentGroup is null       
   
    set @affectedrows=1   
   
    while @affectedrows>0
    begin
   
        Set @level=@level+1
       
        set @totalrowcount=(select Count(*) from ods.costcenterGroupExpanded)
   
        insert into  ods.costcenterGroupExpanded(costcentergroup,Parentgroup,ParentKey,"level",Description1,Description2,Responsibility,AccountingArea)
        select distinct cparent.costcentergroup, cparent.Parentgroup,cparent.costcenterKey,@level,cparent.Description1,cparent.Description2,cparent.Responsibility,cparent.AccountingArea
        from ods.costcenterGroupExpanded as cparent inner join ods.costcenterGroupExpanded as cchild on
        cparent.Parentgroup=cchild.costcenterGroup
        where cparent.ParentKey!=cchild.costcenterKey
        and cchild."Level"=@level-1
       
        set @affectedrows=@@rowcount
       
    end
   
    return

GO

To keep things simple, I truncate the output table CostcenterGroupExpanded here. However, there is a drawback with this approach: The surrogate keys may change after changes of the imported source tables. This will result in a problem for example for the Excel users. If you’re using filters like ‘show this element only’, only the key is stored.

In order to avoid this you will need to store the mapping and the assigned surrogate key separately. Here it is necessary not only to store the combination of cost center/parent cost center/surrogate key but the whole branch up to the root instead. If you look at the example above you will find two entries of ‘Cost Center 1000’ –> ‘Marketing_DE’, so this is not unique. You have to store the full path up to the root for each node (not only for the leaf-nodes) to make it unique:

Path Given Surrogate Key
1000 –> Internal_HR_DE –> HR_DE –> Germany –> Corporate 1
1000 –> Marketing_DE –> Germany –> Corporate 9
1000 –> Marketing_DE –> Marketing –> Corporate 10
Marketing_DE –> Germany –> Corporate 7
Marketing_DE –> Marketing –> Corporate 8
 

In order to store the full path up to the root level I recommend using a hash code (MD5 for example) as this is easier to handle as a long list of node names. In this case our additional key store table would look like this

PathMD5 Given Surrogate Key
417913d10ef49f5ff90db9db9f3d2569 1
8e27be6b156a52016e01dc049bc39126 9
52b1bcaec016e09d4086f37e63814aa5 10
 

The sample code above does not manage this key store table so the keys may change a lot on each load. But for practical purposes you will have to add this key management to make sure the same node always gets the same key.

  • Share/Bookmark

Same measure in different granularity

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

In an earlier post I wrote about handling different granularity in a dimension-fact relationship. This time I want to get back to this topic from the end-user perspective. To illustrate this, I use a very simple data model here:

Unbenannt5

For this model, we have daily revenue data and monthly revenue plan data. Here is the link from the measure groups to the dimensions:

Unbenannt6

In this example, both measure groups only have one measure: revenue. Since measures are a dimension of its own in the cube, measure names have to be distinct, so I named the measures “Revenue” and “Plan Revenue”.

Unbenannt7

If your cube contains more data from other measure groups, things can soon be getting confusion with many measure groups. This post is about two common solutions in order to reduce the number of measure groups here.

Two approaches are shown here:

  • Combining both measures in one measure group by using a calculated measure
  • Introducing a scenario dimension

Before we start, one remark. The last post was about using translations in the cube the add more flexibility regarding the naming of the measures. Since the measures are clearly associated with a measure group you could be tempted to use the same name for both measures (which is possible for translations). This is how the result would look like in the cube browser:

Unbenannt8

However, when using a Pivot client like Excel you usually don’t see the measure group anymore. A simple analysis could look like this:

Unbenannt9

So from your Pivot table you cannot tell which is the actual and which is the plan revenue. Therefore I recommend not to use the same name for different measures in the different measure groups.

Combining both measures in one measure group by using a calculated measure

One possible option is to add a calculated measure the the Revenue measure group. In order to do so, I renamed the plan revenue to “Plan Revenue Internal” and set its visibility to hidden. Then the calculation could be as shown below:

Unbenannt10

This is how the cube looks like in the cube browser:

Unbenannt11

By “copying” the plan measure into the revenue measure group you only have one visible measure group left, while both values are still separated by the name.

As I started the post about the different granularity, how does this look like at the detail level? The following screen shot shows the date dimension at the day level:

Unbenannt12

Assuming you have set the “IgnoreUnrelatedDimensions” property of the plan revenue measure group to false, only actual values are displayed here, no plan values. So this approach works really well.

 

Introducing a scenario dimension

Another common way for implementing this situation is to use a scenario dimension. The scenario dimension contains two entries: Actual and Plan. The data from the fact table FactOrder is linked to the scenario member ‘Actual’ while the data from the FactPlan table is linked to the scenario member ‘Plan’. This can be easily done in the data source view (DSV) as shown below for the plan table:

image

The fact tables are linked to the scenario dimension in the data source view:

t1

The scenario dimension should be set IsAggregatable=False (as it makes no sense to aggregate actual and plan data). Also we should provide a default element. This is shown in the screenshot below:

image

For the cube, we only want to have one single measure revenue which is neither of the two existing measures. Therefore we make both existing measures invisible. In order to distinguish them from visible measures, I prefixed them with an underscore:

image

Both measures have their Visible-property set to false. The only visible measure in this case has to be a calculated measure that takes the data from one of the two invisible measures depending on the chosen scenario. Since our scenarios are distinct and not aggregatable, we can simply add both measures (as one of the two is always zero):

image

Choosing this model for implementing the different granularity, the cube shows only one measure:

image

If the scenario dimension is not used, actual values are shown. You can still easily display both values by putting the scenario dimension on one of the axis:

image

If you compare this screenshot with the one from above you will see the same values, only the presentation is a little bit different (first option had two measures, last option one measure and a scenario dimension).

Summary

Both options shown here enable us to have the same measure in two or more measure groups of different granularity. The second option with the scenario dimension looks a little bit more tidy from a technical perspective. However, sometimes the first option is easier to use, especially when you add other calculated measures with formulas that are combining two or more scenarios (for example a measure like ‘Actual to plan ratio’). For those measures it’s not possible to assign them to one scenario. On the other hand, choosing the scenario dimension can dramatically reduce the number of visible measures in your cube, which makes the cube easier to understand for end-users. So depending on the requirements and the structure of the data, one of the two options will be the best to choose.

  • Share/Bookmark

Valid characters for measure names

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

The name for measures in a cube should be clear and easy to understand to avoid confusion for the end-users. However, not all characters are allowed for cube measures. Here is a list of characters, that is not allowed:

. , ; ‘ ` : / \ * | ? " & % $ ! + = ( ) [ ] { } < >

Some of these characters are allowed for calculated measures although I wouldn’t recommend this as it reduces the readability of the your MDX queries (see below).

For example, you cannot give the name “Avg. Sales Amount” to a built-in cube measure (because of the dot). Also sometimes the end-users want to see the unit with the measure name and one frequently used notation for this is to use square brackets like “Utilization [%]” (here the square brackets and the percent-sign violate the list of allowed characters).

On the other hand, many special characters are allowed. The following characters are just an example of the long list of symbols within the normal text:

Ø © α € £ ¥ ® ™ ± ≠ ≤ ≥ ÷ × ∞ µ

For example, it’s quite common to use the Ø-sign as a symbol for the average and you can name your measure “Revenue Ø”. In order to do so, one simple way is to open Microsoft Word and use the Insert Symbol dialog as shown below:

image

After inserting the symbol into a blank Word document you can simply copy it to the clip board and then paste the symbol to the measure name:

Unbenannt

The new measure name is also displayed correctly in OLAP clients like Microsoft Excel:

Unbenannt1

However, the special symbols might be more complicated to enter in certain situations where you cannot easily pick the measure from a list. For example, in MDX you also have to enter these character:

select {[Measures].[Revenue],[Measures].[Revenue Ø]} on 0,
[Dim Date].[Calendar Week].[Year] on 1
from [OLAPSample1]

For this reason I recommend measure names that only consist of simple characters and numbers, for example [AvgRevenue] instead of [Revenue Ø].

But how can we help our end-users to see the measures in the way they expect them to see? One simple solution is to use the cube translation for this purpose (Enterprise feature, you cannot do this on a Standard Edition). Within the cube translation you are not restricted to specific characters, so you can also use all of the above characters, while still keeping a simple technical name for the measure.

To show this, I added some more measures to the cube:

Unbenannt2

On the translation tab of the cube we can now add the displayed names for those measures:

Unbenannt3

And as expected, Excel correctly displays the “translated” measure names:

Unbenannt4

Using translations for the measure names also gives the flexibility to change the name of measures if required without also changing all the queries and calculations that are based on this cube (all queries use the built-in names that are now hidden from the end-user). So, translations are a good method to add a façade to your cube’s measures while also adding some more flexibility in the naming of the measures.

Although we can now be very flexible with the naming of the measures, I still recommend to keep things simple. In general, the unit of the measure is usually clear from the name and the format of the values (for example a percentage value). In this case you should avoid showing the unit with the measure name as it is only redundant in this case.

The next post will be about handling similar measures in different measure groups of different granularity (for example revenue as actual and as planned value). In this case you might question yourself how to name these measures properly. With translations you might be tempted to give two measures the same display name (which is possible!). Without anticipating too much of the next post, I don’t recommend to do this as can be very confusing to work with such a cube.

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 3 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

The last two posts were about cost optimization for a churn prevention campaign. We analyzed the following four options:

  1. Trivial option (only using average return rate and average customer value)
  2. Data Mining (using average customer value but individual return rate based on a mining model)
  3. Value driven approach (using average return rate but individual customer value)
  4. Combination of method 2 and 3 (using individual return rate and value)

As been said in the last post, the results may differ a lot from case to case. Especially the effectiveness of option 2 and 3 depend a lot on the available information for data mining (for option 2) and the variance of the customer value (for option 3). If both methods give some improvement, then the combination can be expected to be the best choice.

However, we always looked at a scenario where the trivial approach gave almost no insight. What I mean is, that for the trivial approach there is not much difference between giving every customer a voucher compared to giving no customer a voucher (costs between $12,500 and $13,750 in this example). In other words the line chart showing the costs for the trivial example was almost a horizontal line. Here is the corresponding chart from the first post (click on the image to see a larger version):

image

Now, let’s change our basic parameter a little bit. The following table shows the old and the new parameters:

 

Old scenario

New scenario

Avg. Customer Profit

$25

$50

Avg. Churn Rate

50%

80%

Voucher costs (prevention costs)

$10

$5

Over all Churn Rate with prevention

15%

5%

Here are the resulting chart (I’ve copied the chart for the old scenario from my last post):

Old scenario

New scenario

image image

While we clearly see the effect of the optimization in the old scenario (left chart), as the purple line (combined method) has a significant minimum compared to the blue line (trivial approach), in the new scenario (right chart) the minimum of all three methods is almost identical. The trivial method (blue line) has it’s minimal costs at the right end point of the line but although the other methods perform better in the middle range of the chart, they cannot deliver values that are much lower than the right end point of the blue line (for my sample data, even the combined model performs only about 1.2% better than the trivial model from our first approach in this case).

In general, a good optimization is more easy to achieve, if the trivial method has no clear decision (line is almost horizontally oriented). In the new scenario, the profit for each customer is high compared to the prevention costs and the prevention campaign is extremely efficient. In such a case you cannot expect much optimization from your value- or data mining driven approach.

So the first thing to remember from this post is to first check the trivial approach. This is the approach that is almost instantly available for most situations, even if the values are just estimations. If the result is more like the blue line in the left chart (horizontally oriented) it is very likely that you can achieve a significant optimization. If it is more like the blue line in the right chart (either falling or rising) you might only want to check further improvements if the costs for getting the churn score or the individual value are not too high. Otherwise you would risk doing an expensive optimization project with the result, that no optimization is possible.

There is still one open question from the last post and that is about the value for the voucher. Up to this point our success rate for making customers return by using a voucher was a fixed average value based on a test sample. Of course, this does not appear to be sensible: Customer with a high value might even be annoyed by a cheap voucher. Also, customers with a high churn value (likely to go away) might not respond to our voucher campaign in the same way as customers with a low churn rate (customers that are likely to return no matter if there is a voucher or not). So it’s time to add more reality to the model. To do so, we’ll have to analyze the data from our customer test samples (first post) in more detail. We only analyzing by customer value (an advanced model could also analyze by churn score, although a bigger test sample is required then). The following table shows the range for the customer profit, the voucher that was associated with that range and the number of customers who did not return although they received the voucher:

Profit Rage

Number of customers in sample

Value of Voucher

Churn rate (not returning)

0-10

78

0

31%

10-15

55

3

21%

15-25

447

5

17%

25-35

275

10

10%

>35

145

15

7%

With this extra information, we can calculate the costs for the voucher and the probability to prevent the customer from going away more precisely. I’m still using the sort order from my combined model. Here is the result in the chart view. In order to make the difference easier to see I changed the minimum for the y-axis (don’t be fooled by the different presentation):

image

The newly created model is named ‘Variable Model’ in this plot (line color cyan). Because of the different bands the line is not as smooth as for the other models. However it turns out, that this approach is the best one, based on my sample data. Also you’ll notice that the end points of the cyan line differ from the other lines’ endpoints because the costs for the vouchers are no longer a constant. Here are the detailed results from all the approaches:

Trivial Model

Profit Model

Churn Score Model

Combined Model

Variable Model

Minimal costs

12,500.00

11,761.15

11,817.60

11,228.95

10,999.31

Improvement

0.00

738.85

682.40

1,271.05

1,500.69

Improvement %

0.0%

5.9%

5.5%

10.2%

12.0%

After the three posts now it’s time for a short summary:

 

Summary

For cost optimization it’s important to first analyze the actual situation. Optimization is not always possible and before starting an expensive project, it’s better to look at the parameters (as shown above, see remarks for the ‘trivial model’). If you decide for optimization, data mining is not the only option. A value driven approach is as important as a mining model in many scenarios, unless all your customers share the same value. And it’s important to know your parameters as good as possible. In my example, the test sample of customers with and without vouchers was very important. Also keep in mind, that the success rate of a voucher (or any other method of prevention) is not a constant, but depends at least on the customer value (usually computed based on the orders of the past) and the likeliness of the customer to turn away (churn score, usually computed by a mining model). The combination of all these parameters is the key to making the optimization methods more efficient.

And I should also add a warning remark that applies to all the approaches we did here: We always try to model the behavior of the customers in the future based on data from the past. The first four approaches are all based on the same modeling idea (only the last model with the variable voucher is based on different pre-conditions). The only difference is the subset of customers that are addressed (different sort order). However, in all these cases the effect of the optimization may be different in reality and with every model it is important to also validate and constantly refine the model.

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 2 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

In part 1 of this mini series of data mining posts I showed how data mining can be used to optimize a churn prevention campaign by selecting the optimal sub sample of customers to receive a voucher. For this optimization I used a data mining model (without going into detail about the model itself). However, developing such a model is not always an easy task and can result in relevant additional costs. This post first looks at an “alternative” method which you might find easier to implement in the first step. In fact, when facing any kind of optimization or prediction problem, data mining should not be the only option you’re taking into account.

With our data mining approach we were able to transform the average churn rate of the customers (used by the trivial model first) into a customer specific churn rate. By sorting the customers by this individual churn rate we could address those customers first, for which is was most likely that they will not return. This made our investment more efficient.

For the next approach, we will ignore this individual churn score and go back to the average churn score of 50% from the example of my last post. But now we will be looking at a customer specific profit. Depending on your data, profit may be replaced by some other value measure for the customer (revenue, customer life time value etc.).

In the same way as in the previous approach, we can simply sort our customers by their value in descending order and address the customers with the highest value first. Again, the warning from my last post also applies here: The customers with the highest value might not respond to the $10 voucher in the same way as customers with a lower value. I will get back to this point later.

In the same way as before we can calculate the costs per customer:

Option

Costs

Option 1: Customers gets no voucher [average churn rate] x [customer profit]
Option 2: Customer gets voucher [average churn rate] x (1- [prevention success rate]) x [customer profit]
+ [cost of a single voucher]
=
[average churn rate] x [over all prevention success rate] x [customer profit]
+ [cost of a single voucher]

For my sample data, the chart (now including the trivial approach, the churn score approach and the new profit based approach) looks like this:

image

As you can see, for my test data both approaches (green and red line) perform almost in the same way. However, this highly depends on your customer base and I adjusted my data in a way not to prefer one of the two approaches. If there is a low variance in your customer values (almost the same value for every customer) the method based on the value will perform almost as bad as the trivial approach (as it gives no benefit then). If the variance is high, the value based method might easily outperform a purely data mining driven score calculation.

Another good idea would be to combine both methods. Instead of ordering the customers by value or churn score we could sort them by the product [customer values] x [churn score]. In this case both parameters are variables and the calculation looks somewhat like this:

Option

Costs

Option 1: Customers gets no voucher [churn score] x [customer profit]
Option 2: Customer gets voucher [churn score] x (1- [prevention success rate]) x [customer profit]
+ [cost of a single voucher]

For my sample data the chart output of all four methods shows that the combination is superior (which should be expected in most real world scenarios):

image

In my example the optimizations using only profit (value) or only churn score give an improvement by about 5.5% to 6% while the combination (purple line in the chart) gives about 10.2%.

For all of these samples we used the same parameters. How does the whole picture changes, if the parameters are changed (for example the average return rate)? And what may happen if we also adjust the last fixed parameter in this model, the value for the voucher itself? This will be the topic of the third and last part of this series.

  • Share/Bookmark

CeBIT 2011 in Hannover from March 1-5, 2011

Hilmar Buchta

The CeBIT in Hannover, Germany is a great time to learn about new technology, share information and discuss about new topics. As for the last years we (ORAYLIS GmbH) are also present and I would be glad to welcome you at our booth that is located within the Microsoft area: Fair Hall 4, Booth 26 / P30. I will be at the booth from March 2 (Wednesday) to March 5 (Saturday).

Hope to see you at the CeBIT!

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 1 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

Data Mining is frequently used to optimize costs in marketing scenarios. Most marketing campaigns result in costs and we want these campaigns to be as efficient as possible.

To illustrate this process let’s assume we’re operating a web shop. Customers need to register before they can order goods and therefore we know how many customers are returning. While we also spend some afford to win new customers, for this example we will focus on keeping the existing ones and making them buy more in our shop.

From our web shop system we know, that 50% of our customers place only one single order and do not return (within a given period of time, for example within a year). Marketing suggests to offer our existing customers a benefit in order to make them return, let’s say a $10 voucher.

 

The Simple Approach

While we will analyze more sophisticated models later, for now we will take a look at a very simple approach, that does not take differences between individual customers into account. Therefore, we are looking at average values. For example, the average profit for each order is $25.

Having just the information from above, we cannot decide if the idea of the marketing department is a good one. But we could run a test sample. Let’s say we give the voucher to a random sample of 500 customers. Now we can compare customers who got the voucher with customers who didn’t. After this test we find out that

Group 1 (Customers with voucher): 15% do not return
Group 2 (Customers without the voucher): 50% do not return (as we already knew)

So, based on our test sample, the voucher is pretty effective to keep customers returning. We can also calculate the success rate using this formula:

Unbenannt

The success rate is impressive, but is it also sensible from a cost perspective? To answer this, let’s look at two options that we have:

  1. We give vouchers to none of our customers
  2. We give vouchers to all of our customers

For both options, we want to calculate the total costs, including costs from loosing existing customers. For all of the following examples we assume that we only have 1,000 customers and that we are looking at a specific time frame (for example a year). Let’s start with the first option. If we don’t give vouchers to any of our customers, the only costs come from loosing customers and therefore loosing profit. Since 50% of our customers are not returning and the average profit from each customer was $25, we loose 1,000 x 50% x $25 = $12,500 in this scenario. Now let’s look at the second option: From our test sample above we know that only 15% of the customers do not return after receiving a voucher. In the same way as for the first option we can calculate the loss value as 1,000 x 15% x $25 = $3,750. But for this option we also need to add the costs for the vouchers, which is 1,000 x $10 = $10,000. So the total costs for option 2 are $3,750 + $10,000 = $13,750. To sum things up:

Option

Costs

1 vouchers to none of our customers

$ 12,500

2 vouchers to all of our customers

$ 13,750

Of course there would also be the option to offer half of our customers a voucher, but this would just mean that we take the average of the two cost values above as shown in the following chart:

image

In this chart, option 1 corresponds to the left endpoint of the line while option 2 corresponds to the right endpoint of the line. So, although our vouchers are very successful, they don’t make sense here as they result in higher costs. Of course we could think of changing the value for the voucher but we would have to run another test to see how our customer base responds to the new voucher level (I’ll get back to this point in part 3 of this series). In most cases, the random sample method from above would have been done using different groups with different values for the vouchers. The effect on our success rate is usually not linear, so it’s worth finding the best value for a voucher. But we’ll keep things simple here and assume that the $10 voucher is already the best we could do.

 

Data Mining

Did I mention data mining in the title? So far we didn’t do any data mining with on our data. But maybe data mining gives us an option to optimize the costs in our scenario?

The main difference between our analysis from above and the data mining approach (and some other options that I will come to) is that now we are looking at the individual customer. In the first approach we just had an average customer drop-off rate of 50%. With data mining, we’re trying to predict the individual rate/score per customer. To do so, we have to set up a mining model that is trained with the knowledge we have about our customers (for example from the web registration form) and the individual buying behavior in the past. This model can then be used to predict returning customers. As this post is about cost optimization, we’re not actually building the model here but we just look at the output. So let’s assume that every customer now has a churn score associated with the customer record and that the model is validated and tested and performs well (see my other posts about back testing mining models).

We can now calculate the individual costs per customer for our two options from above:

Option

Costs

Option 1: Customers gets no voucher churnscore x [average profit]
Option 2: Customer gets voucher churnscore x (1- [prevention success rate]) x [average profit]
+ [cost of a single voucher]

For the first option, the calculation is simple. We only take the expectancy value of the profit loss here. If the individual churn score is for example 30% (0.3) then the costs in this calculation are 30% x $25 = $7.5. Please note that although we are looking at a data mining approach here, I’m still using the average profit of the customer. We will change this later to see the additional effect.

For the second option, we’re using our prevention success rate (70% from above). Even with the voucher, some customers will still not be returning. And as in the example from above, we have to add the costs of the voucher itself. If the individual churn score is 30% (0.3 as in the example above), then this case would result in the following costs: 30% x (100%-70%) x $25  + $10 = $12,25.

For my example I’m using randomly generated test data (with a modified random function). As long as our average rates from above match with the score we’re using here, it is obvious that both approaches give the same result for the two decisions from above: no customers gets a voucher and every customer gets a voucher. So how do we use our churn score now? The answer is simple: Instead of addressing for example 50% of the customers we could now address 50% of the customers with the highest churn scores as we assume that it is more likely for them to go away.

The chart from above may now look like this:

image

You can clearly see the effect of the optimization here (green line). As we’re talking about costs, the lower the better. The green line showing our data mining approach has a minimum at appox. 22%. The value at that point is $11,818. So, using the data mining approach we could save $682 based on our sample set of 1,000 customers, compared to the total costs of $12,500 for the trivial approach. This is an improvement of about 5.45%. Remember though, that we’re only looking at sample data here, so for your scenario the output may be totally different. Also, customers with a high churn rate might not respond to our vouchers in the same way as customers with a lower churn rate.

In the next part of this post we’ll focus more on the profit value of this scenario.

  • Share/Bookmark

BOL Community Article about Back-Testing Data Mining Results

Hilmar Buchta

SQL Server 2008 | SQL Server 2008 R2

In addition to my blog posts about back-testing Data Mining Results (see part 1, part 2 and part 3) I also wrote a Books Online Community Article that covers the topic at a more detailed level. The article can be downloaded here: http://msdn.microsoft.com/en-us/library/gg557481.aspx. The sample data and solution is available for download. The link is provided in the article.

You will find all the Community Articles for SQL Server 2008 here: http://msdn.microsoft.com/en-us/library/cc872864(SQL.100).aspx

  • Share/Bookmark

  • Kategorien

  • Copyright © ORAYLIS Blog. All rights reserved.