MS SSAS

Same measure in different granularity

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

In an earlier post I wrote about handling different granularity in a dimension-fact relationship. This time I want to get back to this topic from the end-user perspective. To illustrate this, I use a very simple data model here:

Unbenannt5

For this model, we have daily revenue data and monthly revenue plan data. Here is the link from the measure groups to the dimensions:

Unbenannt6

In this example, both measure groups only have one measure: revenue. Since measures are a dimension of its own in the cube, measure names have to be distinct, so I named the measures “Revenue” and “Plan Revenue”.

Unbenannt7

If your cube contains more data from other measure groups, things can soon be getting confusion with many measure groups. This post is about two common solutions in order to reduce the number of measure groups here.

Two approaches are shown here:

  • Combining both measures in one measure group by using a calculated measure
  • Introducing a scenario dimension

Before we start, one remark. The last post was about using translations in the cube the add more flexibility regarding the naming of the measures. Since the measures are clearly associated with a measure group you could be tempted to use the same name for both measures (which is possible for translations). This is how the result would look like in the cube browser:

Unbenannt8

However, when using a Pivot client like Excel you usually don’t see the measure group anymore. A simple analysis could look like this:

Unbenannt9

So from your Pivot table you cannot tell which is the actual and which is the plan revenue. Therefore I recommend not to use the same name for different measures in the different measure groups.

Combining both measures in one measure group by using a calculated measure

One possible option is to add a calculated measure the the Revenue measure group. In order to do so, I renamed the plan revenue to “Plan Revenue Internal” and set its visibility to hidden. Then the calculation could be as shown below:

Unbenannt10

This is how the cube looks like in the cube browser:

Unbenannt11

By “copying” the plan measure into the revenue measure group you only have one visible measure group left, while both values are still separated by the name.

As I started the post about the different granularity, how does this look like at the detail level? The following screen shot shows the date dimension at the day level:

Unbenannt12

Assuming you have set the “IgnoreUnrelatedDimensions” property of the plan revenue measure group to false, only actual values are displayed here, no plan values. So this approach works really well.

 

Introducing a scenario dimension

Another common way for implementing this situation is to use a scenario dimension. The scenario dimension contains two entries: Actual and Plan. The data from the fact table FactOrder is linked to the scenario member ‘Actual’ while the data from the FactPlan table is linked to the scenario member ‘Plan’. This can be easily done in the data source view (DSV) as shown below for the plan table:

image

The fact tables are linked to the scenario dimension in the data source view:

t1

The scenario dimension should be set IsAggregatable=False (as it makes no sense to aggregate actual and plan data). Also we should provide a default element. This is shown in the screenshot below:

image

For the cube, we only want to have one single measure revenue which is neither of the two existing measures. Therefore we make both existing measures invisible. In order to distinguish them from visible measures, I prefixed them with an underscore:

image

Both measures have their Visible-property set to false. The only visible measure in this case has to be a calculated measure that takes the data from one of the two invisible measures depending on the chosen scenario. Since our scenarios are distinct and not aggregatable, we can simply add both measures (as one of the two is always zero):

image

Choosing this model for implementing the different granularity, the cube shows only one measure:

image

If the scenario dimension is not used, actual values are shown. You can still easily display both values by putting the scenario dimension on one of the axis:

image

If you compare this screenshot with the one from above you will see the same values, only the presentation is a little bit different (first option had two measures, last option one measure and a scenario dimension).

Summary

Both options shown here enable us to have the same measure in two or more measure groups of different granularity. The second option with the scenario dimension looks a little bit more tidy from a technical perspective. However, sometimes the first option is easier to use, especially when you add other calculated measures with formulas that are combining two or more scenarios (for example a measure like ‘Actual to plan ratio’). For those measures it’s not possible to assign them to one scenario. On the other hand, choosing the scenario dimension can dramatically reduce the number of visible measures in your cube, which makes the cube easier to understand for end-users. So depending on the requirements and the structure of the data, one of the two options will be the best to choose.

  • Share/Bookmark

Valid characters for measure names

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

The name for measures in a cube should be clear and easy to understand to avoid confusion for the end-users. However, not all characters are allowed for cube measures. Here is a list of characters, that is not allowed:

. , ; ‘ ` : / \ * | ? " & % $ ! + = ( ) [ ] { } < >

Some of these characters are allowed for calculated measures although I wouldn’t recommend this as it reduces the readability of the your MDX queries (see below).

For example, you cannot give the name “Avg. Sales Amount” to a built-in cube measure (because of the dot). Also sometimes the end-users want to see the unit with the measure name and one frequently used notation for this is to use square brackets like “Utilization [%]” (here the square brackets and the percent-sign violate the list of allowed characters).

On the other hand, many special characters are allowed. The following characters are just an example of the long list of symbols within the normal text:

Ø © α € £ ¥ ® ™ ± ≠ ≤ ≥ ÷ × ∞ µ

For example, it’s quite common to use the Ø-sign as a symbol for the average and you can name your measure “Revenue Ø”. In order to do so, one simple way is to open Microsoft Word and use the Insert Symbol dialog as shown below:

image

After inserting the symbol into a blank Word document you can simply copy it to the clip board and then paste the symbol to the measure name:

Unbenannt

The new measure name is also displayed correctly in OLAP clients like Microsoft Excel:

Unbenannt1

However, the special symbols might be more complicated to enter in certain situations where you cannot easily pick the measure from a list. For example, in MDX you also have to enter these character:

select {[Measures].[Revenue],[Measures].[Revenue Ø]} on 0,
[Dim Date].[Calendar Week].[Year] on 1
from [OLAPSample1]

For this reason I recommend measure names that only consist of simple characters and numbers, for example [AvgRevenue] instead of [Revenue Ø].

But how can we help our end-users to see the measures in the way they expect them to see? One simple solution is to use the cube translation for this purpose (Enterprise feature, you cannot do this on a Standard Edition). Within the cube translation you are not restricted to specific characters, so you can also use all of the above characters, while still keeping a simple technical name for the measure.

To show this, I added some more measures to the cube:

Unbenannt2

On the translation tab of the cube we can now add the displayed names for those measures:

Unbenannt3

And as expected, Excel correctly displays the “translated” measure names:

Unbenannt4

Using translations for the measure names also gives the flexibility to change the name of measures if required without also changing all the queries and calculations that are based on this cube (all queries use the built-in names that are now hidden from the end-user). So, translations are a good method to add a façade to your cube’s measures while also adding some more flexibility in the naming of the measures.

Although we can now be very flexible with the naming of the measures, I still recommend to keep things simple. In general, the unit of the measure is usually clear from the name and the format of the values (for example a percentage value). In this case you should avoid showing the unit with the measure name as it is only redundant in this case.

The next post will be about handling similar measures in different measure groups of different granularity (for example revenue as actual and as planned value). In this case you might question yourself how to name these measures properly. With translations you might be tempted to give two measures the same display name (which is possible!). Without anticipating too much of the next post, I don’t recommend to do this as can be very confusing to work with such a cube.

  • Share/Bookmark

Neues zum Thema BI auf der Microsoft TechEd 2011 in Atlanta

Jörg Plümacher

 

Auf der diesjährigen TechEd in Atlanta gab ein paar äußerst interessante Beiträge zum Thema Datenbanken und Business Intelligence.

Wichtiger Beitrag zum Thema Microsoft BI und wie es mit SSAS und BISM weitergeht:
What’s New in Microsoft SQL Server Code-Named "Denali" for SQL Server Analysis Services and PowerPivot – laut Chris Webb ein “Must see”
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2011/DBI206

Abundantly "Crescent": Demos Galore
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2011/DBI208

Alle Videos zum Thema Business Intelligence – insgesamt 42 Session!! Ich finde hier sieht man das Microsoft das Thema sehr ernst nimmt
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2011?t=database%2B-and-%2Bbusiness%2Bintelligence

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 3 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

The last two posts were about cost optimization for a churn prevention campaign. We analyzed the following four options:

  1. Trivial option (only using average return rate and average customer value)
  2. Data Mining (using average customer value but individual return rate based on a mining model)
  3. Value driven approach (using average return rate but individual customer value)
  4. Combination of method 2 and 3 (using individual return rate and value)

As been said in the last post, the results may differ a lot from case to case. Especially the effectiveness of option 2 and 3 depend a lot on the available information for data mining (for option 2) and the variance of the customer value (for option 3). If both methods give some improvement, then the combination can be expected to be the best choice.

However, we always looked at a scenario where the trivial approach gave almost no insight. What I mean is, that for the trivial approach there is not much difference between giving every customer a voucher compared to giving no customer a voucher (costs between $12,500 and $13,750 in this example). In other words the line chart showing the costs for the trivial example was almost a horizontal line. Here is the corresponding chart from the first post (click on the image to see a larger version):

image

Now, let’s change our basic parameter a little bit. The following table shows the old and the new parameters:

 

Old scenario

New scenario

Avg. Customer Profit

$25

$50

Avg. Churn Rate

50%

80%

Voucher costs (prevention costs)

$10

$5

Over all Churn Rate with prevention

15%

5%

Here are the resulting chart (I’ve copied the chart for the old scenario from my last post):

Old scenario

New scenario

image image

While we clearly see the effect of the optimization in the old scenario (left chart), as the purple line (combined method) has a significant minimum compared to the blue line (trivial approach), in the new scenario (right chart) the minimum of all three methods is almost identical. The trivial method (blue line) has it’s minimal costs at the right end point of the line but although the other methods perform better in the middle range of the chart, they cannot deliver values that are much lower than the right end point of the blue line (for my sample data, even the combined model performs only about 1.2% better than the trivial model from our first approach in this case).

In general, a good optimization is more easy to achieve, if the trivial method has no clear decision (line is almost horizontally oriented). In the new scenario, the profit for each customer is high compared to the prevention costs and the prevention campaign is extremely efficient. In such a case you cannot expect much optimization from your value- or data mining driven approach.

So the first thing to remember from this post is to first check the trivial approach. This is the approach that is almost instantly available for most situations, even if the values are just estimations. If the result is more like the blue line in the left chart (horizontally oriented) it is very likely that you can achieve a significant optimization. If it is more like the blue line in the right chart (either falling or rising) you might only want to check further improvements if the costs for getting the churn score or the individual value are not too high. Otherwise you would risk doing an expensive optimization project with the result, that no optimization is possible.

There is still one open question from the last post and that is about the value for the voucher. Up to this point our success rate for making customers return by using a voucher was a fixed average value based on a test sample. Of course, this does not appear to be sensible: Customer with a high value might even be annoyed by a cheap voucher. Also, customers with a high churn value (likely to go away) might not respond to our voucher campaign in the same way as customers with a low churn rate (customers that are likely to return no matter if there is a voucher or not). So it’s time to add more reality to the model. To do so, we’ll have to analyze the data from our customer test samples (first post) in more detail. We only analyzing by customer value (an advanced model could also analyze by churn score, although a bigger test sample is required then). The following table shows the range for the customer profit, the voucher that was associated with that range and the number of customers who did not return although they received the voucher:

Profit Rage

Number of customers in sample

Value of Voucher

Churn rate (not returning)

0-10

78

0

31%

10-15

55

3

21%

15-25

447

5

17%

25-35

275

10

10%

>35

145

15

7%

With this extra information, we can calculate the costs for the voucher and the probability to prevent the customer from going away more precisely. I’m still using the sort order from my combined model. Here is the result in the chart view. In order to make the difference easier to see I changed the minimum for the y-axis (don’t be fooled by the different presentation):

image

The newly created model is named ‘Variable Model’ in this plot (line color cyan). Because of the different bands the line is not as smooth as for the other models. However it turns out, that this approach is the best one, based on my sample data. Also you’ll notice that the end points of the cyan line differ from the other lines’ endpoints because the costs for the vouchers are no longer a constant. Here are the detailed results from all the approaches:

Trivial Model

Profit Model

Churn Score Model

Combined Model

Variable Model

Minimal costs

12,500.00

11,761.15

11,817.60

11,228.95

10,999.31

Improvement

0.00

738.85

682.40

1,271.05

1,500.69

Improvement %

0.0%

5.9%

5.5%

10.2%

12.0%

After the three posts now it’s time for a short summary:

 

Summary

For cost optimization it’s important to first analyze the actual situation. Optimization is not always possible and before starting an expensive project, it’s better to look at the parameters (as shown above, see remarks for the ‘trivial model’). If you decide for optimization, data mining is not the only option. A value driven approach is as important as a mining model in many scenarios, unless all your customers share the same value. And it’s important to know your parameters as good as possible. In my example, the test sample of customers with and without vouchers was very important. Also keep in mind, that the success rate of a voucher (or any other method of prevention) is not a constant, but depends at least on the customer value (usually computed based on the orders of the past) and the likeliness of the customer to turn away (churn score, usually computed by a mining model). The combination of all these parameters is the key to making the optimization methods more efficient.

And I should also add a warning remark that applies to all the approaches we did here: We always try to model the behavior of the customers in the future based on data from the past. The first four approaches are all based on the same modeling idea (only the last model with the variable voucher is based on different pre-conditions). The only difference is the subset of customers that are addressed (different sort order). However, in all these cases the effect of the optimization may be different in reality and with every model it is important to also validate and constantly refine the model.

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 2 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

In part 1 of this mini series of data mining posts I showed how data mining can be used to optimize a churn prevention campaign by selecting the optimal sub sample of customers to receive a voucher. For this optimization I used a data mining model (without going into detail about the model itself). However, developing such a model is not always an easy task and can result in relevant additional costs. This post first looks at an “alternative” method which you might find easier to implement in the first step. In fact, when facing any kind of optimization or prediction problem, data mining should not be the only option you’re taking into account.

With our data mining approach we were able to transform the average churn rate of the customers (used by the trivial model first) into a customer specific churn rate. By sorting the customers by this individual churn rate we could address those customers first, for which is was most likely that they will not return. This made our investment more efficient.

For the next approach, we will ignore this individual churn score and go back to the average churn score of 50% from the example of my last post. But now we will be looking at a customer specific profit. Depending on your data, profit may be replaced by some other value measure for the customer (revenue, customer life time value etc.).

In the same way as in the previous approach, we can simply sort our customers by their value in descending order and address the customers with the highest value first. Again, the warning from my last post also applies here: The customers with the highest value might not respond to the $10 voucher in the same way as customers with a lower value. I will get back to this point later.

In the same way as before we can calculate the costs per customer:

Option

Costs

Option 1: Customers gets no voucher [average churn rate] x [customer profit]
Option 2: Customer gets voucher [average churn rate] x (1- [prevention success rate]) x [customer profit]
+ [cost of a single voucher]
=
[average churn rate] x [over all prevention success rate] x [customer profit]
+ [cost of a single voucher]

For my sample data, the chart (now including the trivial approach, the churn score approach and the new profit based approach) looks like this:

image

As you can see, for my test data both approaches (green and red line) perform almost in the same way. However, this highly depends on your customer base and I adjusted my data in a way not to prefer one of the two approaches. If there is a low variance in your customer values (almost the same value for every customer) the method based on the value will perform almost as bad as the trivial approach (as it gives no benefit then). If the variance is high, the value based method might easily outperform a purely data mining driven score calculation.

Another good idea would be to combine both methods. Instead of ordering the customers by value or churn score we could sort them by the product [customer values] x [churn score]. In this case both parameters are variables and the calculation looks somewhat like this:

Option

Costs

Option 1: Customers gets no voucher [churn score] x [customer profit]
Option 2: Customer gets voucher [churn score] x (1- [prevention success rate]) x [customer profit]
+ [cost of a single voucher]

For my sample data the chart output of all four methods shows that the combination is superior (which should be expected in most real world scenarios):

image

In my example the optimizations using only profit (value) or only churn score give an improvement by about 5.5% to 6% while the combination (purple line in the chart) gives about 10.2%.

For all of these samples we used the same parameters. How does the whole picture changes, if the parameters are changed (for example the average return rate)? And what may happen if we also adjust the last fixed parameter in this model, the value for the voucher itself? This will be the topic of the third and last part of this series.

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 1 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

Data Mining is frequently used to optimize costs in marketing scenarios. Most marketing campaigns result in costs and we want these campaigns to be as efficient as possible.

To illustrate this process let’s assume we’re operating a web shop. Customers need to register before they can order goods and therefore we know how many customers are returning. While we also spend some afford to win new customers, for this example we will focus on keeping the existing ones and making them buy more in our shop.

From our web shop system we know, that 50% of our customers place only one single order and do not return (within a given period of time, for example within a year). Marketing suggests to offer our existing customers a benefit in order to make them return, let’s say a $10 voucher.

 

The Simple Approach

While we will analyze more sophisticated models later, for now we will take a look at a very simple approach, that does not take differences between individual customers into account. Therefore, we are looking at average values. For example, the average profit for each order is $25.

Having just the information from above, we cannot decide if the idea of the marketing department is a good one. But we could run a test sample. Let’s say we give the voucher to a random sample of 500 customers. Now we can compare customers who got the voucher with customers who didn’t. After this test we find out that

Group 1 (Customers with voucher): 15% do not return
Group 2 (Customers without the voucher): 50% do not return (as we already knew)

So, based on our test sample, the voucher is pretty effective to keep customers returning. We can also calculate the success rate using this formula:

Unbenannt

The success rate is impressive, but is it also sensible from a cost perspective? To answer this, let’s look at two options that we have:

  1. We give vouchers to none of our customers
  2. We give vouchers to all of our customers

For both options, we want to calculate the total costs, including costs from loosing existing customers. For all of the following examples we assume that we only have 1,000 customers and that we are looking at a specific time frame (for example a year). Let’s start with the first option. If we don’t give vouchers to any of our customers, the only costs come from loosing customers and therefore loosing profit. Since 50% of our customers are not returning and the average profit from each customer was $25, we loose 1,000 x 50% x $25 = $12,500 in this scenario. Now let’s look at the second option: From our test sample above we know that only 15% of the customers do not return after receiving a voucher. In the same way as for the first option we can calculate the loss value as 1,000 x 15% x $25 = $3,750. But for this option we also need to add the costs for the vouchers, which is 1,000 x $10 = $10,000. So the total costs for option 2 are $3,750 + $10,000 = $13,750. To sum things up:

Option

Costs

1 vouchers to none of our customers

$ 12,500

2 vouchers to all of our customers

$ 13,750

Of course there would also be the option to offer half of our customers a voucher, but this would just mean that we take the average of the two cost values above as shown in the following chart:

image

In this chart, option 1 corresponds to the left endpoint of the line while option 2 corresponds to the right endpoint of the line. So, although our vouchers are very successful, they don’t make sense here as they result in higher costs. Of course we could think of changing the value for the voucher but we would have to run another test to see how our customer base responds to the new voucher level (I’ll get back to this point in part 3 of this series). In most cases, the random sample method from above would have been done using different groups with different values for the vouchers. The effect on our success rate is usually not linear, so it’s worth finding the best value for a voucher. But we’ll keep things simple here and assume that the $10 voucher is already the best we could do.

 

Data Mining

Did I mention data mining in the title? So far we didn’t do any data mining with on our data. But maybe data mining gives us an option to optimize the costs in our scenario?

The main difference between our analysis from above and the data mining approach (and some other options that I will come to) is that now we are looking at the individual customer. In the first approach we just had an average customer drop-off rate of 50%. With data mining, we’re trying to predict the individual rate/score per customer. To do so, we have to set up a mining model that is trained with the knowledge we have about our customers (for example from the web registration form) and the individual buying behavior in the past. This model can then be used to predict returning customers. As this post is about cost optimization, we’re not actually building the model here but we just look at the output. So let’s assume that every customer now has a churn score associated with the customer record and that the model is validated and tested and performs well (see my other posts about back testing mining models).

We can now calculate the individual costs per customer for our two options from above:

Option

Costs

Option 1: Customers gets no voucher churnscore x [average profit]
Option 2: Customer gets voucher churnscore x (1- [prevention success rate]) x [average profit]
+ [cost of a single voucher]

For the first option, the calculation is simple. We only take the expectancy value of the profit loss here. If the individual churn score is for example 30% (0.3) then the costs in this calculation are 30% x $25 = $7.5. Please note that although we are looking at a data mining approach here, I’m still using the average profit of the customer. We will change this later to see the additional effect.

For the second option, we’re using our prevention success rate (70% from above). Even with the voucher, some customers will still not be returning. And as in the example from above, we have to add the costs of the voucher itself. If the individual churn score is 30% (0.3 as in the example above), then this case would result in the following costs: 30% x (100%-70%) x $25  + $10 = $12,25.

For my example I’m using randomly generated test data (with a modified random function). As long as our average rates from above match with the score we’re using here, it is obvious that both approaches give the same result for the two decisions from above: no customers gets a voucher and every customer gets a voucher. So how do we use our churn score now? The answer is simple: Instead of addressing for example 50% of the customers we could now address 50% of the customers with the highest churn scores as we assume that it is more likely for them to go away.

The chart from above may now look like this:

image

You can clearly see the effect of the optimization here (green line). As we’re talking about costs, the lower the better. The green line showing our data mining approach has a minimum at appox. 22%. The value at that point is $11,818. So, using the data mining approach we could save $682 based on our sample set of 1,000 customers, compared to the total costs of $12,500 for the trivial approach. This is an improvement of about 5.45%. Remember though, that we’re only looking at sample data here, so for your scenario the output may be totally different. Also, customers with a high churn rate might not respond to our vouchers in the same way as customers with a lower churn rate.

In the next part of this post we’ll focus more on the profit value of this scenario.

  • Share/Bookmark

BOL Community Article about Back-Testing Data Mining Results

Hilmar Buchta

SQL Server 2008 | SQL Server 2008 R2

In addition to my blog posts about back-testing Data Mining Results (see part 1, part 2 and part 3) I also wrote a Books Online Community Article that covers the topic at a more detailed level. The article can be downloaded here: http://msdn.microsoft.com/en-us/library/gg557481.aspx. The sample data and solution is available for download. The link is provided in the article.

You will find all the Community Articles for SQL Server 2008 here: http://msdn.microsoft.com/en-us/library/cc872864(SQL.100).aspx

  • Share/Bookmark

Cumulated Gains Chart and Lift Chart in SSRS

Hilmar Buchta

SQL Server 2008 | SQL Server 2008 R2

This post is about reproducing some of the mining charts that are built into Business Intelligence Development Studio using Reporting Services. A common use case might be that you periodically train your model and the results of the training should be published as html reports.

For my example I use the Adventure Works Targeted Mailing mining structure. We want to plot the results from the “TM Decision Tree Model”. First let’s start with the lift chart. In BIDS, the chart for “Bike Buyer=1” looks like this:

image

In order to reproduce the chart, we’re going to use the SystemGetLiftTable stored function in SSAS. Since we need to do some computations, I prefer to use SQL, so we have to define a linked server first:

EXEC sp_addlinkedserver
@server=’SSAS_AW’,
@srvproduct=”,
@provider=’MSOLAP’,
@datasrc=’Adventure Works DW 2008R2′

The query for loading the results from the lift chart looks like this:

SELECT percentile,
       VALUE,
       CASE
         WHEN percentile >= 100 * CONVERT(FLOAT, totalattributevaluecases) /
                            totalcases
       THEN 100
         ELSE percentile / (( CONVERT(FLOAT, totalattributevaluecases) /
                              totalcases ))
       END idealmodel
FROM   Openquery(ssas_aw,
‘CALL SystemGetLiftTable([TM Decision Tree], 2, ”Bike Buyer”, 1)’) AS
derivedtbl_1
 

For the chart, we use the percentile of the lift table as category group (x-axis) and choose these entries for the values (y-axis):

  • Sum(Value)
    This is the actual lift curve for our mining model
  • Sum(Percentile)
    As this is the same value as on the x-axis, this gives the random guess model line (linear)
  • Sum(IdealModel)
    The case-when-end statement above reflects the curve for the ideal model (linear function from 0 to 1 between 0 and totalattribute/totalcases)

This is how the chart looks like in the SSRS designer:

image

There are different names for this kind of a chart. For this post I’m using the naming conventions as shown here: http://www2.cs.uregina.ca/~hamilton/courses/831/notes/lift_chart/lift_chart.html, so the chart is labeled “Cumulated Gains Chart” instead of lift chart (BIDS).

So, here is the resulting Cumulated Gains Chart from my SSRS report:

image

As you can see, the chart looks pretty much the same as the chart that was displayed in BIDS.

Another useful chart is the actual lift factor of the model, often referred to as the ‘lift chart’. For this chart, we use the same SSAS function but we need to compute the lift in our query: 

SELECT percentile,
       VALUE / percentile AS lift,
       1                  AS randomguess
FROM   Openquery(ssas_aw,
‘CALL SystemGetLiftTable([TM Decision Tree], 2, ”Bike Buyer”, 1)’) AS
derivedtbl_1
 

This is how our chart looks like:

 image

Finally, we also want to reproduce the classification matrix (often referred to as confusion matrix). This is how the matrix looks like in BIDS:

image

To get these results, we use the function SystemGetClassificationMatrix as shown below:

SELECT predictedvalue,
       SUM(CASE actualvalue
             WHEN 0 THEN [COUNT]
             ELSE 0
           END) actual_0,
       SUM(CASE actualvalue
             WHEN 1 THEN [COUNT]
             ELSE 0
           END) actual_1
FROM   Openquery(ssas_aw,
‘CALL SystemGetClassificationMatrix( [TM Decision Tree], 2, ”Bike Buyer”)’
) AS derivedtbl_1
GROUP  BY predictedvalue
 

The second parameter of the function SystemGetClassificationMatrix means
1 – training data
2 – test data
3 – both (training and test data)

I used a simple matrix on my report to display the result from the query above and applied some coloring in order to distinguish error cases from correct cases:

image

To complete this post, I’d like to point out that you can also query your model cases directly, using a query like

SELECT *, IsTestCase() As TestCase FROM [TM Decision Tree].CASES

This query returns all the cases from the TM Decision Tree model with an additional column “TestCase” that contains true, of the case belongs to the test data set. In order to run this query you need to enable the drill-through option for the mining model. You can find out more about the SSAS mining functions in this post: http://www.bogdancrivat.net/dm/archives/14

  • Share/Bookmark

Do you trust your Data Mining results? – Part 3/3

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

This is the third and last post of my Data Mining back testing mini series. For my first posts I made the assumption that no action is taken based on the prediction. Although the mining process marked some of our contracts with a high probability for being cancelled, we did nothing to prevent our customers from doing so.

Of course, this is not the real life scenario. Usually our goal is churn prevention (or some other action based on the mining results). However, any action we take will (hopefully) change the behavior of our customers. Or to be more precise, we expect that less customers are really cancelling their contract. This is always a challenge for our mining process in total as the prevention may “dry out” our mining source data. After a while we might not have any non-influenced customers left with a high probability for cancelling the contract. A test group would help but in many scenarios this is not wanted (because the customers are just too important to risk loosing them just for the sake of some statistical validation). Another method would be to leave the influenced customers out of the training set. This doesn’t disturb our prediction but do we still know our prevention is successful enough?

For our back testing process, we can also add the expected prevention rate to our Monte Carlo procedure. In this case we’re not only trying to validate the model but also the prevention rate. On the other hand, if our test fails we’re not sure if it’s the model or our assumed prevention rate that is actually wrong.

For the following I assume that every customer with a churn probability greater then 35% gets prevention. From the past we know that 90% of our preventions are successful. Here are the results:

Old model (no prevention included):
image

New model (including prevention):
image

As expected the curves have shifted to the left. In the new model (including the prevention) the number of customers cancelling the contract is significantly lower than in the old model.

We can still use the same methods as shown in the first post to derive the threshold value for our test. This is the curve for the alpha and beta error (again: for beta for choose an alternative model that goes off by 3%).

image

Again, here are some values for the threshold T, alpha and beta:

T

Alpha

Beta

1580

88.7

0.0

1600

75.6

0.1

1620

56.7

0.4

1640

36.5

1.6

1660

19.5

5.1

1680

8.4

13.3

1700

3.0

27.0

1720

0.9

46.0

1740

0.2

65.7

For example with T=1680 we get alpha<10% and beta<14%. But as being said above, if our model fails the test we’re now not sure anymore if it is the model or the prevention rate. Again, the best way to avoid this uncertainty would be to use test groups. For example, we could exclude a random sample of 10% of all contracts with a churn score above 35% from prevention. Then we’re able to compare the behavior of the random sample (test group) with the customers that received prevention benefits. This test group enables us to measure the effectiveness of our prevention activity. And of course you could set up a statistical test in the same way we described here, to proof that the test group really supports a certain success rate.

To wrap things up, depending on the products, processes, the way new customers are acquired etc., the real-life data mining problems may be much more difficult than presented in this small blog series. Anyway, the methods described here, like the Monte Carlo algorithm or the statistical hypothesis test, are the main building blocks for achieving a reliable understanding how well the model performs. So, depending on the actual scenario, the above mentioned methods may be combined, adjusted and repeated to get the desired results.

This also ends this mini series about back testing. The basic methods are more or less ubiquitous and may also be used for several similar problems, for example to do the back testing for a financial risk model. By adjusting and combining the methods, a specific business requirement may be modeled into a reliable test. And as long as our model conforms to the test, we can have enough trust in our data mining models to use them as the basis for business decisions.

  • Share/Bookmark

Do you trust your Data Mining results? – Part 2/3

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

While the first part of this post was more about the idea and interpreting the results of the test, this part shows how to implement the Monte Carlo test.

First, we need a table with the predicted data mining probabilities. This is the output of the PredictProbability function from your mining result query. I’m using the same source data as in my previous post here. If you like you can easily create your own table and populate it with random probability values in order to test the code for the simulation below:

CREATE TABLE [dbo].[Mining_Result](
    [CaseKey] [int] NOT NULL,
    [PredictScore] [float] NULL
) ON [PRIMARY]

declare @i int=0

    while (@i<10000) begin

        insert into Mining_Result(CaseKey, PredictScore)
        values(@i, convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5)

        set @i=@i+1

end

Don’t be confused by the convert(…cast…cast newid()…) expression. This is just my approach to calculate a random number within an SQL select statement.

Next we need a table for storing our Mining results:

CREATE TABLE [dbo].[Mining_Histogram](
    [NumCases] [int] NOT NULL,
    [Count] [int] NULL,
    [Perc] [float] NULL,
    [RunningPerc] [float] NULL,
CONSTRAINT [PK_DM_Histogram] PRIMARY KEY CLUSTERED
(
    [NumCases] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

 

Then this is how we’re doing our Monte Carlo test:

truncate table Mining_Histogram;

declare @numtrials int = 10000;
declare @cnt int;
declare @lp int;

set @lp=0;

– perform a monte carlo test:

while (@lp<@numtrials) begin

    select @cnt=COUNT(*) from Mining_Result where PredictScore >
        convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5        
    if exists(select NumCases from Mining_Histogram Where NumCases=@cnt)    
            update Mining_Histogram set [Count]=[Count]+1 Where NumCases=@cnt
        else
            insert into Mining_Histogram(NumCases,[Count]) values (@cnt, 1)
    set @lp=@lp+1;   

end

I’m using the same trick for the random numbers as shown above. In this example, we’re doing 10,000 iterations. For each iterations we compute the number of cases for which the Predicted Score is higher than a random number. For example, if for a certain case the predict score is 0.8 it is more likely that a random number between 0.0 and 1.0 is below the score than for a prediction score of 0.1.

Next, we’re filling the gaps in our histogram table with zeros to make the histogram look nicer:

declare @min int;
declare @max int;
select @min=MIN(NumCases), @max=MAX(NumCases) from Mining_Histogram

set @lp=@min;
while (@lp<@max) begin
    if not exists(select NumCases From Mining_Histogram Where NumCases=@lp)
        insert into Mining_Histogram(NumCases,[Count]) values (@lp, 0);
    set @lp=@lp+1
end

Finally we’re computing the row probability and the running total using this T-SQL:

declare @maxcount float;
select @maxcount=SUM([Count]) from Mining_Histogram;
update Mining_Histogram Set Perc=[Count]/@maxcount;

declare @CaseIdx int
declare @perc float
declare @RunningTotal float =0

DECLARE rt_cursor CURSOR FOR select NumCases, Perc From Mining_Histogram
OPEN rt_cursor

FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc

WHILE @@FETCH_STATUS = 0
BEGIN
  SET @RunningTotal = @RunningTotal + @perc
  update Mining_Histogram set RunningPerc=@RunningTotal Where NumCases=@CaseIdx
  FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc
END

CLOSE rt_cursor
DEALLOCATE rt_cursor

 

After running the simulation this is how the plots of the result look like (using my own values). The first plot shows the value of the field NumCases on the x-axis and the value of the field perc on the y-axis. The second plot has the same x-axis but shows the RunningPerc field on the y-axis:

image

 image

These two plots look very much the same as the plots from my last post (although I used C# code there to generate the histogram data).

If you used the randomly generated scores from above for testing, you will notice the peak being around 5000 cases (instead of 2800 cases in my example).

And if you like a smoother version of the density function  (as all the teeth and bumps mainly result from Monte Carlo approach), you could use this SQL query to compute a moving average:

declare @minrange int=0
declare @windowsize int = 50

select @minrange=Min(NumCases) from Mining_Histogram

SELECT     H.NumCases, AVG(H1.[Count]) [Count], AVG(H1.Perc) Perc
FROM         Mining_Histogram H
left join Mining_Histogram H1 on H1.NumCases between H.NumCases-@windowsize and H.NumCases

where H.NumCases>@minrange+@windowsize

group by H.NumCases

image

In order to do the histogram computation automatically with prediction query I recommend putting the code in an SSIS script component. I would also use another type of random number generator. This also allows you to set the seed for the random number generator. For my implementation I used an asynchronous script component that first loads all cases into memory (ArrayList collection), then performs the Monte Carlo test on the in-memory data and then writes the results back to the output buffer. This allows you do work with more scenarios and to log the progress during the loading and testing phase of the component.

I’m planning to write a Books Online Community Technical article on this topic. This article will be more detailed regarding the implementation. I will post a link to this article in my blog then.

  • Share/Bookmark

  • Kategorien

  • Copyright © ORAYLIS Blog. All rights reserved.