Tag: Data Mining

Profit calculation for churn prevention data mining models (part 3 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

The last two posts were about cost optimization for a churn prevention campaign. We analyzed the following four options:

  1. Trivial option (only using average return rate and average customer value)
  2. Data Mining (using average customer value but individual return rate based on a mining model)
  3. Value driven approach (using average return rate but individual customer value)
  4. Combination of method 2 and 3 (using individual return rate and value)

As been said in the last post, the results may differ a lot from case to case. Especially the effectiveness of option 2 and 3 depend a lot on the available information for data mining (for option 2) and the variance of the customer value (for option 3). If both methods give some improvement, then the combination can be expected to be the best choice.

However, we always looked at a scenario where the trivial approach gave almost no insight. What I mean is, that for the trivial approach there is not much difference between giving every customer a voucher compared to giving no customer a voucher (costs between $12,500 and $13,750 in this example). In other words the line chart showing the costs for the trivial example was almost a horizontal line. Here is the corresponding chart from the first post (click on the image to see a larger version):

image

Now, let’s change our basic parameter a little bit. The following table shows the old and the new parameters:

 

Old scenario

New scenario

Avg. Customer Profit

$25

$50

Avg. Churn Rate

50%

80%

Voucher costs (prevention costs)

$10

$5

Over all Churn Rate with prevention

15%

5%

Here are the resulting chart (I’ve copied the chart for the old scenario from my last post):

Old scenario

New scenario

image image

While we clearly see the effect of the optimization in the old scenario (left chart), as the purple line (combined method) has a significant minimum compared to the blue line (trivial approach), in the new scenario (right chart) the minimum of all three methods is almost identical. The trivial method (blue line) has it’s minimal costs at the right end point of the line but although the other methods perform better in the middle range of the chart, they cannot deliver values that are much lower than the right end point of the blue line (for my sample data, even the combined model performs only about 1.2% better than the trivial model from our first approach in this case).

In general, a good optimization is more easy to achieve, if the trivial method has no clear decision (line is almost horizontally oriented). In the new scenario, the profit for each customer is high compared to the prevention costs and the prevention campaign is extremely efficient. In such a case you cannot expect much optimization from your value- or data mining driven approach.

So the first thing to remember from this post is to first check the trivial approach. This is the approach that is almost instantly available for most situations, even if the values are just estimations. If the result is more like the blue line in the left chart (horizontally oriented) it is very likely that you can achieve a significant optimization. If it is more like the blue line in the right chart (either falling or rising) you might only want to check further improvements if the costs for getting the churn score or the individual value are not too high. Otherwise you would risk doing an expensive optimization project with the result, that no optimization is possible.

There is still one open question from the last post and that is about the value for the voucher. Up to this point our success rate for making customers return by using a voucher was a fixed average value based on a test sample. Of course, this does not appear to be sensible: Customer with a high value might even be annoyed by a cheap voucher. Also, customers with a high churn value (likely to go away) might not respond to our voucher campaign in the same way as customers with a low churn rate (customers that are likely to return no matter if there is a voucher or not). So it’s time to add more reality to the model. To do so, we’ll have to analyze the data from our customer test samples (first post) in more detail. We only analyzing by customer value (an advanced model could also analyze by churn score, although a bigger test sample is required then). The following table shows the range for the customer profit, the voucher that was associated with that range and the number of customers who did not return although they received the voucher:

Profit Rage

Number of customers in sample

Value of Voucher

Churn rate (not returning)

0-10

78

0

31%

10-15

55

3

21%

15-25

447

5

17%

25-35

275

10

10%

>35

145

15

7%

With this extra information, we can calculate the costs for the voucher and the probability to prevent the customer from going away more precisely. I’m still using the sort order from my combined model. Here is the result in the chart view. In order to make the difference easier to see I changed the minimum for the y-axis (don’t be fooled by the different presentation):

image

The newly created model is named ‘Variable Model’ in this plot (line color cyan). Because of the different bands the line is not as smooth as for the other models. However it turns out, that this approach is the best one, based on my sample data. Also you’ll notice that the end points of the cyan line differ from the other lines’ endpoints because the costs for the vouchers are no longer a constant. Here are the detailed results from all the approaches:

Trivial Model

Profit Model

Churn Score Model

Combined Model

Variable Model

Minimal costs

12,500.00

11,761.15

11,817.60

11,228.95

10,999.31

Improvement

0.00

738.85

682.40

1,271.05

1,500.69

Improvement %

0.0%

5.9%

5.5%

10.2%

12.0%

After the three posts now it’s time for a short summary:

 

Summary

For cost optimization it’s important to first analyze the actual situation. Optimization is not always possible and before starting an expensive project, it’s better to look at the parameters (as shown above, see remarks for the ‘trivial model’). If you decide for optimization, data mining is not the only option. A value driven approach is as important as a mining model in many scenarios, unless all your customers share the same value. And it’s important to know your parameters as good as possible. In my example, the test sample of customers with and without vouchers was very important. Also keep in mind, that the success rate of a voucher (or any other method of prevention) is not a constant, but depends at least on the customer value (usually computed based on the orders of the past) and the likeliness of the customer to turn away (churn score, usually computed by a mining model). The combination of all these parameters is the key to making the optimization methods more efficient.

And I should also add a warning remark that applies to all the approaches we did here: We always try to model the behavior of the customers in the future based on data from the past. The first four approaches are all based on the same modeling idea (only the last model with the variable voucher is based on different pre-conditions). The only difference is the subset of customers that are addressed (different sort order). However, in all these cases the effect of the optimization may be different in reality and with every model it is important to also validate and constantly refine the model.

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 2 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

In part 1 of this mini series of data mining posts I showed how data mining can be used to optimize a churn prevention campaign by selecting the optimal sub sample of customers to receive a voucher. For this optimization I used a data mining model (without going into detail about the model itself). However, developing such a model is not always an easy task and can result in relevant additional costs. This post first looks at an “alternative” method which you might find easier to implement in the first step. In fact, when facing any kind of optimization or prediction problem, data mining should not be the only option you’re taking into account.

With our data mining approach we were able to transform the average churn rate of the customers (used by the trivial model first) into a customer specific churn rate. By sorting the customers by this individual churn rate we could address those customers first, for which is was most likely that they will not return. This made our investment more efficient.

For the next approach, we will ignore this individual churn score and go back to the average churn score of 50% from the example of my last post. But now we will be looking at a customer specific profit. Depending on your data, profit may be replaced by some other value measure for the customer (revenue, customer life time value etc.).

In the same way as in the previous approach, we can simply sort our customers by their value in descending order and address the customers with the highest value first. Again, the warning from my last post also applies here: The customers with the highest value might not respond to the $10 voucher in the same way as customers with a lower value. I will get back to this point later.

In the same way as before we can calculate the costs per customer:

Option

Costs

Option 1: Customers gets no voucher [average churn rate] x [customer profit]
Option 2: Customer gets voucher [average churn rate] x (1- [prevention success rate]) x [customer profit]
+ [cost of a single voucher]
=
[average churn rate] x [over all prevention success rate] x [customer profit]
+ [cost of a single voucher]

For my sample data, the chart (now including the trivial approach, the churn score approach and the new profit based approach) looks like this:

image

As you can see, for my test data both approaches (green and red line) perform almost in the same way. However, this highly depends on your customer base and I adjusted my data in a way not to prefer one of the two approaches. If there is a low variance in your customer values (almost the same value for every customer) the method based on the value will perform almost as bad as the trivial approach (as it gives no benefit then). If the variance is high, the value based method might easily outperform a purely data mining driven score calculation.

Another good idea would be to combine both methods. Instead of ordering the customers by value or churn score we could sort them by the product [customer values] x [churn score]. In this case both parameters are variables and the calculation looks somewhat like this:

Option

Costs

Option 1: Customers gets no voucher [churn score] x [customer profit]
Option 2: Customer gets voucher [churn score] x (1- [prevention success rate]) x [customer profit]
+ [cost of a single voucher]

For my sample data the chart output of all four methods shows that the combination is superior (which should be expected in most real world scenarios):

image

In my example the optimizations using only profit (value) or only churn score give an improvement by about 5.5% to 6% while the combination (purple line in the chart) gives about 10.2%.

For all of these samples we used the same parameters. How does the whole picture changes, if the parameters are changed (for example the average return rate)? And what may happen if we also adjust the last fixed parameter in this model, the value for the voucher itself? This will be the topic of the third and last part of this series.

  • Share/Bookmark

Profit calculation for churn prevention data mining models (part 1 of 3)

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008R2

Data Mining is frequently used to optimize costs in marketing scenarios. Most marketing campaigns result in costs and we want these campaigns to be as efficient as possible.

To illustrate this process let’s assume we’re operating a web shop. Customers need to register before they can order goods and therefore we know how many customers are returning. While we also spend some afford to win new customers, for this example we will focus on keeping the existing ones and making them buy more in our shop.

From our web shop system we know, that 50% of our customers place only one single order and do not return (within a given period of time, for example within a year). Marketing suggests to offer our existing customers a benefit in order to make them return, let’s say a $10 voucher.

 

The Simple Approach

While we will analyze more sophisticated models later, for now we will take a look at a very simple approach, that does not take differences between individual customers into account. Therefore, we are looking at average values. For example, the average profit for each order is $25.

Having just the information from above, we cannot decide if the idea of the marketing department is a good one. But we could run a test sample. Let’s say we give the voucher to a random sample of 500 customers. Now we can compare customers who got the voucher with customers who didn’t. After this test we find out that

Group 1 (Customers with voucher): 15% do not return
Group 2 (Customers without the voucher): 50% do not return (as we already knew)

So, based on our test sample, the voucher is pretty effective to keep customers returning. We can also calculate the success rate using this formula:

Unbenannt

The success rate is impressive, but is it also sensible from a cost perspective? To answer this, let’s look at two options that we have:

  1. We give vouchers to none of our customers
  2. We give vouchers to all of our customers

For both options, we want to calculate the total costs, including costs from loosing existing customers. For all of the following examples we assume that we only have 1,000 customers and that we are looking at a specific time frame (for example a year). Let’s start with the first option. If we don’t give vouchers to any of our customers, the only costs come from loosing customers and therefore loosing profit. Since 50% of our customers are not returning and the average profit from each customer was $25, we loose 1,000 x 50% x $25 = $12,500 in this scenario. Now let’s look at the second option: From our test sample above we know that only 15% of the customers do not return after receiving a voucher. In the same way as for the first option we can calculate the loss value as 1,000 x 15% x $25 = $3,750. But for this option we also need to add the costs for the vouchers, which is 1,000 x $10 = $10,000. So the total costs for option 2 are $3,750 + $10,000 = $13,750. To sum things up:

Option

Costs

1 vouchers to none of our customers

$ 12,500

2 vouchers to all of our customers

$ 13,750

Of course there would also be the option to offer half of our customers a voucher, but this would just mean that we take the average of the two cost values above as shown in the following chart:

image

In this chart, option 1 corresponds to the left endpoint of the line while option 2 corresponds to the right endpoint of the line. So, although our vouchers are very successful, they don’t make sense here as they result in higher costs. Of course we could think of changing the value for the voucher but we would have to run another test to see how our customer base responds to the new voucher level (I’ll get back to this point in part 3 of this series). In most cases, the random sample method from above would have been done using different groups with different values for the vouchers. The effect on our success rate is usually not linear, so it’s worth finding the best value for a voucher. But we’ll keep things simple here and assume that the $10 voucher is already the best we could do.

 

Data Mining

Did I mention data mining in the title? So far we didn’t do any data mining with on our data. But maybe data mining gives us an option to optimize the costs in our scenario?

The main difference between our analysis from above and the data mining approach (and some other options that I will come to) is that now we are looking at the individual customer. In the first approach we just had an average customer drop-off rate of 50%. With data mining, we’re trying to predict the individual rate/score per customer. To do so, we have to set up a mining model that is trained with the knowledge we have about our customers (for example from the web registration form) and the individual buying behavior in the past. This model can then be used to predict returning customers. As this post is about cost optimization, we’re not actually building the model here but we just look at the output. So let’s assume that every customer now has a churn score associated with the customer record and that the model is validated and tested and performs well (see my other posts about back testing mining models).

We can now calculate the individual costs per customer for our two options from above:

Option

Costs

Option 1: Customers gets no voucher churnscore x [average profit]
Option 2: Customer gets voucher churnscore x (1- [prevention success rate]) x [average profit]
+ [cost of a single voucher]

For the first option, the calculation is simple. We only take the expectancy value of the profit loss here. If the individual churn score is for example 30% (0.3) then the costs in this calculation are 30% x $25 = $7.5. Please note that although we are looking at a data mining approach here, I’m still using the average profit of the customer. We will change this later to see the additional effect.

For the second option, we’re using our prevention success rate (70% from above). Even with the voucher, some customers will still not be returning. And as in the example from above, we have to add the costs of the voucher itself. If the individual churn score is 30% (0.3 as in the example above), then this case would result in the following costs: 30% x (100%-70%) x $25  + $10 = $12,25.

For my example I’m using randomly generated test data (with a modified random function). As long as our average rates from above match with the score we’re using here, it is obvious that both approaches give the same result for the two decisions from above: no customers gets a voucher and every customer gets a voucher. So how do we use our churn score now? The answer is simple: Instead of addressing for example 50% of the customers we could now address 50% of the customers with the highest churn scores as we assume that it is more likely for them to go away.

The chart from above may now look like this:

image

You can clearly see the effect of the optimization here (green line). As we’re talking about costs, the lower the better. The green line showing our data mining approach has a minimum at appox. 22%. The value at that point is $11,818. So, using the data mining approach we could save $682 based on our sample set of 1,000 customers, compared to the total costs of $12,500 for the trivial approach. This is an improvement of about 5.45%. Remember though, that we’re only looking at sample data here, so for your scenario the output may be totally different. Also, customers with a high churn rate might not respond to our vouchers in the same way as customers with a lower churn rate.

In the next part of this post we’ll focus more on the profit value of this scenario.

  • Share/Bookmark

BOL Community Article about Back-Testing Data Mining Results

Hilmar Buchta

SQL Server 2008 | SQL Server 2008 R2

In addition to my blog posts about back-testing Data Mining Results (see part 1, part 2 and part 3) I also wrote a Books Online Community Article that covers the topic at a more detailed level. The article can be downloaded here: http://msdn.microsoft.com/en-us/library/gg557481.aspx. The sample data and solution is available for download. The link is provided in the article.

You will find all the Community Articles for SQL Server 2008 here: http://msdn.microsoft.com/en-us/library/cc872864(SQL.100).aspx

  • Share/Bookmark

Cumulated Gains Chart and Lift Chart in SSRS

Hilmar Buchta

SQL Server 2008 | SQL Server 2008 R2

This post is about reproducing some of the mining charts that are built into Business Intelligence Development Studio using Reporting Services. A common use case might be that you periodically train your model and the results of the training should be published as html reports.

For my example I use the Adventure Works Targeted Mailing mining structure. We want to plot the results from the “TM Decision Tree Model”. First let’s start with the lift chart. In BIDS, the chart for “Bike Buyer=1” looks like this:

image

In order to reproduce the chart, we’re going to use the SystemGetLiftTable stored function in SSAS. Since we need to do some computations, I prefer to use SQL, so we have to define a linked server first:

EXEC sp_addlinkedserver
@server=’SSAS_AW’,
@srvproduct=”,
@provider=’MSOLAP’,
@datasrc=’Adventure Works DW 2008R2′

The query for loading the results from the lift chart looks like this:

SELECT percentile,
       VALUE,
       CASE
         WHEN percentile >= 100 * CONVERT(FLOAT, totalattributevaluecases) /
                            totalcases
       THEN 100
         ELSE percentile / (( CONVERT(FLOAT, totalattributevaluecases) /
                              totalcases ))
       END idealmodel
FROM   Openquery(ssas_aw,
‘CALL SystemGetLiftTable([TM Decision Tree], 2, ”Bike Buyer”, 1)’) AS
derivedtbl_1
 

For the chart, we use the percentile of the lift table as category group (x-axis) and choose these entries for the values (y-axis):

  • Sum(Value)
    This is the actual lift curve for our mining model
  • Sum(Percentile)
    As this is the same value as on the x-axis, this gives the random guess model line (linear)
  • Sum(IdealModel)
    The case-when-end statement above reflects the curve for the ideal model (linear function from 0 to 1 between 0 and totalattribute/totalcases)

This is how the chart looks like in the SSRS designer:

image

There are different names for this kind of a chart. For this post I’m using the naming conventions as shown here: http://www2.cs.uregina.ca/~hamilton/courses/831/notes/lift_chart/lift_chart.html, so the chart is labeled “Cumulated Gains Chart” instead of lift chart (BIDS).

So, here is the resulting Cumulated Gains Chart from my SSRS report:

image

As you can see, the chart looks pretty much the same as the chart that was displayed in BIDS.

Another useful chart is the actual lift factor of the model, often referred to as the ‘lift chart’. For this chart, we use the same SSAS function but we need to compute the lift in our query: 

SELECT percentile,
       VALUE / percentile AS lift,
       1                  AS randomguess
FROM   Openquery(ssas_aw,
‘CALL SystemGetLiftTable([TM Decision Tree], 2, ”Bike Buyer”, 1)’) AS
derivedtbl_1
 

This is how our chart looks like:

 image

Finally, we also want to reproduce the classification matrix (often referred to as confusion matrix). This is how the matrix looks like in BIDS:

image

To get these results, we use the function SystemGetClassificationMatrix as shown below:

SELECT predictedvalue,
       SUM(CASE actualvalue
             WHEN 0 THEN [COUNT]
             ELSE 0
           END) actual_0,
       SUM(CASE actualvalue
             WHEN 1 THEN [COUNT]
             ELSE 0
           END) actual_1
FROM   Openquery(ssas_aw,
‘CALL SystemGetClassificationMatrix( [TM Decision Tree], 2, ”Bike Buyer”)’
) AS derivedtbl_1
GROUP  BY predictedvalue
 

The second parameter of the function SystemGetClassificationMatrix means
1 – training data
2 – test data
3 – both (training and test data)

I used a simple matrix on my report to display the result from the query above and applied some coloring in order to distinguish error cases from correct cases:

image

To complete this post, I’d like to point out that you can also query your model cases directly, using a query like

SELECT *, IsTestCase() As TestCase FROM [TM Decision Tree].CASES

This query returns all the cases from the TM Decision Tree model with an additional column “TestCase” that contains true, of the case belongs to the test data set. In order to run this query you need to enable the drill-through option for the mining model. You can find out more about the SSAS mining functions in this post: http://www.bogdancrivat.net/dm/archives/14

  • Share/Bookmark

Do you trust your Data Mining results? – Part 3/3

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

This is the third and last post of my Data Mining back testing mini series. For my first posts I made the assumption that no action is taken based on the prediction. Although the mining process marked some of our contracts with a high probability for being cancelled, we did nothing to prevent our customers from doing so.

Of course, this is not the real life scenario. Usually our goal is churn prevention (or some other action based on the mining results). However, any action we take will (hopefully) change the behavior of our customers. Or to be more precise, we expect that less customers are really cancelling their contract. This is always a challenge for our mining process in total as the prevention may “dry out” our mining source data. After a while we might not have any non-influenced customers left with a high probability for cancelling the contract. A test group would help but in many scenarios this is not wanted (because the customers are just too important to risk loosing them just for the sake of some statistical validation). Another method would be to leave the influenced customers out of the training set. This doesn’t disturb our prediction but do we still know our prevention is successful enough?

For our back testing process, we can also add the expected prevention rate to our Monte Carlo procedure. In this case we’re not only trying to validate the model but also the prevention rate. On the other hand, if our test fails we’re not sure if it’s the model or our assumed prevention rate that is actually wrong.

For the following I assume that every customer with a churn probability greater then 35% gets prevention. From the past we know that 90% of our preventions are successful. Here are the results:

Old model (no prevention included):
image

New model (including prevention):
image

As expected the curves have shifted to the left. In the new model (including the prevention) the number of customers cancelling the contract is significantly lower than in the old model.

We can still use the same methods as shown in the first post to derive the threshold value for our test. This is the curve for the alpha and beta error (again: for beta for choose an alternative model that goes off by 3%).

image

Again, here are some values for the threshold T, alpha and beta:

T

Alpha

Beta

1580

88.7

0.0

1600

75.6

0.1

1620

56.7

0.4

1640

36.5

1.6

1660

19.5

5.1

1680

8.4

13.3

1700

3.0

27.0

1720

0.9

46.0

1740

0.2

65.7

For example with T=1680 we get alpha<10% and beta<14%. But as being said above, if our model fails the test we’re now not sure anymore if it is the model or the prevention rate. Again, the best way to avoid this uncertainty would be to use test groups. For example, we could exclude a random sample of 10% of all contracts with a churn score above 35% from prevention. Then we’re able to compare the behavior of the random sample (test group) with the customers that received prevention benefits. This test group enables us to measure the effectiveness of our prevention activity. And of course you could set up a statistical test in the same way we described here, to proof that the test group really supports a certain success rate.

To wrap things up, depending on the products, processes, the way new customers are acquired etc., the real-life data mining problems may be much more difficult than presented in this small blog series. Anyway, the methods described here, like the Monte Carlo algorithm or the statistical hypothesis test, are the main building blocks for achieving a reliable understanding how well the model performs. So, depending on the actual scenario, the above mentioned methods may be combined, adjusted and repeated to get the desired results.

This also ends this mini series about back testing. The basic methods are more or less ubiquitous and may also be used for several similar problems, for example to do the back testing for a financial risk model. By adjusting and combining the methods, a specific business requirement may be modeled into a reliable test. And as long as our model conforms to the test, we can have enough trust in our data mining models to use them as the basis for business decisions.

  • Share/Bookmark

Do you trust your Data Mining results? – Part 2/3

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

While the first part of this post was more about the idea and interpreting the results of the test, this part shows how to implement the Monte Carlo test.

First, we need a table with the predicted data mining probabilities. This is the output of the PredictProbability function from your mining result query. I’m using the same source data as in my previous post here. If you like you can easily create your own table and populate it with random probability values in order to test the code for the simulation below:

CREATE TABLE [dbo].[Mining_Result](
    [CaseKey] [int] NOT NULL,
    [PredictScore] [float] NULL
) ON [PRIMARY]

declare @i int=0

    while (@i<10000) begin

        insert into Mining_Result(CaseKey, PredictScore)
        values(@i, convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5)

        set @i=@i+1

end

Don’t be confused by the convert(…cast…cast newid()…) expression. This is just my approach to calculate a random number within an SQL select statement.

Next we need a table for storing our Mining results:

CREATE TABLE [dbo].[Mining_Histogram](
    [NumCases] [int] NOT NULL,
    [Count] [int] NULL,
    [Perc] [float] NULL,
    [RunningPerc] [float] NULL,
CONSTRAINT [PK_DM_Histogram] PRIMARY KEY CLUSTERED
(
    [NumCases] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

 

Then this is how we’re doing our Monte Carlo test:

truncate table Mining_Histogram;

declare @numtrials int = 10000;
declare @cnt int;
declare @lp int;

set @lp=0;

– perform a monte carlo test:

while (@lp<@numtrials) begin

    select @cnt=COUNT(*) from Mining_Result where PredictScore >
        convert(float,CAST(CAST(newid() AS binary(4)) AS int))/2147483648.0/2+.5        
    if exists(select NumCases from Mining_Histogram Where NumCases=@cnt)    
            update Mining_Histogram set [Count]=[Count]+1 Where NumCases=@cnt
        else
            insert into Mining_Histogram(NumCases,[Count]) values (@cnt, 1)
    set @lp=@lp+1;   

end

I’m using the same trick for the random numbers as shown above. In this example, we’re doing 10,000 iterations. For each iterations we compute the number of cases for which the Predicted Score is higher than a random number. For example, if for a certain case the predict score is 0.8 it is more likely that a random number between 0.0 and 1.0 is below the score than for a prediction score of 0.1.

Next, we’re filling the gaps in our histogram table with zeros to make the histogram look nicer:

declare @min int;
declare @max int;
select @min=MIN(NumCases), @max=MAX(NumCases) from Mining_Histogram

set @lp=@min;
while (@lp<@max) begin
    if not exists(select NumCases From Mining_Histogram Where NumCases=@lp)
        insert into Mining_Histogram(NumCases,[Count]) values (@lp, 0);
    set @lp=@lp+1
end

Finally we’re computing the row probability and the running total using this T-SQL:

declare @maxcount float;
select @maxcount=SUM([Count]) from Mining_Histogram;
update Mining_Histogram Set Perc=[Count]/@maxcount;

declare @CaseIdx int
declare @perc float
declare @RunningTotal float =0

DECLARE rt_cursor CURSOR FOR select NumCases, Perc From Mining_Histogram
OPEN rt_cursor

FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc

WHILE @@FETCH_STATUS = 0
BEGIN
  SET @RunningTotal = @RunningTotal + @perc
  update Mining_Histogram set RunningPerc=@RunningTotal Where NumCases=@CaseIdx
  FETCH NEXT FROM rt_cursor INTO @CaseIdx, @perc
END

CLOSE rt_cursor
DEALLOCATE rt_cursor

 

After running the simulation this is how the plots of the result look like (using my own values). The first plot shows the value of the field NumCases on the x-axis and the value of the field perc on the y-axis. The second plot has the same x-axis but shows the RunningPerc field on the y-axis:

image

 image

These two plots look very much the same as the plots from my last post (although I used C# code there to generate the histogram data).

If you used the randomly generated scores from above for testing, you will notice the peak being around 5000 cases (instead of 2800 cases in my example).

And if you like a smoother version of the density function  (as all the teeth and bumps mainly result from Monte Carlo approach), you could use this SQL query to compute a moving average:

declare @minrange int=0
declare @windowsize int = 50

select @minrange=Min(NumCases) from Mining_Histogram

SELECT     H.NumCases, AVG(H1.[Count]) [Count], AVG(H1.Perc) Perc
FROM         Mining_Histogram H
left join Mining_Histogram H1 on H1.NumCases between H.NumCases-@windowsize and H.NumCases

where H.NumCases>@minrange+@windowsize

group by H.NumCases

image

In order to do the histogram computation automatically with prediction query I recommend putting the code in an SSIS script component. I would also use another type of random number generator. This also allows you to set the seed for the random number generator. For my implementation I used an asynchronous script component that first loads all cases into memory (ArrayList collection), then performs the Monte Carlo test on the in-memory data and then writes the results back to the output buffer. This allows you do work with more scenarios and to log the progress during the loading and testing phase of the component.

I’m planning to write a Books Online Community Technical article on this topic. This article will be more detailed regarding the implementation. I will post a link to this article in my blog then.

  • Share/Bookmark

Do you trust your Data Mining results? – Part 1/3

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

Data Mining has been built into SQL Server since version 2005 and it’s quite comfortable and wizard-driven to design your mining models. However, Data Mining is not much about the toolkit but more about data preparation and interpreting the results. Without a proper data preparation, the algorithms will fail in really predicting or clustering the data. And the same is true for the interpretation of the results. But before we can start interpreting the data, we have to trust the results. At the design time of each mining model we can use test case holdouts, lift charts and cross validation to see if the model is robust and meaningful. But most of our prediction models try to predict a future behavior based on the knowledge of today and the past. What if there are significant changes in the market that are not already trained into our models? Is our model still correct or are we missing an important variable?

Today’s post is about implementing a back testing process to validate the mining results. For our example, we use a churn score prediction model. Think of a telecommunication company: Each customer has a 12 months contract that can be cancelled by the customer at the end of the period. If the contract is not cancelled, it’s continued for another 12 months. We want to predict how many customers are going to cancel their contracts during the next 3 months (the 3 months latency has to be build into our training set, but this is a different topic). To make things more simple in the first step let’s assume the company is not going to use the mining results (no churn prevention) but just waits 3 months to compare the reality with the prediction. That’s what we’re doing during a back testing.

So let’s assume that we did a churn prediction 3 months ago. Our results are returned from our data mining model as a table like this:

CaseKey

Churn
(
Churn Prediction)

ChurnProbability
(Churn Probability %)

1

true

87.4

2

false

7.1

3

false

1.7

4

true

50.2

5

false

11.3

6

false

16.0

7

false

6.9

8

false

1.8

9

false

2.6

10

false

18.7

For my example I have 30,000 contracts (cases). Churn prediciton = true means that this contract (case) is likely to be cancelled by our customer (predicted by the mining model). In my dataset this is true for 2010 cases. I left some more columns out here, but usually you are also looking at the support and other measures.

Now, 3 months have passed and we want to check how good our initial data mining model was. As said below we didn’t do anything to prevent our 2010 cases from above to cancel their contract. Now, looking at our CRM system reveals that actually 2730 customers of those 30,000 cancelled their contract. What does this mean? We expected 2010 to cancel but in reality it was 2730. Does this mean our model is wrong? Or can we still rely on the model?

The clue to answering this question is to compute how likely it is, that

a) Our model is correct and
b) We see 2730 customers to cancel their contract

Just to avoid confusion, we’re not looking at the error cases within our prediction (as we would do with a Receiver Operating Characteristics analysis) but we try to validate the model itself.

If the Churn probability is the same for each case we could use a binomial test validate the model(see one of my very first blog posts about this topic). Another way to do this computation is to run a Monte Carlo scenario generator against our data from above. Basically, the test works like this

  • Do N loops (scenarios)
    • Set the number of cancelled contracts x for this scenario to zero initially
    • Look at each case
      • Compute a random number r and compare this number with the ChurnProbability p
      • If the r<p count this case as cancelled (increment x)
    • increment the number of occurrences of x cancellations

I’m showing an implementation of this approach in my next post but for today let’s just concentrate on interpreting the results. For my example I used a SSIS script component to actually perform the Monte Carlo test. I used 30,000 scenarios and ended up with the following result:

image

As you can see, most of my scenarios ended with approx. 2800 cancellations (peak in the chart). This might be the first surprise. Assuming the mining algorithm was right, there are still much more cancellations happening than being predicted in the predicted churn column. How can this be? Well, actually the predicted value follows a very simple rule: It switches at 50%. This is a strong simplification of the true distribution. So instead of looking at the predicted values you should better look at the expectancy value:

Predicted value Expectancy value
select count(*) from Mining_Result where churn=1 select SUM(ChurnProbability) from Mining_Result
Result: 2010 Result: 2784

As you can see, the expectancy value matches our distribution histogram from above. In most situations, the expectancy value differs from the value count. This is highly dependent on the distribution of the probability values. For example, doing the same test with the bike buyer decision tree model in Adventure Works I got 9939 cases with a predicted value of 1 for the BikeBuyer variable. Here the expectancy value is about 9135, so in this case it is lower than the number of predicted cases.

Back to our histogram from above. We can easily replace the number of cases by the percentage value of the total cases. This results in the probability density function. In order to proceed we have to use the aggregated density function. For our example, this function looks like this

image

This function tells us the probability for seeing less than a certain number of cancellation. As expected, the probability to see less then 30,000 cancellations is 100% (as we only have 30,000 customer who could cancel). On the other hand, the probability to see less than 0 cancellations is 0%. Again it may be a surprise to see that actually the probability for seeing less than 2600 cancellations is close to 0 (from the graph above). How does this look around our real number of 2730 cancelled contracts? Here is the extract from the table:

NumCases

TotalProbability %

1-TotalProbability %

2726

7.9

92.1

2727

8.3

91.7

2728

8.7

91.3

2729

9.1

90.9

2730

9.5

90.5

2731

9.9

90.1

2732

10.3

89.7

2733

10.8

89.2

2734

11.2

88.8

In this table, the total probability is the aggregated probability from our chart above and means the probability for seeing less than NumCases cancellations while 1 minus total probability means the probability to see more than NumCases cancellations.

From this table you can see that the probability for seeing more than 2730 cancellations is still about 90.5%. Now let’s look at the area between 2740 and 2920 cancellations (to reduce the number of lines I’m only showing every 20th row):

NumCases

TotalProbability %

1-TotalProbability %

2740

14.4

85.6

2760

28.4

71.6

2780

46.5

53.5

2800

65.1

34.9

2820

80.7

19.3

2840

91.0

9.0

2860

96.6

3.4

2880

98.9

1.1

2900

99.7

0.3

2920

99.9

0.1

While it is still likely (86%) to see more than 2740 cancellations, it becomes more and more unlikely with higher the value gets for the cancellation. And seeing more than 2900 cancellations is very unlikely (less than 1%). Of course, this only refers to the case, that the model is operating correctly.

In order to make our back testing an easy procedure we want to define a simple threshold T. Our model passes the test as long as there are no more cancellations than T. If the model does not pass the test, we have to re-validate the variables and check the overall state of the model. We do not want to do this too often. Therefore, the probability for our model being correct and still not passing the test should be less than 10% (remember that we will run the mining prediction over and over again). Now we need to find a proper value for T.

From our table above we can see that T is close to 2840 cancellations. We can query the correct value from our histogram table:

select MIN(numcases) from mining_histogram  where 1-TotalProbability<0.1

The result is T=2838 for my example. So our 2730 cancellations from above are definitely below our test threshold T and therefore our model clearly passes the test.

Now that we’ve set a threshold to the condition “model correct and cancellations>2838”, what about the situation in which our model is incorrect. In this case we would assume that the real probability for a customer cancelling the contract is higher than predicted by our model. This is of course only one assumption we could make. Depending on the conditions and the environment, the definition of the “wrong model” can be different. In any case we have to define a “wrong” or alternative model.

For our example, this is how our model looks like with a 3% higher probability for cancellation (blue line).

image

For this chart, we have to create a separate histogram table for our alternative (wrong) model and also calculate the results in our Monte Carlo process. Again, we can read the probability for the condition “model is wrong but passes the test” from our histogram table. In our case it’s about 24.4%.

If you’re not interested in a more statistical view of our test you can now skip to the conclusion below.

Otherwise you probably already know that there are two possible mistakes we could make here:

  • model is correct but fails the test, usually referred to as type 1 or alpha error (false positive)
  • model is incorrect but passes the test, usually referred to as type 2 or beta error (false negative)
 

Model is correct

Model is incorrect
(alternative model is correct)

Test is negative, meaning the model passes the test

correct result(probability=1-alpha, so called specificity of the test) type 2 error / beta error

Test is positive, meaning the model does not pass the test

type 1 error / alpha error correct result (probability=1-beta, so called power or sensitivity of the test)

In order to get a better understanding of our test situation we can plot alpha and beta together into one chart:

image

The green line shows the probability for more than N cancellations in our correct model. The red line shows the probability for less than N cancellations in our wrong model. The bigger our threshold gets (the more we get to the right side in the diagram) the

  • lower our alpha gets (lower risk for the type 1 error “model is correct but does not pass the test”)
  • higher our beta gets (higher risk for type 2 error “model is incorrect but does pass the test”)

Here are some sample values for different values of T

T

Alpha

Beta

2780

53.5

1.8

2790

44.3

3.1

2800

34.9

5.2

2810

26.5

8.4

2820

19.3

12.7

2830

13.5

18.6

2840

9.0

25.9

2850

5.7

34.1

2860

3.4

43.5

2870

1.9

53.0

2880

1.1

62.2

If we want to keep alpha below 5% you can see that beta will over 35%. On the other hand, if we try to keep beta below 5% alpha we will over 35%. The “best” values for alpha/beta or not simply the value for T where alpha is almost equal to beta (here T=2826, alpha and beta approx. 16%) but depends on the error that you are wanting to minimize. In our example above we demanded alpha<10% which resulted in T=2830 and beta being about 25%.

In our case the reason for keeping alpha low would be to avoid readjusting the model too often (causing costs). On the other hand, keeping beta low reduces the risk of working with a wrong model and potentially loosing more customers. Basically this decision has to be made before actually defining the threshold value T.

 

Conclusion

For this example we defined a very simple test (number cancellations < 2838) which satisfies these two criteria:

  1. It is unlikely (<10%) that our model does not pass the test although it is correct
  2. It is unlikely (<25%) that our model passes the test although it is wrong (goes off by 3%)

While the 1st criteria means we’re not loosing more customers as expected, the 2nd criteria means we’re not spending too much budget on fine tuning the model.

It should be stated that the above calculations are done on modified (randomized) data. In practical life it can be more difficult to find a proper test for the model and also the tradeoff between error 1 and 2 can be much higher.

  • Share/Bookmark

Data Mining on a small amount of data

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

Data Mining is usually associated with finding previously unknown patterns in a large amount of data. But also a small amount of data may contain patterns, that are difficult to spot. In order to illustrate this, let’s look at the following situation. A customer with 80 stores wants to understand why some stores perform better than others. This is especially important before setting up new stores. It would be great to estimate the performance of the new store before building it up. This would make it much easier to decide on future store locations. Also, it might be useful for optimizing the performance of the existing stores. In order to find out about this, the customer collects some data per store as shown in the following table:

Criteria Description
Sales area Area for sales (in square meters)
Total working hours per week Total working hour for all staff members per week (avg over the last 3 months)
Total opening hours per week Total hours that the store is open per week
Location E.g. City center, City or Outskirts
Location type E.g. Mall, Plaza or separated
Store interior status Condition of the store, e.g. modern, average or old
Store age in months How old is this store
Parking facilities Is it easy to park near the store? E.g. values include good, average, bad
Average parking costs per hour What are the average parking costs per hour? Zero means free parking
Number of competitors within 10 minutes walk distance Number of competitors within 10 minutes walk distance
Number of competitors within 15 minutes driving distance Number of competitors within 15 minutes driving distance
Buying Power Buying power of the people how live near the store ranging from very low to very high.
Sales amount per week Average over the last three months

The data can be retrieved by querying the IT systems (for example HR and ERP), by using a survey (usually the staff in the store knows about the competitors and parking facilities around) or by using external market research data (for the Buying power). In my case, the data is just generated sample data:

image

After gathering all the data, things got more complicated as expected. How do you “score” each store? Which of the parameters are most relevant? Even if we’re only having 80 rows of data, it is not at all easy to see the dependencies.

For this example I’m using the Microsoft Office Data Mining Add-In so we can do all the data analysis using Excel. Since it’s Excel everything should be very easy. We want to use the Microsoft Decision Tree algorithm (Icon “Classify” from the ribbon bar):

image

The process is pretty easy. We have to decide which attribute we want to predict (Sales amount per week) and the wizard does all the rest. Now, here is the complete resulting decision tree:

image

No branches? What did we do wrong? Well, here are a few steps we should have taken, before simply invoking the decision tree. The most important thing is the proper preparation of the data:

 

1. Create relative measures, not absolute ones

The decision tree is capable of detecting rules like “if A then B” or “if A then not B” or even complicated combinations. However, dealing with continuous values is more difficult. The decision tree does not work quite well with rules like “if A is multiplied by 2, B is multiplied by 1.5” but tries to branch this as “if A is > 20 then B > 15”, “if A is >10, then B >7.5”. This might be especially true with input variables like our sales area or opening hours.

To quickly analyze the relationship we can use Excel’s Scatter chart type. Let’s start with the store size:

image

From this chart it seems reasonable to calculate sales by square meter instead of taking the absolute sales amount as there seems to be a more or less linear relationship between the sales and the store size.

Now let’s have a look at the influence of the opening hours. Again we’re using Excel’s Scatter chart:

image

As the trend line shows, the relationship seems to be a little bit logarithmic. However, let’s assume it’s also linear. Therefore we’re going to create an additional column in our spreadsheet computing sales by square meter and opening hours. This is the formula for our new column “Relative Sales”:

=[@[Sales Amount per week]]/[@[Sales Area]]/[@[Total opening hours per week]]

Of course, you would like to also check the influence of other variables, for example the age of the store:

image

This one looks pretty scattered, so we’re just taking the age of the store as an input variable.

But there is another relative measure we should create: The average number of sales persons in the store. We’re simply using this formula:

=[@[Total working hours per week]]/[@[Total opening hours per week]]

 

2. Make the input parameters discrete if possible

While it’s always a good idea to use discrete values for our input parameters, it almost becomes a must if you’re not having many rows of data. An example for a good discrete value is our location as it can only take these values: City center, City or Outskirts. The fewer the number of buckets, the better is it. If you’re not getting any results, try making the data more simple by choosing less buckets for your discrete values.

But look at our newly created columns for the relative sales or the average staff:

image

For our Mining purpose these are too many distinct values and although our decision tree will try to cluster them, we should do this in advance. Therefore we can either use Excel formulas or we could use the functionality of the data mining add-in: The “Explore Data” wizard:

image

After selecting the table and the column, the wizard analyzes the data and proposes some buckets as shown below:

image

For my purpose, I reduce the number of buckets to three (be brave!). By clicking the “Add new Column” button the resulting values are added as an additional column to our table:

image

I’m doing the same for the average staff members (4 buckets), the total opening hours (4 buckets) the store age in months (5 buckets) and the parking costs (4 buckets):

image

Our number of competitors is here 0-3 and 0-5, so we leave this data unchanged (not too many buckets).

So let’s try again with our prepared data set. Now, our decision tree looks like this:

image

As you can see there are only three influences identified by the mining algorithm here:

image

So, if parking is free, then the parking facilities are important while if parking is not free, the opening hours are important. This is a good start for looking for new store locations.

Another nice tool from Excel’s data mining add-in is the prediction calculator which can be found on the “Analyze” ribbon:

image

This one creates a ready to use input sheet in which you can enter the values of a potential new store and Excel immediately computes the likeliness for high sales:

image

This calculator is based on a different mining model (Logistic Regression). You can also see the impact of the input values on your sales:

image

In our case, smaller stores seem to perform better. The high value at the opening time below 41.5 hours may indicate that our computed column from above might not be well designed. And while there are quite a lot of stores with a sales area of less than 57 square meters there is only one single store which is already opened for 52 months: Our first store, which is definitely special (always equipped in the most modern style and only selected staff members are chosen to work there). Maybe you want to take this out of the data before doing the analysis.

So, after the mining you have to review your results properly. And of course you should also verify the other methods of making sure, your model is working fine (lift chart, case support as from my last post etc.).

  • Share/Bookmark

How much support do you need for your Data Mining results?

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

When querying your SSAS mining model you may have noted the prediction function predictsupport which is supplied by most of the mining algorithms, for example for the decision tree. The mining algorithm does not only predict a certain variable and gives the prediction score but also tells us how many cases were used to base this decision on (case support). You might have the feeling that cases with a lower support are not that reliable compared to cases with a higher support. This post is about how to determine the needed support for a given model.

To make this theoretical topic a little bit more practical, let’s look at actual data. For this post I’m using the targeted mailing decision tree model (TM Decision Tree) of the Adventure Works SSAS sample database. While you can do the same process described here with a data mining query result, I’m looking at the model itself instead. In order to do so I run the following query on my SSAS database:

SELECT FLATTENED [NODE_CAPTION]
        ,[NODE_DISTRIBUTION]
        ,[NODE_SUPPORT]
        ,[NODE_DESCRIPTION]

FROM [TM Decision Tree].content

WHERE IsDescendant(”, ’000000001′) AND [CHILDREN_CARDINALITY]=0

I left a lot of interesting columns out here to save the space for the query result. There are many more fields you can query (some depending on the mining algorithm).

By selecting “children cardinality equals zero”, we only get the child elements from our decision tree. And since we used the flattened keyword our NODE_SUPPORT element is returned as multiple lines here (one line per possible value, try removing FLATTENED from the query above to see the difference). So this is how the result looks like:

image

Since our Bike Buyer attribute can have three states (Missing, 0=no bike buyer, 1=bike buyer) we get three lines per node in our flattened decision tree model query. For example the first three lines belong to the node labeled “Year Income < 58000” which can be found at the top of the lower half in the decision tree model viewer:

image

If you look at the details of this node in the mining model viewer, you will see the following values:

image

This is almost exactly what we have in our table above. For this node that Mining algorithm found 1362+465=1827 cases that matched the conditions of the node. You can see the full list of conditions in the NODE_DESCRIPTION column (which I left out here). In my case this is

Number Cars Owned = 2 and Region not = ‘Pacific’ and Yearly Income < 58000

So from all the rows that matched this criteria 1362 had a Bike Buyer variable value of 0 and 465 had a value of 1 during the training of the model. The probability is then computed by the number of the cases, for example 100*465/1827 gives approx. 25.46%. Actually, the decision tree calculates the probability in a different way. For many cases this is very close to the quotient above, but there are differences. If you want to be exactly sure to control the method of calculation you may want to calculate the probability score on your own (for the example above this means to calculate 100*465/1827 instead of taking the probability provided by the mining algorithm).

If we use this model for prediction, probabilities above 50% will be mapped to the positive result. For our example, we want to predict the “Bike Buyer” variable. For cases that fall into the node displayed above, the result will be 0 (as the probability for zero is greater than 50%) meaning that these contacts are unlikely to buy a bike.

In order to proceed, I focus on the prediction value of Bike Buyer = 1. I copy the results from our query above to Excel and removed all lines for result 0 or missing. I also did some formatting of the table. These are the first rows of the resulting table.

image

In order to decide whether this support is good or bad, we have to decide on which basis we want to do the decision:

  1. Decision based on the predicted value
  2. Decision based on the predicted values probability

In the first case, we’re only using the predicted value from our mining. In our example it would be a prediction of the Bike Buyer variable as a value of true or false. Our concern would be that our support does favor the alternative decision with a high probability.

In the second case, we’re taking the real prediction probability for some further calculation, for example for the back testing (I will discuss this during my next posts) or for calculation expectancy values of our prediction (for example expected costs).

Let’s start with the first case here.

 

1. Decision based on the predicted value

This case is based on some rule to derive the prediction variable from the prediction score and optionally further attributes. SSAS does this on a very simple basis by changing the decision at 50%. But the rule can be very different. You may consider all rows with probability >30% as potential bike buyers or you may include your estimated revenue. You could also use the profit analysis that is built into SQL Server Data Mining to decide on the rule. But in any case, for your mining results you are only interested on the output of the rule (not on the “real” probability), in our case “true” or “false”.

For our example, I’m using the “default” rule of 50%. In order to understand the effect of the support, let’s look at our concerns. First we start with a case having a probability higher than 50% (or whatever our threshold is), for example take a look at line 3 of the Excel table from above. This case has a probability of 61.28% meaning that 65 our of 106 rows had the Bike Buyer variable set to 1.In this case, our concern is:

The real probability for “Bike Buyer =1”  is 50% or lower but still the random sample of data we used during training resulted in 65 positive rows.

If this happened, our model would be choosing the wrong decision for all rows that are matching the node’s criteria.

In order to deal with this concern statistically, I’m doing a simple and common trick here (maybe you should allow yourself some minutes to think about that). Let’s change the concern to

The real probability for “Bike Buyer =1”  is exactly 50% but still the random sample of data we used during training resulted in 65 or more positive rows.

For this concern we can simple use Excel’s binomial function to compute the probability:

1-BINOM.DIST(65,106,0.5,TRUE)

I used the Excel 2010 function here. If you’re using Excel 2007 or before, the function name is BINOMDIST. The above function returns a value of 0.7%. So it is very unlikely that our model training was based on a wrong decision.

For the lines with a probability of less than 50% our concerns are just the other way round (of course you could also take the same rules as above and look for “Bike Buyer=0”). For our example I reference the first line in the table above:

The real probability for “Bike Buyer = 1” in this specific node is 50% or higher but we still see 465 positive rows in our 1827 cases.

Again, I’m transforming this to:

The real probability for “Bike Buyer = 1” in this specific node is exactly 50% but we still see 465 or less positive rows in our 1827 cases.

In this case, we have to use this function to compute the probability:

BINOM.DIST(465,1827,0.5,TRUE)

Again the probability is very low (almost exactly 0%).

Let’s add this calculation to the Excel-table from above:

image

The formula for E7 is

=IF(C7<0.5,BINOM.DIST(B7,D7,0.5,TRUE),1-BINOM.DIST(B7,D7,0.5,TRUE))

This formula is then copied for all lines below.

How do we read this table? To keep things simple, low values in the last column are good (as in our two example values from above). The closer the predicted probability gets to 50% (or in other words, the weaker the prediction gets) the more support is needed to make it relevant.

How do you work with this table? Let’s say you want to be 95% sure that the support is strong enough to prevent our mining system from being trained for the wrong response. In this case you would filter out nodes from the table above that have a value of 5% or above in the last column. This is the result

image

In this case we would have filtered out 608 from 12939 cases used for the support (approx. 4.7%). What are you going to do with these cases? Well, this depends on your mining question. Typically you would consider these nodes to be bike buyers if you want to minimize the risk of loosing potential customers or you would consider them as not being bike buyers if your goal is to reduce costs (by excluding these cases from the mailing). In any case, you will have to ignore the response from the data mining process if you do not want to risk being misled.

In our case it is ok to drop 608 cases, but what do you do if all cases have a very poor support? In this case, you should review your model carefully. Seeing a lot of cases with a low support usually means that your model is “over trained”. For example, your decision tree model has too many nodes. Mining works best when the real world can be simplified. For the decision tree you might try to increase the required support or set a higher value for the complexity penalty parameter.

The more you want to be sure that your case support is good enough the more cases you will have to drop. For our example here are a few values:

image 

 

 

2. Decision based on the predicted values probability

In this case we’re not so much interested in a simple decision (bike buyer true or false) but more about the probability for each case. This also means we’re having to consider different concerns (although similar compared to the ones above). The whole process is very similar to the one above and if you are not too much interested in the outcomes you can just skip to the conclusion.

Basically, we only need to replace the 50% in the example above with an alternative model’s probability derived from the predicted probability. However, as described above, our concerns differ in the cases the favor a positive or negative result. So for our example, we’re considering a cut-off at 50% again.

Let’s start with the lines having a probability higher than 50%. For example, let’s look at the 3rd line from the table above. The probability found during training was 61.28% that is, 65 of the 106 rows hat the Bike Buyer variable set to 1.

In this situation our concern is that the real probability is less than 61.28% but still the random sample we used during training resulted in 65 positive rows. Ok, if the probability is still 61.279% this wouldn’t make much difference. Let’s try with a difference of 3%, so our concern (already transformed as described above) is:

The real probability for “Bike Buyer = 1” in this specific node is exactly 58.28% but we still see 65 or more positive rows in our 106 cases.

Again we’re using Excel’s binomial function to compute the probability:

1-BINOM.DIST(65,106,0.5828,TRUE)

The above functions returns a value of 23.26%. Although it is below 50% it is still significant (compare this to the 0.7% from above).

For the lines with a probability of less than 50% our concerns are just the other way round. For our example I use the first line from the table above as a reference again:

The real probability for “Bike Buyer = 1” in this specific node is exactly 28.46% but we still see 465 or less positive rows in our 1827 cases.

To compute this in Excel you have to take the following formula:

BINOM.DIST(465,1827,0.2846,TRUE)

In the case we get a result of 0.22% meaning it is extremely unlikely.

I added this calculation as an additional column to my Excel table from above. This is the result:

image

The formula for E7 is

=IF(C7<0.5,BINOM.DIST(B7,D7,C7+$B$4,TRUE),1-BINOM.DIST(B7,D7,C7-$B$4,TRUE))

This formula is then copied for all lines below.

Again we can now filter the table for whatever sureness we would like to have. If we want to be 80% sure that your case support does not favor a model that is 3% off, we would filter all lines from the table above that are having a value higher than 20% in the last column.

image

A lot or nodes do remain in this case (a lot more than in our discrete example from above): In this case we would have to ignore about 19.4% of our predictions.

Here are some sample results for the cases that need to be left out:

image

As expected, the more you want to be sure the more cases you have to drop. Also the less tolerance you allow for your model, the more cases you have to drop. So it’s up to you to find the best mix of security and usability of the model.

  • Share/Bookmark

  • Kategorien

  • Copyright © ORAYLIS Blog. All rights reserved.