Archive for August, 2010

Data Mining on a small amount of data

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

Data Mining is usually associated with finding previously unknown patterns in a large amount of data. But also a small amount of data may contain patterns, that are difficult to spot. In order to illustrate this, let’s look at the following situation. A customer with 80 stores wants to understand why some stores perform better than others. This is especially important before setting up new stores. It would be great to estimate the performance of the new store before building it up. This would make it much easier to decide on future store locations. Also, it might be useful for optimizing the performance of the existing stores. In order to find out about this, the customer collects some data per store as shown in the following table:

Criteria Description
Sales area Area for sales (in square meters)
Total working hours per week Total working hour for all staff members per week (avg over the last 3 months)
Total opening hours per week Total hours that the store is open per week
Location E.g. City center, City or Outskirts
Location type E.g. Mall, Plaza or separated
Store interior status Condition of the store, e.g. modern, average or old
Store age in months How old is this store
Parking facilities Is it easy to park near the store? E.g. values include good, average, bad
Average parking costs per hour What are the average parking costs per hour? Zero means free parking
Number of competitors within 10 minutes walk distance Number of competitors within 10 minutes walk distance
Number of competitors within 15 minutes driving distance Number of competitors within 15 minutes driving distance
Buying Power Buying power of the people how live near the store ranging from very low to very high.
Sales amount per week Average over the last three months

The data can be retrieved by querying the IT systems (for example HR and ERP), by using a survey (usually the staff in the store knows about the competitors and parking facilities around) or by using external market research data (for the Buying power). In my case, the data is just generated sample data:

image

After gathering all the data, things got more complicated as expected. How do you “score” each store? Which of the parameters are most relevant? Even if we’re only having 80 rows of data, it is not at all easy to see the dependencies.

For this example I’m using the Microsoft Office Data Mining Add-In so we can do all the data analysis using Excel. Since it’s Excel everything should be very easy. We want to use the Microsoft Decision Tree algorithm (Icon “Classify” from the ribbon bar):

image

The process is pretty easy. We have to decide which attribute we want to predict (Sales amount per week) and the wizard does all the rest. Now, here is the complete resulting decision tree:

image

No branches? What did we do wrong? Well, here are a few steps we should have taken, before simply invoking the decision tree. The most important thing is the proper preparation of the data:

 

1. Create relative measures, not absolute ones

The decision tree is capable of detecting rules like “if A then B” or “if A then not B” or even complicated combinations. However, dealing with continuous values is more difficult. The decision tree does not work quite well with rules like “if A is multiplied by 2, B is multiplied by 1.5” but tries to branch this as “if A is > 20 then B > 15”, “if A is >10, then B >7.5”. This might be especially true with input variables like our sales area or opening hours.

To quickly analyze the relationship we can use Excel’s Scatter chart type. Let’s start with the store size:

image

From this chart it seems reasonable to calculate sales by square meter instead of taking the absolute sales amount as there seems to be a more or less linear relationship between the sales and the store size.

Now let’s have a look at the influence of the opening hours. Again we’re using Excel’s Scatter chart:

image

As the trend line shows, the relationship seems to be a little bit logarithmic. However, let’s assume it’s also linear. Therefore we’re going to create an additional column in our spreadsheet computing sales by square meter and opening hours. This is the formula for our new column “Relative Sales”:

=[@[Sales Amount per week]]/[@[Sales Area]]/[@[Total opening hours per week]]

Of course, you would like to also check the influence of other variables, for example the age of the store:

image

This one looks pretty scattered, so we’re just taking the age of the store as an input variable.

But there is another relative measure we should create: The average number of sales persons in the store. We’re simply using this formula:

=[@[Total working hours per week]]/[@[Total opening hours per week]]

 

2. Make the input parameters discrete if possible

While it’s always a good idea to use discrete values for our input parameters, it almost becomes a must if you’re not having many rows of data. An example for a good discrete value is our location as it can only take these values: City center, City or Outskirts. The fewer the number of buckets, the better is it. If you’re not getting any results, try making the data more simple by choosing less buckets for your discrete values.

But look at our newly created columns for the relative sales or the average staff:

image

For our Mining purpose these are too many distinct values and although our decision tree will try to cluster them, we should do this in advance. Therefore we can either use Excel formulas or we could use the functionality of the data mining add-in: The “Explore Data” wizard:

image

After selecting the table and the column, the wizard analyzes the data and proposes some buckets as shown below:

image

For my purpose, I reduce the number of buckets to three (be brave!). By clicking the “Add new Column” button the resulting values are added as an additional column to our table:

image

I’m doing the same for the average staff members (4 buckets), the total opening hours (4 buckets) the store age in months (5 buckets) and the parking costs (4 buckets):

image

Our number of competitors is here 0-3 and 0-5, so we leave this data unchanged (not too many buckets).

So let’s try again with our prepared data set. Now, our decision tree looks like this:

image

As you can see there are only three influences identified by the mining algorithm here:

image

So, if parking is free, then the parking facilities are important while if parking is not free, the opening hours are important. This is a good start for looking for new store locations.

Another nice tool from Excel’s data mining add-in is the prediction calculator which can be found on the “Analyze” ribbon:

image

This one creates a ready to use input sheet in which you can enter the values of a potential new store and Excel immediately computes the likeliness for high sales:

image

This calculator is based on a different mining model (Logistic Regression). You can also see the impact of the input values on your sales:

image

In our case, smaller stores seem to perform better. The high value at the opening time below 41.5 hours may indicate that our computed column from above might not be well designed. And while there are quite a lot of stores with a sales area of less than 57 square meters there is only one single store which is already opened for 52 months: Our first store, which is definitely special (always equipped in the most modern style and only selected staff members are chosen to work there). Maybe you want to take this out of the data before doing the analysis.

So, after the mining you have to review your results properly. And of course you should also verify the other methods of making sure, your model is working fine (lift chart, case support as from my last post etc.).

  • Share/Bookmark

Quick Launch Navigation auf Webpart Pages anzeigen

Sandra Erb

 

Wird in SharePoint eine neue Page erstellt (sei es eine Webpart-, Publishing- oder Blank Page) wird in dieser die Quick Launch Navigation (zu deutsch Schnellstartleiste) standardmäßig nicht angezeigt. Dies gilt auch für Seiten vom Typ “Meeting Space”.

An dieser Stelle geht leider das einheitliche “Look and Feel” des SharePoint  verloren und viele möchten die Navigation wieder anzeigen. SharePoint selbst bietet hierfür auch in der 2010er Version noch keinen Schalter in der Oberfläche. Doch mit Hilfe des SharePoint Designers kann dieses Verhalten schnell geändert werden.

  1. Die gewünschte Page im SharePoint Designer öffnen (am einfachsten über das Kontextmenü der Page, direkt aus SharePoint heraus)
  2. Wer bereits mit dem SharePoint Designer 2010 arbeitet muss noch den “Erweiterten Modus” aktivieren

    image

  3. In der Codeansicht nach einem “SharePoint:UIVersionedContent”-Element mit der ID “WebPartPageHideQLStyles” suchen. Dieses muss komplett gelöscht werden

     image

  4. Ebenso die folgenden Placeholder Elemente löschen: 
    1. PlaceHolderPageImage (optional)
    2. PlaceHolderNavSpacer (optional)
    3. PlaceHolderLeftNavBar (erforderlich)

    image

  5. Speichern und die Seite im Browser neu aufrufen 

Tritt diese Anforderung häufiger auf, so sollte ein eigenes Template erstellt werden, welches zukünftig bei der Erstellung einer neuen Page verwendet wird.

  • Share/Bookmark

How much support do you need for your Data Mining results?

Hilmar Buchta

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

When querying your SSAS mining model you may have noted the prediction function predictsupport which is supplied by most of the mining algorithms, for example for the decision tree. The mining algorithm does not only predict a certain variable and gives the prediction score but also tells us how many cases were used to base this decision on (case support). You might have the feeling that cases with a lower support are not that reliable compared to cases with a higher support. This post is about how to determine the needed support for a given model.

To make this theoretical topic a little bit more practical, let’s look at actual data. For this post I’m using the targeted mailing decision tree model (TM Decision Tree) of the Adventure Works SSAS sample database. While you can do the same process described here with a data mining query result, I’m looking at the model itself instead. In order to do so I run the following query on my SSAS database:

SELECT FLATTENED [NODE_CAPTION]
        ,[NODE_DISTRIBUTION]
        ,[NODE_SUPPORT]
        ,[NODE_DESCRIPTION]

FROM [TM Decision Tree].content

WHERE IsDescendant(”, ’000000001′) AND [CHILDREN_CARDINALITY]=0

I left a lot of interesting columns out here to save the space for the query result. There are many more fields you can query (some depending on the mining algorithm).

By selecting “children cardinality equals zero”, we only get the child elements from our decision tree. And since we used the flattened keyword our NODE_SUPPORT element is returned as multiple lines here (one line per possible value, try removing FLATTENED from the query above to see the difference). So this is how the result looks like:

image

Since our Bike Buyer attribute can have three states (Missing, 0=no bike buyer, 1=bike buyer) we get three lines per node in our flattened decision tree model query. For example the first three lines belong to the node labeled “Year Income < 58000” which can be found at the top of the lower half in the decision tree model viewer:

image

If you look at the details of this node in the mining model viewer, you will see the following values:

image

This is almost exactly what we have in our table above. For this node that Mining algorithm found 1362+465=1827 cases that matched the conditions of the node. You can see the full list of conditions in the NODE_DESCRIPTION column (which I left out here). In my case this is

Number Cars Owned = 2 and Region not = ‘Pacific’ and Yearly Income < 58000

So from all the rows that matched this criteria 1362 had a Bike Buyer variable value of 0 and 465 had a value of 1 during the training of the model. The probability is then computed by the number of the cases, for example 100*465/1827 gives approx. 25.46%. Actually, the decision tree calculates the probability in a different way. For many cases this is very close to the quotient above, but there are differences. If you want to be exactly sure to control the method of calculation you may want to calculate the probability score on your own (for the example above this means to calculate 100*465/1827 instead of taking the probability provided by the mining algorithm).

If we use this model for prediction, probabilities above 50% will be mapped to the positive result. For our example, we want to predict the “Bike Buyer” variable. For cases that fall into the node displayed above, the result will be 0 (as the probability for zero is greater than 50%) meaning that these contacts are unlikely to buy a bike.

In order to proceed, I focus on the prediction value of Bike Buyer = 1. I copy the results from our query above to Excel and removed all lines for result 0 or missing. I also did some formatting of the table. These are the first rows of the resulting table.

image

In order to decide whether this support is good or bad, we have to decide on which basis we want to do the decision:

  1. Decision based on the predicted value
  2. Decision based on the predicted values probability

In the first case, we’re only using the predicted value from our mining. In our example it would be a prediction of the Bike Buyer variable as a value of true or false. Our concern would be that our support does favor the alternative decision with a high probability.

In the second case, we’re taking the real prediction probability for some further calculation, for example for the back testing (I will discuss this during my next posts) or for calculation expectancy values of our prediction (for example expected costs).

Let’s start with the first case here.

 

1. Decision based on the predicted value

This case is based on some rule to derive the prediction variable from the prediction score and optionally further attributes. SSAS does this on a very simple basis by changing the decision at 50%. But the rule can be very different. You may consider all rows with probability >30% as potential bike buyers or you may include your estimated revenue. You could also use the profit analysis that is built into SQL Server Data Mining to decide on the rule. But in any case, for your mining results you are only interested on the output of the rule (not on the “real” probability), in our case “true” or “false”.

For our example, I’m using the “default” rule of 50%. In order to understand the effect of the support, let’s look at our concerns. First we start with a case having a probability higher than 50% (or whatever our threshold is), for example take a look at line 3 of the Excel table from above. This case has a probability of 61.28% meaning that 65 our of 106 rows had the Bike Buyer variable set to 1.In this case, our concern is:

The real probability for “Bike Buyer =1”  is 50% or lower but still the random sample of data we used during training resulted in 65 positive rows.

If this happened, our model would be choosing the wrong decision for all rows that are matching the node’s criteria.

In order to deal with this concern statistically, I’m doing a simple and common trick here (maybe you should allow yourself some minutes to think about that). Let’s change the concern to

The real probability for “Bike Buyer =1”  is exactly 50% but still the random sample of data we used during training resulted in 65 or more positive rows.

For this concern we can simple use Excel’s binomial function to compute the probability:

1-BINOM.DIST(65,106,0.5,TRUE)

I used the Excel 2010 function here. If you’re using Excel 2007 or before, the function name is BINOMDIST. The above function returns a value of 0.7%. So it is very unlikely that our model training was based on a wrong decision.

For the lines with a probability of less than 50% our concerns are just the other way round (of course you could also take the same rules as above and look for “Bike Buyer=0”). For our example I reference the first line in the table above:

The real probability for “Bike Buyer = 1” in this specific node is 50% or higher but we still see 465 positive rows in our 1827 cases.

Again, I’m transforming this to:

The real probability for “Bike Buyer = 1” in this specific node is exactly 50% but we still see 465 or less positive rows in our 1827 cases.

In this case, we have to use this function to compute the probability:

BINOM.DIST(465,1827,0.5,TRUE)

Again the probability is very low (almost exactly 0%).

Let’s add this calculation to the Excel-table from above:

image

The formula for E7 is

=IF(C7<0.5,BINOM.DIST(B7,D7,0.5,TRUE),1-BINOM.DIST(B7,D7,0.5,TRUE))

This formula is then copied for all lines below.

How do we read this table? To keep things simple, low values in the last column are good (as in our two example values from above). The closer the predicted probability gets to 50% (or in other words, the weaker the prediction gets) the more support is needed to make it relevant.

How do you work with this table? Let’s say you want to be 95% sure that the support is strong enough to prevent our mining system from being trained for the wrong response. In this case you would filter out nodes from the table above that have a value of 5% or above in the last column. This is the result

image

In this case we would have filtered out 608 from 12939 cases used for the support (approx. 4.7%). What are you going to do with these cases? Well, this depends on your mining question. Typically you would consider these nodes to be bike buyers if you want to minimize the risk of loosing potential customers or you would consider them as not being bike buyers if your goal is to reduce costs (by excluding these cases from the mailing). In any case, you will have to ignore the response from the data mining process if you do not want to risk being misled.

In our case it is ok to drop 608 cases, but what do you do if all cases have a very poor support? In this case, you should review your model carefully. Seeing a lot of cases with a low support usually means that your model is “over trained”. For example, your decision tree model has too many nodes. Mining works best when the real world can be simplified. For the decision tree you might try to increase the required support or set a higher value for the complexity penalty parameter.

The more you want to be sure that your case support is good enough the more cases you will have to drop. For our example here are a few values:

image 

 

 

2. Decision based on the predicted values probability

In this case we’re not so much interested in a simple decision (bike buyer true or false) but more about the probability for each case. This also means we’re having to consider different concerns (although similar compared to the ones above). The whole process is very similar to the one above and if you are not too much interested in the outcomes you can just skip to the conclusion.

Basically, we only need to replace the 50% in the example above with an alternative model’s probability derived from the predicted probability. However, as described above, our concerns differ in the cases the favor a positive or negative result. So for our example, we’re considering a cut-off at 50% again.

Let’s start with the lines having a probability higher than 50%. For example, let’s look at the 3rd line from the table above. The probability found during training was 61.28% that is, 65 of the 106 rows hat the Bike Buyer variable set to 1.

In this situation our concern is that the real probability is less than 61.28% but still the random sample we used during training resulted in 65 positive rows. Ok, if the probability is still 61.279% this wouldn’t make much difference. Let’s try with a difference of 3%, so our concern (already transformed as described above) is:

The real probability for “Bike Buyer = 1” in this specific node is exactly 58.28% but we still see 65 or more positive rows in our 106 cases.

Again we’re using Excel’s binomial function to compute the probability:

1-BINOM.DIST(65,106,0.5828,TRUE)

The above functions returns a value of 23.26%. Although it is below 50% it is still significant (compare this to the 0.7% from above).

For the lines with a probability of less than 50% our concerns are just the other way round. For our example I use the first line from the table above as a reference again:

The real probability for “Bike Buyer = 1” in this specific node is exactly 28.46% but we still see 465 or less positive rows in our 1827 cases.

To compute this in Excel you have to take the following formula:

BINOM.DIST(465,1827,0.2846,TRUE)

In the case we get a result of 0.22% meaning it is extremely unlikely.

I added this calculation as an additional column to my Excel table from above. This is the result:

image

The formula for E7 is

=IF(C7<0.5,BINOM.DIST(B7,D7,C7+$B$4,TRUE),1-BINOM.DIST(B7,D7,C7-$B$4,TRUE))

This formula is then copied for all lines below.

Again we can now filter the table for whatever sureness we would like to have. If we want to be 80% sure that your case support does not favor a model that is 3% off, we would filter all lines from the table above that are having a value higher than 20% in the last column.

image

A lot or nodes do remain in this case (a lot more than in our discrete example from above): In this case we would have to ignore about 19.4% of our predictions.

Here are some sample results for the cases that need to be left out:

image

As expected, the more you want to be sure the more cases you have to drop. Also the less tolerance you allow for your model, the more cases you have to drop. So it’s up to you to find the best mix of security and usability of the model.

  • Share/Bookmark

  • Kategorien

  • Copyright © ORAYLIS Blog. All rights reserved.