Hilmar Buchta

SQL Server 2016

In this post I’m demonstrating a use case for the new R Server (formerly known as Revolution R Enterprise) in SQL Server 2016 to calculate the expected value for CRM leads using a Monte Carlo approach.

To keep things simple, let’s assume we have the following leads in our CRM:

[Image: CRM table listing the leads with their values and win probabilities]

For example, the chance of winning lead number 7 with a value of $1,000,000 is 10%. So what is the amount of incoming orders we can plan with (assuming the probability of each individual lead is correct)? A common approach is to use a weighted sum (the sum over probability times value), which is easy to calculate in T-SQL:

select sum([Probability]*[Value]) ExpectedValue from CRM

[Image: query result showing an expected value of $256,000]

While this approach works well with a large number of leads of similar size, for the example above we have to realize that $100,000 of the $256,000 result from the relatively unlikely win of lead number 7. In fact, we can only win or lose this lead, which means a value of either $0 or $1,000,000 but nothing in between. So this approach may be misleading with skewed data.

Another approach is to use a threshold and only count the leads with a probability above the threshold. The query would look something like this:

select sum([Value]) ExpectedValue from CRM where [Probability]>=.7

[Image: query result showing an expected value of $52,000]

Here we’re only counting leads with a probability of at least 70%. However, we must be careful not to mistake the 70% threshold for a probability. It would be wrong to interpret the result as “with a probability of 70% we can expect incoming orders of at least $52,000”, because each lead is won or lost independently of the other leads.
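
To see why, here is a minimal sketch with purely hypothetical numbers (not the leads from the table above): two independent leads, each worth $10,000 with a 70% win probability. The threshold query would report $20,000, yet neither reaching $20,000 nor reaching $10,000 has a probability of 70%.

# Hypothetical example: two independent leads, each worth $10,000 at a 70% win probability
p <- c(0.7, 0.7)
p[1] * p[2]                  # both leads won, i.e. at least $20,000: 0.49
1 - (1 - p[1]) * (1 - p[2])  # at least one lead won, i.e. at least $10,000: 0.91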

So, what could be a more realistic method to estimate the expected value of the leads from above? One idea is to simulate cases in which each lead is converted into an order (or not) according to the individual probability of the lead. If we run, say, 100,000 such simulations, we can look at the distribution of the results to get a better understanding of the resulting total. This approach is called the Monte Carlo method. While we could implement this in T-SQL (for example, see an older blog post of mine about Monte Carlo in T-SQL), it is easier to do in R, and with the new R Server capabilities in SQL Server 2016 we can run the calculation directly in the database (see here for the basics about running R scripts from T-SQL stored procedures).

Let’s start with the resulting procedure code before I go into more details:

EXEC sp_execute_external_script
@language = N'R'
, @script = N'
set.seed(12345)
# one simulated experiment: count the value of each lead whose uniform draw does not exceed its probability
eval<-function() {sum(ifelse(runif(min = 0, max=1, n=nrow(mydata))<=mydata$Probability, mydata$Value,0))}
# repeat the experiment 100,000 times
r<-replicate(100000,eval())
# quantiles of the simulated totals, returned as a data frame
q<-quantile(r, probs = c(0.1, 0.5, 1, 2, 5, 10,25,50,100)/100)
result<-data.frame(q)
result$quantile<-rownames(result)
'
, @input_data_1 = N'select CRMID, Probability, Value from CRM'
, @input_data_1_name = N'mydata'
, @output_data_1_name = N'result'
WITH RESULT SETS ((
  [value] float
  ,quantile nvarchar(10)
));

The embedded R script (the value of the @script parameter) runs 100,000 random experiments on our input data. In each experiment, seven uniformly distributed random values are drawn, one for each row of our dataset. A lead’s value is only counted if its random value does not exceed the lead’s probability, which happens the more rarely the smaller that probability is. We then calculate quantiles over the simulated totals and return the result as a SQL table.
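
To make the mechanism explicit, here is a tiny, self-contained sketch with made-up numbers (not the CRM data): runif draws one uniform random number per lead, and a lead’s value is counted only when its draw does not exceed the lead’s probability, which happens with exactly that probability.

prob  <- c(0.9, 0.5, 0.1)         # hypothetical win probabilities of three leads
value <- c(20000, 10000, 100000)  # hypothetical lead values
u <- runif(n = length(prob), min = 0, max = 1)  # one uniform draw per lead
sum(ifelse(u <= prob, value, 0))  # total of the leads won in this single experiment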

Here is the result of this T-SQL statement:

[Image: result set with the calculated quantiles]

How do we read this result? Here are some examples:

  • Line 6: In 10% of the cases, the value was below $52,000 and, consequently, in 90% of the cases, the value was above $52,000
  • Line 2: In 99.5% of the cases, the value was above $15,000
  • Line 5: In 95% of all cases the value was above $37,000

Or, in other words, at a confidence level of 90% we can expect a total value of at least $52,000 here. So this approach does not just give a single number but lets you understand the expected result at a given confidence level.
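
Assuming the vector r of simulated totals from the script above is available (for example when running the R code interactively, as shown further below), these statements can be checked directly:

quantile(r, probs = 0.10)  # the value exceeded in roughly 90% of the simulations
mean(r >= 52000)           # share of simulations reaching at least $52,000, roughly 0.9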

Of course, T-SQL is not a good environment for developing and testing even a small R script like the one above. Usually, when working with R, you follow a more interactive approach. I suggest developing the script in an interactive R tool like RStudio. In order to do so, I’m using some simple wrapper code to provide the data set from SQL Server, as shown below:

library(RODBC)
db_connection <- odbcDriverConnect(
  paste("Driver={SQL Server Native Client 11.0}",
        "Server=localhost",
        "Database=CRMTest",
        "Trusted_Connection=yes", sep=";")
)

mydata <- sqlQuery(db_connection, "select CRMID, [Probability], [Value] from CRM")
mydata

#################################

set.seed(12345)
# one simulated experiment: count the value of each lead whose uniform draw does not exceed its probability
eval<-function() {sum(ifelse(runif(min = 0, max=1, n=nrow(mydata))<=mydata$Probability, mydata$Value,0))}
# repeat the experiment 100,000 times
r<-replicate(100000,eval())
# quantiles of the simulated totals, returned as a data frame
q<-quantile(r, probs = c(0.1, 0.5, 1, 2, 5, 10,25,50,100)/100)
result<-data.frame(q)
result$quantile<-rownames(result)

#################################

result
odbcCloseAll()

Again, the code between the comment markers is the final R code which we can copy over to our sp_execute_external_script call for production. The RStudio environment allows us to develop the script interactively. The surrounding code loads the data into a data frame with the same content as the input data set of our T-SQL call. The data transfer used by SQL Server is much more efficient; however, for testing purposes the ODBC call is sufficient.

For example, we can plot a histogram in R:

hist(r,breaks=50)

[Image: histogram of the simulated totals]

This shows the distribution of our simulated cases. For example, there are no cases that end with a total between $500,000 and $1,000,000 (a quick check for this follows below). Or we could plot, based on the histogram, the share of simulations that exceed a given total:

h<-hist(r, plot = F, breaks=1000)
# share of the 100,000 simulations whose total exceeds each histogram break
plot(x=h$breaks[-1], y=(100000-cumsum(h$counts))/100000, type="l", ylim=c(0,1))
cumsum(h$counts)

[Image: plot of the share of simulations exceeding a given total]
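
As a quick sanity check of the gap mentioned above (again assuming the vector r from the script), we can count the simulated cases between $500,000 and $1,000,000 directly:

sum(r > 500000 & r < 1000000)  # number of simulated totals in that range; 0 according to the histogram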

 

Conclusion

While R is mostly known for machine learning and advanced statistical calculations, it may also be useful for simple simulations like the one above, where we analyzed the distribution of CRM leads and calculated an expected value at a given confidence level. Doing the same in T-SQL would require quite a lot of SQL code, which in turn would make the procedure more difficult to read and understand (compared to our short R script). Another option would be to put the same logic into a CLR library, but then we would have to deploy the library separately instead of keeping the code in the database. However, developing R code with SQL Server tools like Management Studio is not much fun. Instead, we used a short wrapper around the code to develop and test it in an interactive R GUI like RStudio.