Testing and optimising the model

This is the third post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model

It is now time to fit the model with my explanatory variables. This means finding the coefficients Bk that best fit:

logit(p) = B0 + B1*X1 + B2*X2 + ... + Bk*Xk

When I feed my dataset into the model and fit it (I use SAS, but there are several other tools out there, such as R, Matlab and Excel), I get these results:


K43 is the game minute and g15 is the pre-game favourite.

There are several methods to see how well your model fits the data (so that you can compare models and choose the best one). One of my favourite evaluation methods is to check the ROC curve:


The concept is that you compare your model's ability to guess the response with a random guess. The diagonal line is equivalent to a random guess, and the blue curve is the lift of my model (which is much better than a random guess, jippi!). The area under the curve is a measure of how big the lift is and therefore how good the fit is. I found a point system for grading how good your model is:

.90-1 = excellent (A)
.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)
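
For readers without a tool that draws the ROC curve, here is a minimal Python sketch of how the area under the curve can be computed and graded with the point system above. It relies on the fact that the area equals the share of (win, non-win) pairs where the model gives the win the higher probability. The labels and scores are made-up illustration values, not my data, and the function names are my own.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Counts how often a positive case (label 1) gets a higher score than
    a negative case (label 0); ties count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def grade(auc):
    """Map an AUC value onto the point system listed above."""
    if auc >= 0.90:
        return "excellent (A)"
    if auc >= 0.80:
        return "good (B)"
    if auc >= 0.70:
        return "fair (C)"
    if auc >= 0.60:
        return "poor (D)"
    if auc >= 0.50:
        return "fail (F)"
    return "worse than a random guess"


# Hypothetical outcomes (1 = away win) and model probabilities.
example_labels = [1, 1, 1, 0, 1, 0]
example_scores = [0.95, 0.90, 0.80, 0.85, 0.60, 0.40]

example_auc = roc_auc(example_labels, example_scores)
print(example_auc, grade(example_auc))
```

The pairwise loop is slow on large datasets, but it makes the meaning of the area easy to see.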

So, the model is “fair” … Is that enough to win money?!

Now we have a model that describes the probability of an away win (from 2 goals ahead) given the game minute and who was pre-game favourite, and plotted it looks like:

If you are unsure how to convert the formula into odds, I will give you an example.

Let's say you are in game minute 50 and the home team was pre-game favourite. Then you get:

Log Odds = 4.3327 + 50*0.0247 – 2.8406 = 2.7271

To get a probability you take 1/(1+exp(-log odds)) = 0.939. To get the odds we take 1/probability = 1/0.939 = 1.065. That is our own estimated odds for this case. By the same principles you can calculate all the combinations of game minutes and pre-game favourite and get your own estimated odds!
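
The same conversion can be sketched in a few lines of Python. The coefficients are the ones from the example above. The sketch only covers the case shown there (home team pre-game favourite), because how the other favourite levels enter the formula depends on how SAS coded the variable, which I have not shown. The function names are my own.

```python
import math

# Coefficients as used in the worked example above.
INTERCEPT = 4.3327
COEF_MINUTE = 0.0247
HOME_FAVOURITE_TERM = -2.8406  # term added when the home team is pre-game favourite


def log_odds_home_favourite(minute):
    """Log odds of an away win when the home team was pre-game favourite."""
    return INTERCEPT + COEF_MINUTE * minute + HOME_FAVOURITE_TERM


def probability(minute):
    """Convert the log odds into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds_home_favourite(minute)))


def estimated_odds(minute):
    """Decimal odds implied by the model probability (1 / probability)."""
    return 1.0 / probability(minute)


# Print the estimated odds for a few game minutes.
for minute in (30, 50, 70, 85):
    print(minute, round(probability(minute), 3), round(estimated_odds(minute), 3))
```

For minute 50 this gives the probability 0.939 and the odds 1.065 from the example.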

In the next post I will continue to evaluate the model!



Setting up the model

This is the second post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model

To create a good predictive model you first need to decide what to model. It might sound very fundamental, but the choices are many: you can, for example, model the movement in odds between two time points, or the probability of one more goal in the next 5 minutes given that the home team just scored 1-0… It is my opinion that if you want to find some edge in relation to the market, you should try to model a niche event and be the best in that niche. I spent a few years trying to build models to cover the general behaviour in in-play soccer, and some of my models were for sure at least decent, but in betting decent is not sufficient – they need to be excellent to consistently beat the market! It was only when I started to niche my models that I found edge. Rather dominate a small part of the market than be dominated by the big market 🙂

So let's do something for real – I am writing this blog post and preparing my data at the same time. I will tell you exactly what I am doing and which results I get out of it.

First things first: I need to define an area in which to create a model. I decide to create a model that operates in:

=> League matches in in-play Women’s soccer

That is how I partition my data. Now I need to create my response variable, a “question” which has a binary (0 or 1) outcome. I decide that the binary outcome I want to model is:

=> Will the away team win at full-time from a 2-goal lead, given that the latest goal was scored in the last 5 minutes?

Now the data needs some work to get going (as usual…). I select all the cases from my history where the away team has a 2-goal lead anywhere in the match. I create the response variable as “1” for all those cases where the 2-goal lead ends with a win, and “0” for all those cases where it's a draw or a loss.
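
As a rough illustration of that step, here is a small Python sketch that builds the response variable. The rows and field names are hypothetical; each row stands for a moment where the away team led by two goals, together with the final score of that match.

```python
# Hypothetical snapshots where the away team led by two goals.
# "home_final" and "away_final" are the full-time goals of that match.
snapshots = [
    {"minute": 35, "home_final": 1, "away_final": 3},
    {"minute": 62, "home_final": 2, "away_final": 2},
    {"minute": 80, "home_final": 0, "away_final": 2},
    {"minute": 51, "home_final": 3, "away_final": 2},
]


def response(row):
    """1 if the away team went on to win, 0 for a draw or a loss."""
    return 1 if row["away_final"] > row["home_final"] else 0


# Attach the binary response to every snapshot.
for row in snapshots:
    row["response"] = response(row)

responses = [row["response"] for row in snapshots]
print(responses)
```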

When the data is all organised as I want it, I finally start building my model (this is the really fun part!). The concept is that I have my response variable (as created above) and I now need to find input variables that help to explain that response. Explanatory variables can for example be game minute, number of red cards, pre-game favourite and so on. In my data I have about 50 potential variables; some of them are “hard coded facts” such as game minute and some are variables created by me (for example team form).

I now start testing for significant variables by using a rough screening method (for example forward selection or backward elimination). In this case I get the following significant variables:

1. Game minute
2. Pre-game favourite

In this case I only get two significant variables to work with. It's not much, but I am trying to model a quite rare event (meaning little data) and I only have a few years of history available.

The next step is to actually consider your significant variables. Do they really explain the given response? Building a model with an illogical variable (even if it's significant) will only create a worse model when it comes to prediction. In this case I consider both of them logical, so I keep them both in my model.

That is the end of this blog post. In the next post I will continue to create and optimise my model.


Modelling soccer with Logistic Regression #1

For many years I have been trying to find the right model to describe a certain event in soccer, and use it to calculate the probability of the event as accurately as possible. If my calculations are more accurate than the market, I will be able to back or lay odds that are more favourable than they should be – that equals edge!

When I started to model soccer I started with Poisson Regression. This model is good for count data, such as the number of goals scored by a team (meaning that the number of goals in a soccer match is approximately Poisson distributed). I used this model to estimate the probability that the home team scores 0, 1, 2, 3… goals and the probability that the away team scores 0, 1, 2, 3… goals. When I have those probabilities I can easily sum up all the cases where the home team wins (1-0, 2-1, 3-1, 3-2 and so on) to get a total probability for a home win.

When I get the probability I just take 1/probability to get the implied odds. I then add some margin to the odds (= my expected edge) and use that as my target odds.
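
To make the Poisson approach concrete, here is a minimal Python sketch of the summation. It assumes that home and away goals are independent, so each scoreline probability is the product of the two Poisson probabilities. The expected goals (1.5 and 1.1) and the 5% margin are made-up values. Multiplying the odds by (1 + margin) is only one possible way of adding a margin, not necessarily the one I used.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals when goals are Poisson with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def home_win_probability(home_mean, away_mean, max_goals=10):
    """Sum the probabilities of all scorelines where the home team scores more.

    Assumes independent home and away goals; scorelines above max_goals
    are ignored because their probability is negligible.
    """
    total = 0.0
    for home_goals in range(max_goals + 1):
        for away_goals in range(home_goals):  # strictly fewer away goals
            total += poisson_pmf(home_goals, home_mean) * poisson_pmf(away_goals, away_mean)
    return total


def target_odds(probability, margin):
    """Implied odds (1 / probability) with a multiplicative margin on top (assumption)."""
    return (1.0 / probability) * (1.0 + margin)


# Hypothetical expected goals for one match.
p_home = home_win_probability(home_mean=1.5, away_mean=1.1)
print(round(p_home, 3), round(target_odds(p_home, margin=0.05), 2))
```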

However, I did not really have any success with my Poisson model (despite working with it for quite some time). Instead I managed to build a successful model when I changed model and renewed my way of thinking.

In five upcoming blog posts I will walk you through the steps I take to create a model:

  1. This post – Introduction to Logistic Regression
  2. Setting up a model
  3. Testing and optimising the model
  4. Evaluating the model
  5. Implementation

So, a few words about my model of choice:

The logistic regression is fantastic when you want to model the probabilities of a response variable as a function of some explanatory variables.

Logistic regression generates the coefficients of a formula to predict a logit transformation of the probability of presence of the characteristic of interest:

logit(p) = B0 + B1*X1 + B2*X2 + ... + Bk*Xk

where p is the probability of presence of the characteristic of interest. The logit transformation is defined as the logged odds:

odds = p / (1 - p)

logit(p) = ln(p / (1 - p))

Rather than choosing parameters that minimize the sum of squared errors (like in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values.
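
To show what “maximize the likelihood” means in practice, here is a small Python sketch that fits a logistic regression by plain gradient ascent on the log-likelihood. This is a simplified stand-in for illustration, not the algorithm SAS uses internally. The toy data (one explanatory variable plus an intercept column of ones) and all names are made up.

```python
import math


def sigmoid(z):
    """Inverse of the logit: turns log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(coefs, rows, labels):
    """Log of the probability of observing the labels given the coefficients."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(b * xi for b, xi in zip(coefs, x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit(rows, labels, steps=5000, rate=0.1):
    """Gradient ascent: repeatedly nudge the coefficients uphill on the log-likelihood."""
    coefs = [0.0] * len(rows[0])
    for _ in range(steps):
        gradient = [0.0] * len(coefs)
        for x, y in zip(rows, labels):
            error = y - sigmoid(sum(b * xi for b, xi in zip(coefs, x)))
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        coefs = [b + rate * g / len(rows) for b, g in zip(coefs, gradient)]
    return coefs


# Toy data: first column is the intercept (always 1), second is one variable.
toy_rows = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3], [1.0, 0.4],
            [1.0, 0.5], [1.0, 0.6], [1.0, 0.7], [1.0, 0.8]]
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]

fitted = fit(toy_rows, toy_labels)
print(fitted, log_likelihood(fitted, toy_rows, toy_labels))
```

The fitted coefficients give a higher log-likelihood than the starting point of all zeros, which is the whole idea: among all candidate coefficients, keep the ones that make the observed outcomes most probable.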

If you are a SAS user you might find this code helpful:

/* Fit the model; save parameter estimates and predicted probabilities */
proc logistic data = moddata outmodel = mod_param;
class X1 X2 X3;
model response (event = '1') = X1 X2 X3;
output out = modelout predprobs = i;
ods output ParameterEstimates=betas;
run;

… where response is the binary output that you want to model and Xn are the explanatory variables.

The next post in this series will contain some help and inspiration on how to set up a model.



Market efficiency – looking for a niche

I calculated the implied probabilities from the pre-game odds (1/odds), and distributed the over-round evenly over the three scenarios. Then I summed them up, along with the actual wins.
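
Here is a minimal Python sketch of that calculation. It reads “distributed evenly” as subtracting an equal share of the over-round from each of the three outcomes, which is my interpretation for this example. Scaling the probabilities proportionally would be another option. The odds are made-up values.

```python
def implied_probabilities(odds):
    """Implied probabilities from decimal odds, with the over-round removed.

    The over-round (the amount by which the raw probabilities exceed 1)
    is split into equal shares and subtracted from each outcome.
    """
    raw = [1.0 / o for o in odds]
    over_round = sum(raw) - 1.0
    share = over_round / len(raw)
    return [p - share for p in raw]


# Hypothetical pre-game odds for (home, draw, away) in two matches.
matches = [
    (2.10, 3.40, 3.60),
    (1.60, 4.00, 5.50),
]

# Summing the away probabilities over all matches gives the number of
# away wins the market "expects", which can be compared with actual wins.
expected_away_wins = sum(implied_probabilities(odds)[2] for odds in matches)
print(round(expected_away_wins, 3))
```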

This first table shows that the market had the most problems with the Away predictions. The market's probabilities correspond to 29027 Away wins, but the Away teams actually only won 28192 times (which is a diff of 2.9%).

I dug a little deeper to see how the situation is when dividing by pre-game favourite:

Now I start to identify some areas where the market seems to have bigger problems… such as the probability of an Away win when the Home team is the pre-game favourite.

This could be a starting point when designing a new strategy. Find a niche where the market has problems, and try to be the best in that niche 🙂 Now I will continue to mine my data, and see if I can find a new niche area.


Average matched sums by league

The average is calculated from 2015 and 2016 data, and the top 25 leagues/events are:

The average matched sum is in SEK. The Premier League stands out as the number one betting market, with an average of 44.5 MSEK matched per match on the Match Odds market. That is more than twice the amount in Primera Division and four times Serie A!

The Community Shield is the match between Arsenal and Chelsea held on 2 August 2015 (won by Arsenal 1-0, by the way).

The Scandinavian markets are especially interesting for me, and the top 3 there are: