Books_20170913

Edge gone to hell, or seasonality?

It’s been some time since the last post; the reason is that I had to put some time into my betting models… My problems started in March when my bank slowly started to decrease instead of increase AS IT IS SUPPOSED TO.

At first I kept calm and let the bots run unchanged (AS ONE IS SUPPOSED TO), but the bad trend continued into April and from there into May. Quite a lot of bets were generated in this period (2,616 between 1 March and 15 May).

As seen in the table above, I have been seriously hurt in the Draw and Away models, and the yield is -1.3%, which is exactly my long-term yield target BUT WITH THE WRONG SIGN.

I am getting older and older and my memory might not be as sharp as it used to be (!), but I had a hunch that my models performed really badly in the same period last year. So I dived into my old betting history and pulled out the performance for the same period in 2016:

OK, my hunch was correct – I had shitty performance in the same period last year (-1.08% yield).

As a final check I plotted the accumulated yield% as a function of week, to see if there is some kind of similar behaviour:

It is now getting quite obvious that I have a repeating pattern between 2016 and 2017: great performance in January and February, terrible in March and April. If this pattern continues I will see better performance for the rest of the year.

I am still not sure what generates this “problem”. My guess is that it’s related to the mix of leagues (some start and some stop in March and April), a fact that I don’t take into account in my current models. So, as a start to address this issue (and hopefully avoid the same pattern in 2018), I have now developed a “League classification rating” which I intend to implement for the third quarter this year.

Some details about my new rating model will be presented in a later post.

 


Model of expected overtime in soccer

A couple of years ago I wrote a post (here) regarding the average match length in soccer, based on Betfair data. A few days ago I got a reply from David with an explanation of my findings (game length decreases with the number of goals). It actually made me curious to pick up this subject once more! In my ordinary data I don’t have the exact additional time (decided by the fourth official), so it can actually be of great value to me to predict how long the match will be. So instead of always guessing that 2.5 minutes are added, I can hopefully differentiate that and get a better guess.

This time I will take it one step further and build a predictive model of the remaining game-time, given that we have just reached full ordinary time (90 minutes).

The variables I will try in my model are:

1. Number of total goals (k59)
2. Number of total red cards (k60)
3. Number of total yellow cards (k61)
4. Absolute goal difference (k44_grp)

The short name in parentheses is the name I use in my data, so instead of me renaming my whole database you can look them up in the list above.

First I need to choose a model. I look at the distribution of ending game-minutes, and it seems like a Gamma distribution could be used:

post_20161018_1

I decide to go with a GLM with an underlying Gamma distribution, using the ending game-minute as the response. I put all four variables into my model and estimate. I get:

post_20161018_2

All four variables included are significant, and “number of yellow cards” and “absolute goal difference” are the ones that explain the most. With the estimation done, the expected game-length is given by:

E[game-length] = 1/exp(-(4.5281 - ‘total goals’*0.0002 + ‘number of red cards’*0.0014 + ‘number of yellow cards’*0.0012 - ‘absolute goal difference’*0.0033))
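For those who want to try this themselves, a minimal SAS sketch of such a fit could look like the following. This is not my exact code – the dataset name matches and the response name end_minute are assumptions, and I use a log link since that matches the exp-form of the fitted formula above:

proc genmod data = matches;
   /* Gamma GLM with log link: the response is the minute the match actually ends */
   model end_minute = k59 k60 k61 k44_grp / dist = gamma link = log;
   ods output ParameterEstimates = betas;   /* save the estimated coefficients */
run;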

I do a quick sanity check of the model by plotting expected game-length as a function of ‘absolute goal difference’ (locking the values for red cards = 0 and yellow cards = 4, corresponding to their rounded averages). In the same graph I plot the one-way averages (the “real” average game-length for each ‘absolute goal difference’ with red and yellow cards being what they were).

post_20161018_3

There it is, a reasonable and logical model of game-length – giving me a better tool than assuming 92.5 minutes for all matches!


Evaluating the model

This is the fourth post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model
5. Implementation

In the previous post we created as good a model as we could. We know that the model is statistically “fair”, but we are not sure whether that means the model is long-term profitable (i.e. has an edge in the market). So we need to test it on out-of-sample data!

After creating my dataset I divided it into two parts, one for modelling and one for out-of-sample simulation. I used the 2014 and 2015 data to model, and selected 2016 as the simulation set.

I have now implemented the new model in SAS, and start out with the rule: create a back signal IF the offered odds are at least my calculated odds plus 1%. I also have a suspicion that if my model deviates too much from the market price, then the flaw is in my model and not in the market. I plot the accumulated ROI by different “edges” (deviations from the market price) together with the number of bets induced by the model in 2016.

post_20160911_1

I use linear regression to get a feel for how the model behaves, and conclude that a higher edge is probably just a sign of me missing some vital information in my model. The model is probably OK for small deviations from the market, so I restrict it to work in the range of edges below 4%. I choose 4% because at that point I should get about 1% ROI (looking at the graph above and seeing where the regression line crosses 1% ROI).
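In SAS, the signal rule with this restriction can be sketched roughly like this (not my actual implementation – the dataset and variable names are assumptions):

data signals;
   set sim2016;                                     /* assumed out-of-sample 2016 data             */
   edge = offered_odds / model_odds - 1;            /* deviation of the market price from my price */
   back_signal = (edge >= 0.01 and edge <= 0.04);   /* back only if 1% <= edge <= 4%               */
run;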

Running the model with this restriction, this is how the resulting 118 bets are distributed by month:

post_20160911_2

It is a small sample, but the model implies a 1.1% ROI when simulated on the 2016 (out-of-sample) data – it might be a model worth taking further!

In “real life” I do much more extensive testing: I look for strange things by splitting my simulated data by different variables and try to find all possible areas where the model doesn’t work. It is not as easy as just cutting away unprofitable segments; the exclusions also need to be logical and explainable. Restricting the model purely on historical data risks creating an over-fitted model with poor predictive power.

In the next post I will write about the implementation phase.


Testing and optimising the model

This is the third post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model
5. Implementation

It is now time to fit the model with my explanatory variables. This means finding the coefficients Bk that best fit:

Logit(p) = B0 + B1*X1 + B2*X2 + … + Bk*Xk

When I feed the model with my dataset and fit it (using SAS, but there are several other tools out there such as R, Matlab and Excel) I get these results:

Post_20160831_1

K43 is the game minute and g15 is the pre-game favourite.

There are several methods to see how well your model fits the data (so that you can compare models and choose the best one). One of my favourite evaluation methods is to check the ROC curve:

Post_20160831_2

The concept is that you check your model’s ability to predict the response, in comparison to a random guess. The diagonal line is equivalent to a random guess, and the blue curve is the lift of my model (which is much better than a random guess, yippee!). The area under the curve is a measure of how big the lift is, and therefore of how good the fit is. I found a point system to grade your model:

.90-1 = excellent (A)
.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)

So, the model is “fair” … Is that enough to win money?!
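As a side note, if you want to produce this kind of ROC curve (and the area under it, the c-statistic) yourself, a minimal SAS sketch could look like this – the dataset name moddata is an assumption, and I am assuming the pre-game favourite g15 is treated as a class variable:

ods graphics on;
proc logistic data = moddata plots(only) = roc;
   class g15;                                /* assumed categorical: pre-game favourite     */
   model response (event = '1') = k43 g15;   /* k43 = game minute, g15 = pre-game favourite */
run;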

Now we have a model describing the probability of an away win (from a 2-goal lead) given the game minute and who was the pre-game favourite. Plotted, it looks like this:
Post_20160831_3
If you are unsure how to convert the formula into odds, I will give you an example.

Let’s say you are in game minute 50 and the home team was the pre-game favourite. Then you get:

Log odds = 4.3327 + 50*0.0247 - 2.8406 = 2.7271

To get a probability you take 1/(1+exp(-log odds)) = 0.939. To get the odds we take 1/probability = 1/0.939 = 1.065. That is our own estimated odds for this case. By the same principle you can calculate all combinations of game minute and pre-game favourite and get your own estimated odds!
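If you prefer to let SAS do the tabulation, a minimal sketch could look like this (the coefficient values are the ones from the example above; the dataset and variable names are made up for illustration):

data own_odds;
   do minute = 1 to 90;
      do home_fav = 0 to 1;                                    /* 1 = home team was pre-game favourite */
         log_odds = 4.3327 + minute*0.0247 - home_fav*2.8406;  /* linear predictor                     */
         p        = 1 / (1 + exp(-log_odds));                  /* probability of the away win          */
         own_odds = 1 / p;                                     /* our own estimated decimal odds       */
         output;
      end;
   end;
run;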

In the next post I will continue with evaluating the model!

 


Setting up the model

This is the second post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model
5. Implementation

To create a good predictive model you first need to decide what to model. It might sound very fundamental, but the choices are many: you can for example model the movement in odds between two time points, or the probability of one more goal in the next 5 minutes given that the home team just scored to make it 1-0… In my opinion, if you want to find some edge in relation to the market, you should model a niche event and be the best in that niche. I spent a few years trying to build models to cover the general behaviour of in-play soccer, and some of my models were for sure at least decent, but in betting decent is not sufficient – they need to be excellent to consistently beat the market! It was only when I started to niche my models that I found an edge. Rather dominate a small part of the market than be dominated by the big market 🙂

So let’s do something for real – I am writing this blog post and preparing my data at the same time. I will tell you exactly what I am doing and what results I get out of it.

First things first: I need to define an area in which to create a model. I decide to create a model that operates on:

=> League matches in in-play Women’s soccer

That is how I partition my data. Now I need to create my response variable, a “question” which has a binary (0 or 1) outcome. I decide that the binary outcome I want to model is:

=> Will the away team win at full-time from a 2-goal lead, given that the latest goal was scored within the last 5 minutes?

Now the data needs some work to get going (as usual…). I select all the cases from my history where the away team is 2 goals ahead at any point in the match. I create the response variable as “1” for all the cases where the 2-goal lead ends in a win, and “0” for all the cases where it ends in a draw or a loss.
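As a simplified sketch (ignoring the 5-minute condition, and with dataset and variable names that are assumptions rather than my real ones), the response creation in SAS could look something like this:

data model_cases;
   set match_history;                            /* assumed in-play history, one row per case */
   where away_goals - home_goals = 2;            /* away team currently two goals ahead       */
   response = (ft_away_goals > ft_home_goals);   /* 1 = away win at full-time, 0 = draw/loss  */
run;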

When the data is all organised as I want it, I finally start building my model (this is the really fun part!). The concept is that I have my response variable (as created above) and now need to find input variables that help to explain that response. Explanatory variables can for example be the game minute, the number of red cards, the pre-game favourite and so on. In my data I have about 50 potential variables; some of them are “hard coded facts” such as the game minute, and some are variables created by me (for example team form).

I now start testing for significant variables using a rough screening method (for example forward selection or backward elimination). In this case I get the following significant variables:

1. Game minute
2. Pre-game favourite

In this case I only get two significant variables to work with. It is not much, but I am trying to model quite a rare event (meaning little data) with only a few years of available history.
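For reference, such a screening step can be sketched in SAS like this (a minimal sketch, not my real code – the x1-x50 candidate list and the 0.05 entry level are illustrative assumptions):

proc logistic data = model_cases;
   /* forward selection over the candidates; categorical candidates would also need a CLASS statement */
   model response (event = '1') = x1-x50 / selection = forward slentry = 0.05;
run;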

The next step is to actually consider your significant variables: do they really explain the given response? Building a model with an illogical variable (even if it is significant) will only create a worse model when it comes to prediction. In this case I consider both of them logical, so I keep them both in my model.

That is the end of this blog post. In the next post I will continue to create and optimise my model.


Modelling soccer with Logistic Regression #1

For many years I have been trying to find the right model to describe a certain event in soccer, and to use it to calculate the probability of the event as accurately as possible. If my calculations are more accurate than the market, I will be able to back or lay odds that are more favourable than they should be – that equals edge!

When I started to model soccer I started with Poisson regression. This model is good for count data, such as the number of goals scored by a team (the number of goals in a soccer match is approximately Poisson distributed). I used this model to estimate the probability that the home team scores 0, 1, 2, 3… goals and the probability that the away team scores 0, 1, 2, 3… goals. Once I have those probabilities I can easily sum up all the cases where the home team wins (1-0, 2-1, 3-1, 3-2 and so on) to get a total probability for a home win.

When I have the probability I just take 1/probability to get the implied odds. I then add some margin to the odds (= my expected edge) and use that as a target odds.
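For illustration, here is a minimal SAS sketch of that summation (the expected-goal values, the 2% margin and all names are made up for the example, not taken from any real model):

data home_win;
   lambda_home = 1.6;                   /* assumed expected goals for the home team       */
   lambda_away = 1.1;                   /* assumed expected goals for the away team       */
   do h = 1 to 10;                      /* truncate at 10 goals; the tail is negligible   */
      do a = 0 to h - 1;                /* all scorelines where the home team wins        */
         p_home_win + pdf('Poisson', h, lambda_home) * pdf('Poisson', a, lambda_away);
      end;
   end;
   fair_odds   = 1 / p_home_win;        /* implied odds from the model                    */
   target_odds = fair_odds * 1.02;      /* add an illustrative 2% margin = expected edge  */
run;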

However, I did not really have any success with my Poisson model (despite working with it for quite some time). Instead, I managed to build a successful model when I changed approach and renewed my way of thinking.

In five upcoming blog posts I will walk you through the steps I take to create a model:

  1. This post – Introduction to Logistic Regression
  2. Setting up a model
  3. Testing and optimising the model
  4. Evaluating the model
  5. Implementation

So, a few words about my model of choice:

The logistic regression is fantastic when you want to model the probabilities of a response variable as a function of some explanatory variables.

Logistic regression generates the coefficients of a formula to predict a logit transformation of the probability of presence of the characteristic of interest:

Logit(p) = B0 + B1*X1 + B2*X2 + … + Bk*Xk

where p is the probability of presence of the characteristic of interest. The logit transformation is defined as the logged odds:

Odds=p/(1-p)

and

Logit(p)=ln(p/(1-p))

Rather than choosing parameters that minimize the sum of squared errors (like in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values.

If you are a SAS user you might find this code helpful:

proc logistic data = moddata outmodel = mod_param;
   class X1 X2 X3;                             /* declare categorical explanatory variables */
   model response (event = '1') = X1 X2 X3;    /* model the probability that response = 1   */
   output out = modelout predprobs = i;        /* store individual predicted probabilities  */
   ods output ParameterEstimates = betas;      /* save the estimated coefficients           */
run;

… where response is the binary output that you want to model and Xn are the explanatory variables.

The next post in this series will contain some help and inspiration on how to set up a model.