Top10 supported soccer teams – Looking for edges – I/III

I decided to look into the European soccer teams with the biggest supporter bases. A quick Google search on the subject suggests the top 10 teams are:

1. Manchester United
2. Real Madrid
3. FC Barcelona
4. Chelsea
5. Arsenal
6. Liverpool
7. FC Bayern Munich
8. AC Milan
9. Juventus
10. Paris Saint-Germain

The list I found ranks the teams based on followers on social media, TV viewership, shirt sales and sponsorship deals, and I decided to go with that ranking. I have three questions I want to explore:

1. If many “play-for-fun” supporters bet on their favourite team, can laying them pre-game be a winning strategy?
2. What if one of these teams takes a one-goal lead in-play? Should I back or lay?
3. What if their opponent takes a one-goal lead? Should I back or lay?

I will split this subject into three different blog posts, starting with the pre-game question. My hypothesis is that when large amounts of supporter money hit the market, the odds on the big team get eaten, so there might be an edge in laying the big team.
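For clarity, laying a team means betting against it: you keep the backer's stake if the team fails to win, and you pay out stake × (odds − 1) if it wins. A minimal sketch of the P&L (the odds and stake here are made up for illustration):

```python
def lay_pnl(lay_odds: float, stake: float, team_won: bool) -> float:
    """P&L of a lay bet at the given odds, for the backer's stake.

    If the team wins, we pay out the backer's winnings: -stake * (odds - 1).
    If it fails to win, we keep the backer's stake.
    """
    return -stake * (lay_odds - 1.0) if team_won else stake

# Laying a heavily backed favourite at odds 1.50 for a 100 SEK stake:
print(lay_pnl(1.50, 100.0, team_won=True))   # the favourite wins: we lose the liability
print(lay_pnl(1.50, 100.0, team_won=False))  # the favourite slips up: we win the stake
```

The asymmetry is the whole point of the hypothesis: if supporter money pushes the favourite's odds too short, the liability we risk is smaller than a fair price would demand.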

I use my own data recorded from Betfair, with the following caveats:

1. The pre-game odds I use are recorded a few minutes before kick-off.
2. Matches between two top 10 teams are excluded.
3. The data runs from 2013 until today.
4. The data is not complete; some matches were never recorded and I know nothing about them.

The result I get is seen in the table below:


I am looking for a strategy that has been consistent over time, and where I get some exposure. Both the Premier League and the Primera Division have a history of positive yield when laying any of the top 10 teams. Maybe this could be an interesting pre-game strategy for 2017? These are big markets and you can probably get large sums matched pre-game.


Tools and infrastructure for analysing big data



It's been some time since the last update; I have been spending my (very limited) spare time recoding my whole analysis infrastructure. My bot works against a PostgreSQL database, but it just dumps raw data into it. I need to adjust, clean and process that data – something I previously used SAS to do. SAS is a great tool for analysing data, but not that great when it comes to structuring millions and millions of rows. So I have rewritten my SAS code as SQL, and I now create all my analysis tables with SQL code directly in the database. This has increased the speed of creating the tables considerably.
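The idea of building analysis tables inside the database can be sketched like this (illustrated with Python and an in-memory SQLite database; my real setup is PostgreSQL, and the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A stand-in for the raw table the bot dumps ticks into.
conn.execute("CREATE TABLE raw_ticks (match_id INT, minute INT, back_odds REAL)")
conn.executemany(
    "INSERT INTO raw_ticks VALUES (?, ?, ?)",
    [(1, 10, 2.5), (1, 10, 2.52), (1, 11, 2.6), (2, 10, 1.8)],
)

# Clean/aggregate with SQL inside the database instead of exporting raw rows to SAS:
conn.execute("""
    CREATE TABLE analysis_odds AS
    SELECT match_id, minute, AVG(back_odds) AS avg_back_odds
    FROM raw_ticks
    GROUP BY match_id, minute
""")
rows = conn.execute("SELECT * FROM analysis_odds ORDER BY match_id, minute").fetchall()
print(rows)
```

The point of the design is that only the small, aggregated table ever leaves the database; the millions of raw rows stay where the engine is built to handle them.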

For analysis I will use the free edition of SAS (University Edition). It is basically made for training, so it has some limitations. One of the things I stumbled into was connecting my database to SAS and reading from it directly. That feature is NOT included in the free edition, but it can be worked around with a few simple tricks: download the PC Files Server for SAS, run it in the background, and assign a libname with something like:
LIBNAME z PCFILES SERVER=computername port=9621 DSN=raw USER=username PASSWORD=xx;

This makes the free edition actually useful for high-volume data analysis. As I collect more and more data I also have to sharpen the infrastructure so that it stays manageable in size and processing time.

As a last improvement I am going to write a simulation module in VBA, custom-made for my needs – sometimes it is just not possible to fit everything into off-the-shelf software.

I have now spent many hours creating analysis tables that are exactly the same as before. Not as fun as improving models and hunting for areas with an edge, but with the improved structure I will be in a much better position when I start to analyse. Do the hard work now and hopefully reap the benefits later 🙂






Model of expected overtime in soccer

A couple of years ago I wrote a post (here) about the average match length in soccer, based on Betfair data. A few days ago I got a reply from David with an explanation of my findings (game length decreases with the number of goals). It actually made me curious enough to pick up the subject once more! My ordinary data does not contain the exact additional time (as announced by the fourth official), so being able to predict how long a match will be can be of real value to me. Instead of always guessing that 2.5 minutes are added, I can hopefully differentiate and make a better guess.

This time I will take it one step further and build a predictive model of the remaining game-time given that we just reached the full ordinary time (90 minutes).

The variables I will try in my model are:

1. Number of total goals (k59)
2. Number of total red cards (k60)
3. Number of total yellow cards (k61)
4. Absolute goal difference (k44_grp)

The short names in parentheses are the ones I use in my data; rather than renaming my whole database, I keep them so you can look them up in the table above.

First I need to choose a model. Looking at the distribution, it seems like a Gamma distribution could work:


I decide to go with a GLM with an underlying Gamma distribution, using the final game minute as response. I put all four variables into the model and estimate. I get:


All four included variables are significant, with "number of yellow cards" and "absolute goal difference" being the most explanatory. The estimation is done, and the expected game length is given by:

E[game-length] = 1/exp(-(4.5281 - 'total goals'*0.0002 + 'number of red cards'*0.0014 + 'number of yellow cards'*0.0012 - 'absolute goal difference'*0.0033))
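Plugging the coefficients into a small function (a sketch using the estimates from the output above, not my production code):

```python
import math

def expected_game_length(goals: int, red_cards: int, yellow_cards: int,
                         abs_goal_diff: int) -> float:
    """Expected total game length in minutes from the Gamma GLM above."""
    linear = (4.5281
              - 0.0002 * goals
              + 0.0014 * red_cards
              + 0.0012 * yellow_cards
              - 0.0033 * abs_goal_diff)
    return math.exp(linear)  # 1/exp(-x) simplifies to exp(x)

# A typical match: 2 goals, no red cards, 4 yellows, level at 90 minutes:
print(round(expected_game_length(2, 0, 4, 0), 1))
```

Note the sanity check built into the intercept: with everything at zero, exp(4.5281) is about 92.6 minutes, right around the old flat 92.5-minute assumption.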

I do a quick sanity check of the model by plotting expected game length as a function of 'absolute goal difference' (fixing red cards = 0 and yellow cards = 4, corresponding to their rounded averages). In the same graph I plot the one-way averages (the "real" average game length for each 'absolute goal difference', with red and yellow cards as they were).


There it is: a reasonable and logical model of game length, giving me a better tool than assuming 92.5 minutes for every match!


Match size potential

One of the most important things in getting rich on betting is NOT creating the best model – instead you need to balance your model against how much liquidity is available in the market. There is no use calculating the most accurate odds and loading them with your margin, only to find out that there is no one in the market to take your bets.


I know that I haven't really been on top of this issue myself, mainly because I bet with low stakes (and therefore almost always get matched). As my account has grown, the requested bet sizes have become bigger and bigger, and it will eventually be an issue I need to address more intelligently.


I made a graph to see how much I manage to get matched (as a percentage) at different requested amounts:


The "amount asked" is rounded to the nearest 250 SEK interval, so 0 means asked amounts < 125 SEK. Just as expected, there is a quite obvious trend showing the problem of getting the full amount through at higher stakes.


I've also added a trend line so that I can make a very simple prediction of my match % as my account (and stakes) grow. When playing around with it I found that an exponential trend had the best fit, so I use:

y = 0.8412*exp(-0.06x), where x is my stake group (0 = 0 SEK, 1 = 250 SEK, 2 = 500 SEK, …). Extrapolating this to higher stakes gives:


Ouch! I hope the curve flattens out more when reaching higher stakes in practice.
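The extrapolation behind the graph can be reproduced directly from the fitted trend (a sketch; the stake groups are the 250 SEK buckets described above):

```python
import math

def matched_share(amount_sek: float) -> float:
    """Predicted share of a requested amount that gets matched,
    from the fitted exponential trend y = 0.8412 * exp(-0.06x)."""
    group = round(amount_sek / 250)  # x: 0 = 0 SEK, 1 = 250 SEK, 2 = 500 SEK, ...
    return 0.8412 * math.exp(-0.06 * group)

for amount in (250, 1000, 2000, 3000):
    print(amount, round(matched_share(amount), 3))
```

At a 2000 SEK stake the trend predicts only about half the requested amount gets matched, which is exactly why the curve is worrying.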

One reason to be bothered by this is that I suspect I risk getting matched more in the parts of my model with lower expected value, meaning that the ROI of my model will eventually also decline as I reach higher stakes. So far I have no evidence of this, but to me it is logical.

I will follow up on this; I expect to reach the 2000 and 3000 SEK stakes within a year and will return to the subject then.




Betting results 2016-Q3

It is time to sum up the performance of my bot over the last three months (also known as the third quarter…). It has been a really good run; the model has performed above expectation in both ROI and turnover.

Breaking the quarter down on bet type:


We can conclude that the bot struggles to make the Away algorithm profitable; it only just gets it into the green (100.2% ROI). Despite that poor performance, the total lands at 101.75% ROI, which is above expectation (I aim for around 101%).

At the start of this fourth quarter I will make some adjustments to the away model and hopefully bring it into profit by the time I close the books for that quarter. I have estimated a new model, with one more explanatory variable than the old one, that seems very promising. I will keep the other models untouched (if it ain't broke, don't fix it…).

Looking back a couple of years, I now realise that 2016-Q3 is the seventh quarter in a row with positive results:



This is of course very pleasing, and clearly indicates that I have an edge in the market. These kinds of results only make me want to work harder and improve the models – now I know that it is possible and that I am on the right track (there have been many times over the years when I was on the verge of giving up).

Finally, looking at the yearly table:


2016 has so far turned over 3.8 MSEK and earned 41.7 KSEK at a ROI of 101.09%. Looking back at my old blog to see what my expectations were for 2016:

“So what do I hope for 2016? Its not of any value to have financial goals, I will just try to improve the model, bot and risk management as much as possible and hope for the best… But walking into 2016 with much better starting point than in 2015, a reasonable guess would be to turn at least 4 MSEK, and reach a ROI of 1 %. If that’s the case it would mean a profit around 40000 SEK.”

I am ahead!



Big data challenges


As a bot runner I have the wonderful opportunity to collect a lot of data. As an analyst I cannot say no to data; it is the engine in all of my analysis – and a bigger engine is always better! In practice this means that I collect all available data.

What does that mean? Let's go through some figures:

1. I follow up to 500 matches simultaneously
2. Each match is recorded at least once a minute; under some circumstances it is recorded once every 5 seconds.
3. Every record generates one line and 143 columns with different types of data.


I load all of my main data into one big table (at the time of writing it is about 26 GB, with about 26 million rows).

With my current hardware setup I run into problems now and then. When I am following a lot of matches (200+) and many of them are in the 5-second update phase, writing to the PostgreSQL database is not fast enough. At those times my bot cannot write every 5 seconds; sometimes the update frequency drops to 10–15 seconds instead! That is not good enough, and I need to improve performance.

What are my options?

1. Improving my code – Adding more advanced routines in my bot to handle and write data in the background. I am not really a big fan of this, as my coding expertise is at amateur level, but I will try to make some improvements.
2. Rearranging my PostgreSQL database – This I will need to do. I will partly create a new structure, keeping more tables but of smaller size.
3. Upgrading hardware – This I will also do. I don't think the bottleneck is the processor but rather the hard disk, so I will swap my HDD for an SSD, which writes data much faster.
4. Moving to a VPS – Might be an option in the future. If my bot performs OK there are several benefits to no longer relying on my basement-located home computer, such as speed and stability.
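Option 1 could look something like this in principle (sketched in Python for illustration, since the bot itself is written in VB): a background thread drains a queue and writes records in batches, so the recording loop never blocks on the database.

```python
import queue
import threading

class BufferedWriter:
    """Collect records on a queue and flush them to a sink in batches
    from a background thread, so the caller never waits on I/O."""

    def __init__(self, sink, batch_size=100):
        self.sink = sink              # e.g. a function doing a multi-row INSERT
        self.batch_size = batch_size
        self.q = queue.Queue()
        self.thread = threading.Thread(target=self._drain, daemon=True)
        self.thread.start()

    def write(self, record):
        self.q.put(record)            # cheap; the recording loop keeps its 5 s pace

    def _drain(self):
        batch = []
        while True:
            record = self.q.get()
            if record is None:        # sentinel: flush what is left and stop
                break
            batch.append(record)
            if len(batch) >= self.batch_size:
                self.sink(batch)
                batch = []
        if batch:
            self.sink(batch)

    def close(self):
        self.q.put(None)
        self.thread.join()

# Usage, with a plain list standing in for the database:
written = []
writer = BufferedWriter(written.extend, batch_size=2)
for i in range(5):
    writer.write({"match_id": 1, "tick": i})
writer.close()
print(len(written))  # all 5 records flushed
```

Batching matters because one INSERT of 100 rows is far cheaper than 100 single-row INSERTs; the queue just decouples recording speed from writing speed.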




Implementing the model

This is the fifth post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model

Let's assume we have an implementable model. The implementation phase has many times proven to be a real challenge for me: small errors in the implemented model have generated huge mispricings. As one example, I accidentally used betting stakes of 50% of my capital instead of 5%… it was a pure miracle that I didn't empty my bank (it actually turned out very profitable by pure luck, but I strive to replace luck with skill :)).

To avoid problems and to discover errors I:

– Lower the betting stakes on the newly implemented model
– Cross-reference the bets generated by the bot against simulated bets from another system
– Implement one model at a time
– Limit the implementation phase to one day per quarter

The cross-reference is done by running my model in both the VB environment and the SAS environment; the VB model executes the bets while the SAS model works as a reference. As soon as the calculations deviate I get notified.
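The notification logic itself is simple in principle; something along these lines (the names and the tolerance are hypothetical, not my actual code):

```python
def odds_agree(vb_odds: float, sas_odds: float, tolerance: float = 0.01) -> bool:
    """Return True if the two independently calculated odds agree
    within the given relative tolerance."""
    return abs(vb_odds - sas_odds) / sas_odds <= tolerance

# The executing model and the reference model should agree:
print(odds_agree(1.85, 1.86))   # small rounding difference: OK
print(odds_agree(1.85, 2.10))   # one of the implementations has a bug
```

The value of running the check in two independently written environments is that an implementation bug is very unlikely to appear identically in both.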

When it comes to betting stakes, a normal model currently runs at 3% of my capital per bet, while a newly implemented model runs at 0.3% instead.

Below you see a picture of my bot in action tonight at around 22.00. In a later post I will guide you through the structure and features of my bot. It is really cool!



Evaluating the model

This is the fourth post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model

In the previous post we created as good a model as we could. We know the model is statistically "fair", but we are not sure whether that means it is long-term profitable (i.e. has an edge in the market). So we need to test it on out-of-sample data!

After creating my dataset I divided it into two parts, one for modelling and one for out-of-sample simulation. I used the 2014 and 2015 data for modelling and selected 2016 as the simulation set.

I have now implemented the new model in SAS, and start out with the rule: create a back signal IF the offered odds are at least my calculated odds plus 1%. I also have a suspicion that if my model deviates too much from the market price, the flaw is in my model rather than in the market. I plot the accumulated ROI at different "edges" (deviations from the market price), together with the number of bets induced by the model in 2016.
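The back-signal rule can be written down directly (a sketch; the 1% threshold is the rule above, and the deviation cap reflects my suspicion that very large "edges" are model error – the cap value here is illustrative):

```python
def back_signal(offered_odds: float, calculated_odds: float,
                min_edge: float = 0.01, max_edge: float = 0.04) -> bool:
    """Back signal if the offered odds exceed our calculated ('fair') odds
    by at least min_edge; edges above max_edge are treated as a flaw in
    our own model rather than a market opportunity."""
    edge = offered_odds / calculated_odds - 1.0
    return min_edge <= edge <= max_edge

print(back_signal(2.10, 2.05))  # ~2.4% edge: signal
print(back_signal(2.06, 2.05))  # under 1% edge: no signal
print(back_signal(2.20, 2.05))  # ~7.3% edge: too good to be true, no signal
```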


I use linear regression to get a feel for how the model behaves, and conclude that a higher edge is probably just a sign that my model is missing some vital information. The model is probably OK for small deviations from the market, so I restrict it to work in the range of edges below 4%. I choose 4% because at that point I should get about 1% ROI (looking at the graph above for where the regression line crosses 1% ROI).

Running the model with this restriction, the 118 bets are distributed by month as follows:


It is a small sample, but the model implies a 1.1% ROI when simulated on the 2016 (out-of-sample) data – it might be a model worth taking further!

In "real life" I do much more extensive testing: I look for strange things by splitting my simulated data by different variables and try to find all the areas where the model doesn't work. It is not as easy as just finding unprofitable segments; they also need to be logical and explainable. Restricting the model on historical data risks creating an overfitted model with poor predictive power.

In the next post I will write about the implementation phase.


Testing and optimising the model

This is the third post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model

It is now time to fit the model to my explanatory variables. That means finding the coefficients Bk that best fit:

log(p/(1-p)) = B0 + B1*x1 + B2*x2 + … + Bk*xk

Feeding my dataset in and fitting the model (using SAS, but there are several tools out there such as R, Matlab and Excel), I get these results:


K43 is the game minute and g15 is the pre-game favourite.

There are several methods for checking how well your model fits the data (so that you can compare models and choose the best one). One of my favourite evaluation methods is to check the ROC curve:


The concept is that you compare your model's ability to predict the response with a random guess. The diagonal line is equivalent to a random guess, and the blue curve is the lift of my model (which is much better than a random guess, yippee!). The area under the curve measures how big the lift is and therefore how good the fit is. I found a grading scale for the area under the curve:

.90-1 = excellent (A)
.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)
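For reference, the area under the ROC curve can be computed directly as the probability that a randomly chosen positive case gets a higher predicted score than a randomly chosen negative one (the Mann–Whitney formulation; the scores below are made up):

```python
def roc_auc(pos_scores, neg_scores):
    """AUC = P(score of a random positive > score of a random negative),
    counting ties as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up predicted probabilities for wins (positives) and non-wins (negatives):
print(roc_auc([0.9, 0.8, 0.7, 0.6], [0.8, 0.5, 0.4, 0.3]))
```

The pairwise loop is O(n²), so for large datasets you would use a rank-based formula instead, but the value is identical.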

So, the model is “fair” … Is that enough to win money?!

Now we have a model describing the probability of an away win (from a 2-goal lead) given game minute and pre-game favourite; plotted, it looks like:

If you are unsure how to convert the formula into odds, here is an example.

Let's say we are in game minute 50 and the home team was the pre-game favourite. Then you get:

Log odds = 4.3327 + 50*0.0247 – 2.8406 = 2.7271

To get a probability you take 1/(1+exp(-log odds)) = 0.939. To get the odds you take 1/probability = 1/0.939 = 1.065. That is our own estimated odds for this case. By the same principle you can calculate all combinations of game minute and pre-game favourite and get your own estimated odds!
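The same conversion in code (coefficients taken from the output above; how g15 is coded for the "no pre-game favourite of either side" case is my assumption for illustration):

```python
import math

def estimated_odds(game_minute: int, home_fav: bool) -> float:
    """Convert the fitted logistic model into decimal odds for the away win
    (from a 2-goal lead). The home-favourite indicator carries -2.8406;
    the zero contribution otherwise is an assumption about the coding."""
    log_odds = 4.3327 + 0.0247 * game_minute - (2.8406 if home_fav else 0.0)
    prob = 1.0 / (1.0 + math.exp(-log_odds))
    return 1.0 / prob

# The worked example from the text: minute 50, home team pre-game favourite:
print(round(estimated_odds(50, home_fav=True), 3))
```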

In the next post I will continue to evaluate the model!



Setting up the model

This is the second post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model

To create a good predictive model you first need to decide what to model. It might sound very fundamental, but the choices are many: you can for example model the movement in odds between two points in time, or the probability of one more goal in the next 5 minutes given that the home team just scored to make it 1-0… In my opinion, if you want to find an edge relative to the market, try to model a niche event and be the best in that niche. I spent a few years building models to cover the general behaviour of in-play soccer, and some of my models were surely at least decent, but in betting decent is not sufficient – they need to be excellent to consistently beat the market! It was only when I started to niche my models that I found an edge. Rather dominate a small part of the market than be dominated by the big market 🙂

So let's do something for real – I am writing this blog post and preparing my data at the same time. I will tell you exactly what I am doing and which results I get out of it.

First things first: I need to define an area in which to create a model. I decide to create a model that operates on:

=> League matches in in-play Women’s soccer

That is how I partition my data. Now I need to create my response variable, a “question” which has a binary (0 or 1) outcome. I decide that the binary outcome I want to model is:

=> Will the away team win at full-time from a 2-goal lead, given that the latest goal was scored within the last 5 minutes?

Now the data needs some work (as usual…). I select all the cases from my history where the away team has a 2-goal lead at any point in the match. I create the response variable as "1" for all those cases where the 2-goal lead ends in a win, and "0" for all those where it ends in a draw or a loss.
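In code, the response construction amounts to something like this (a Python sketch with made-up records and field names; my real data lives in PostgreSQL and SAS):

```python
# Each record: (match_id, away_led_by_two, final_result_for_away),
# where final_result_for_away is "win", "draw" or "loss".
matches = [
    (1, True,  "win"),
    (2, True,  "draw"),
    (3, False, "win"),   # never a 2-goal away lead: not part of the dataset
    (4, True,  "loss"),
    (5, True,  "win"),
]

# Keep only cases where the away team led by two; response = 1 iff they won:
dataset = [(match_id, 1 if result == "win" else 0)
           for match_id, led_by_two, result in matches
           if led_by_two]
print(dataset)
```

Note that draws and losses both map to 0: the binary question is strictly "did the lead end in a win", nothing more granular.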

When the data is all organised the way I want it, I finally start building my model (this is the really fun part!). The concept is that I have my response variable (as created above) and now need to find input variables that help explain that response. Explanatory variables can for example be game minute, number of red cards, pre-game favourite and so on. In my data I have about 50 potential variables; some of them are hard facts such as game minute, and some are variables created by me (for example team form).

I now start testing for significant variables using a rough screening method (for example forward selection or backward elimination); in this case I get the following significant variables:

1. Game minute
2. Pre-game favourite

In this case I only get two significant variables to work with. It's not much, but I am trying to model a quite rare event (meaning little data) with only a few years of history available.

The next step is to actually scrutinise your significant variables: do they really explain the given response? Building a model on an illogical variable (even if it is significant) will only make prediction worse. In this case I consider both variables logical, so I keep them in the model.

That is the end of this blog post. In the next post I will continue to create and optimise my model.