Setting up the model

This is the second post in the following series:

1. Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model

To create a good predictive model you first need to decide what to model. It might sound very fundamental, but the choices are many: you can for example model the movement in odds between two time points, or the probability of one more goal in the next 5 minutes given that the home team just scored to make it 1-0. It is my opinion that if you want to find an edge against the market, you should model a niche event and be the best in that niche. I spent a few years trying to build models to cover the general behaviour of in-play soccer, and some of my models were certainly at least decent, but in betting decent is not sufficient: they need to be excellent to consistently beat the market! It was only when I started to specialise my models that I found an edge. Better to dominate a small part of the market than be dominated by the big market 🙂

So let's do something for real: I am writing this blog post and preparing my data at the same time. I will tell you exactly what I am doing and which results I get out of it.

First things first: I need to define an area in which to create a model. I decide to create a model that operates in:

=> League matches in in-play Women’s soccer

That is how I partition my data. Now I need to create my response variable, a “question” which has a binary (0 or 1) outcome. I decide that the binary outcome I want to model is:

=> Will the away team win at full-time from a 2-goal lead, given that the latest goal was scored in the last 5 minutes?

Now the data needs some work to get going (as usual…). I select all the cases from my history where the away team has a 2-goal lead at any point in the match. I create the response variable as “1” for all those cases where the 2-goal lead ends with a win, and “0” for all those cases where it ends in a draw or a loss.
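As a rough illustration of that labelling step, here is a minimal Python sketch. The record layout (`LeadCase` and its fields) and the three example matches are made up for this example; they are not my real data or my real data structure.

```python
from dataclasses import dataclass


@dataclass
class LeadCase:
    # Hypothetical record: one case where the away team held a 2-goal lead.
    match_id: str
    minute: int            # game minute at which the 2-goal lead arose
    final_home_goals: int
    final_away_goals: int


def response(case: LeadCase) -> int:
    """Return 1 if the away team wins at full-time, 0 for a draw or a loss."""
    return 1 if case.final_away_goals > case.final_home_goals else 0


# Invented example cases, for illustration only.
cases = [
    LeadCase("A", 30, 0, 2),   # lead held: away win
    LeadCase("B", 55, 2, 2),   # lead lost: draw
    LeadCase("C", 70, 3, 2),   # lead lost: away loss
]

labels = [response(c) for c in cases]
print(labels)
```

Each case ends up with exactly one binary label, which is what the logistic regression needs as its response.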

When the data is all organised as I want it, I finally start building my model (this is the really fun part!). The concept is that I have my response variable (as created above) and I now need to find input variables that help to explain that response. Explanatory variables can for example be game minute, number of red cards, pre-game favourite and so on. In my data I have about 50 potential variables; some of them are “hard coded facts” such as game minute and some are variables created by me (for example team form).

I now start testing for significant variables using a rough screening method (for example forward selection or backward elimination). In this case I get the following significant variables:

1. Game minute
2. Pre-game favourite
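
To give a feel for what a rough screening can look like, here is a minimal sketch of forward selection in Python with NumPy. It is only one possible way to do it, not necessarily the exact procedure or software I used: it assumes a likelihood-ratio test at the 5% level as the entry criterion, fits the logistic regression with a small hand-written Newton-Raphson routine, and runs on synthetic data invented purely for the example (the column names and coefficients are assumptions, not results from my history).

```python
import numpy as np

CHI2_CRIT_1DF = 3.841  # 5% critical value of the chi-square distribution, 1 df


def design(data, names, n):
    """Design matrix: an intercept column plus the named columns."""
    cols = [np.ones(n)] + [np.asarray(data[k], dtype=float) for k in names]
    return np.column_stack(cols)


def log_likelihood(X, y, iters=50):
    """Fit a logistic regression by Newton-Raphson; return its log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess + 1e-9 * np.eye(len(beta)), grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-12, 1.0 - 1e-12)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def forward_select(data, y, threshold=CHI2_CRIT_1DF):
    """Add the best candidate each round while its LR statistic passes the threshold."""
    n = len(y)
    selected, remaining = [], list(data)
    current_ll = log_likelihood(design(data, selected, n), y)
    while remaining:
        lls = {k: log_likelihood(design(data, selected + [k], n), y)
               for k in remaining}
        best = max(lls, key=lls.get)
        if 2.0 * (lls[best] - current_ll) < threshold:
            break  # no remaining variable improves the fit enough
        selected.append(best)
        remaining.remove(best)
        current_ll = lls[best]
    return selected


# Synthetic data, invented for this illustration only.
rng = np.random.default_rng(0)
n = 2000
data = {
    "game_minute": rng.uniform(0, 90, n),
    "pre_game_favourite": rng.integers(0, 2, n).astype(float),
    "noise": rng.normal(size=n),  # unrelated to the response by construction
}
logit = -1.0 + 0.04 * data["game_minute"] + 1.0 * data["pre_game_favourite"]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

selected = forward_select(data, y)
print(selected)
```

Backward elimination works the other way around: start with all candidates in the model and drop the weakest one each round. Either way, treat the output as a shortlist to think about, not as a finished model.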

In this case I only get two significant variables to work with. It's not much, but I am trying to model a quite rare event (meaning little data) and I only have a few years of history available.

The next step is to actually consider your significant variables. Do they really explain the given response? Building a model with an illogical variable (even if it's significant) will only create a worse model when it comes to prediction. In this case I consider both of them logical, so I keep them both in my model.

That is the end of this blog post. In the next post I will continue to create and optimise my model.
