# Modelling soccer with Logistic Regression #1

For many years I have been trying to find the right model to describe a given event in soccer and use it to calculate the probability of that event as accurately as possible. If my calculations are more accurate than the market's, I will be able to back or lay at odds that are more favourable than they should be. That equals edge!

When I started to model soccer, I began with Poisson regression. This model is well suited to count data, such as the number of goals scored by a team (the number of goals a team scores in a soccer match is approximately Poisson distributed). I used the model to estimate the probability that the home team scores 0, 1, 2, 3, … goals and the probability that the away team scores 0, 1, 2, 3, … goals. With those probabilities I can sum all the scorelines where the home team wins (1-0, 2-1, 3-1, 3-2 and so on) to get the total probability of a home win.
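To make the summation concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not the code behind the model in this post). It assumes the two teams' goal counts are independent, so that each scoreline probability is the product of two Poisson probabilities, and it truncates at 10 goals per team. The function names and the expected-goal values 1.5 and 1.1 are made up for the example.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number of goals is lam."""
    return exp(-lam) * lam ** k / factorial(k)


def match_outcome_probs(home_rate, away_rate, max_goals=10):
    """Sum scoreline probabilities into (home win, draw, away win).

    Assumption: home and away goal counts are independent, and scorelines
    with more than max_goals for either team are ignored.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical expected goals, for illustration only
home_win, draw, away_win = match_outcome_probs(1.5, 1.1)
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
```

The three probabilities should add up to almost exactly 1; the small shortfall is the mass of the truncated scorelines.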

Once I have the probability, I take 1/probability to get the implied odds. I then add some margin to those odds (my expected edge) and use the result as my target odds.
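As a small sketch of that arithmetic in Python: the post does not say exactly how the margin is applied, so the multiplicative margin below is an assumption, as are the function names. The idea is that a back bet only has value at odds above the fair price, while a lay bet only has value at odds below it.

```python
def implied_odds(probability):
    """Fair decimal odds for an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


def target_odds(probability, margin, side="back"):
    """Fair odds shifted by a multiplicative margin (an assumed convention).

    Backing needs odds above fair, laying needs odds below fair.
    """
    fair = implied_odds(probability)
    if side == "back":
        return fair * (1.0 + margin)
    if side == "lay":
        return fair * (1.0 - margin)
    raise ValueError("side must be 'back' or 'lay'")


# Hypothetical numbers: a 40% home win probability and a 5% margin
print(implied_odds(0.40))
print(target_odds(0.40, 0.05, side="back"))
print(target_odds(0.40, 0.05, side="lay"))
```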

However, I never really had any success with my Poisson model, despite working with it for quite some time. Success came only when I switched to a different model and rethought my approach.

Over five blog posts, starting with this one, I will walk you through the steps I take to create a model:

1. This post – Introduction to Logistic Regression
2. Setting up a model
3. Testing and optimising the model
4. Evaluating the model
5. Implementation

So, a few words about my model of choice:

Logistic regression is a great fit when you want to model the probability of a binary response variable as a function of some explanatory variables.

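Logistic regression estimates the coefficients of a formula that predicts a logit transformation of the probability that the characteristic of interest is present:

$$\operatorname{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

where p is the probability of presence of the characteristic of interest. The logit transformation is defined as the logged odds:

$$\text{odds} = \frac{p}{1 - p}$$

and

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$$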

Rather than choosing parameters that minimize the sum of squared errors (like in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values.
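For readers who want to see the mechanics, here is a minimal pure-Python sketch of the logit, its inverse, and maximum likelihood estimation. It uses plain gradient ascent on the log-likelihood, which is an illustrative choice and not what statistical packages do internally (they typically use Newton-type methods). The function names, the learning rate and the toy data are all made up for the example.

```python
from math import exp, log


def logit(p):
    """Logged odds of a probability p, with 0 < p < 1."""
    return log(p / (1.0 - p))


def inv_logit(z):
    """Map a linear predictor z back to a probability."""
    return 1.0 / (1.0 + exp(-z))


def predict(betas, xs):
    """P(y = 1) for one row; betas[0] is the intercept."""
    return inv_logit(betas[0] + sum(b * x for b, x in zip(betas[1:], xs)))


def log_likelihood(betas, rows, ys):
    """Log-likelihood of the observed 0/1 outcomes under the model."""
    total = 0.0
    for xs, y in zip(rows, ys):
        p = predict(betas, xs)
        total += log(p) if y == 1 else log(1.0 - p)
    return total


def fit_logistic(rows, ys, learning_rate=0.1, steps=5000):
    """Choose betas that maximise the log-likelihood by gradient ascent."""
    betas = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(betas)
        for xs, y in zip(rows, ys):
            err = y - predict(betas, xs)  # gradient contribution of one row
            grad[0] += err
            for j, x in enumerate(xs):
                grad[j + 1] += err * x
        betas = [b + learning_rate * g / len(rows) for b, g in zip(betas, grad)]
    return betas


# Made-up toy data: one explanatory variable, outcomes that overlap
rows = [[-2.0], [-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [2.0]]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1]

betas = fit_logistic(rows, ys)
print("coefficients:", betas)
print("P(y = 1 | x = 1.0):", predict(betas, [1.0]))
```

The toy outcomes overlap on purpose: if the data were perfectly separable, the likelihood would keep increasing as the coefficients grow and no finite maximum would exist.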

If you are a SAS user you might find this code helpful:

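    /* Fit a logistic regression; save the model, predictions and coefficients */
    proc logistic data = moddata outmodel = mod_param;
      class X1 X2 X3;                            /* treat X1-X3 as categorical */
      model response (event = '1') = X1 X2 X3;   /* model P(response = 1) */
      output out = modelout predprobs = i;       /* individual predicted probabilities */
      ods output ParameterEstimates = betas;     /* coefficient estimates as a data set */
    run;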

… where response is the binary outcome that you want to model and X1, X2 and X3 are the explanatory variables.

The next post in this series will offer some help and inspiration on how to set up a model.