Big data challenges


As a bot runner I have the wonderful possibility to collect a lot of data. As an analyst I can not say no to data, it is the engine in all of my analysis – and a bigger engine is always better! In practise this means that I collect all available data.

What does it mean? Lets go through some figures:

1. I follow up to 500 matches simultaneously
2. Each match is recorded at least once a minute, during some circumstances they are recorded once every 5 seconds.
3. Every record generates one line and 143 columns with different types of data.


I load all of my main data into one big table (as we speak its about 26 GB big with about 26 million rows).

In my current hardware set up I run into problems now and then. When following a lot of matches (+200) and when many of them are in the 5 seconds update phase the writing of data into the PostgreSQL database is not fast enough. At those times my bot can not write every 5 seconds, sometimes the update frequency drops to 10-15 seconds instead! This is not good enough and I need to improve the performance.

What are my options?

1. Improving my code – Put more advanced routines to handle and write data in the background in my bot code. I am not really a big fan of this as my expertise in coding is on amateur level, but I will try to do some improvements.
2. Rearranging my PostgreSQL database – This I will need to do. I will partly create a new structure, keeping more tables but with smaller size.
3. Upgrading hardware – This I will also do. I don’t think that the problem is in the processor, but rather in the hard disk. So I will change my HDD into a SSD, it writes data much faster.
4. Moving to VPN – Might be an option in the future, if my bot performs OK there are several benefits to not use my basement located home computer any more, such as speed and stability.




This is the fifth post in the following series:

1.Introduction to Logistic Regression
2.Setting up a model
3.Testing and optimising the model
4.Evaluating the model

Lets assume we have an implementable model. The implementation phase have shown many times to be a real challenge for me, small errors in the implemented model have generated huge miss-pricing in the model. Just as one example I accidental used betting stakes at 50% of my capital instead of 5%…. just a pure miracle that I didn’t empty my bank (instead it actually became very profitable by pure luck, but I strive to replace luck with skills :)).

To avoid problems and to discover errors I do :

– Lower the betting stakes on the new implemented model
– Cross reference the bets generated by the bot with simulated bets from another system
– Implement one model at a time
– Implementation phase limited to one day every quarter

The cross reference is done by running my model in both VB environment as well as SAS environment, the VB model executes the bets and the SAS model works as a reference. As soon as the calculations deviates I get notified.

When it comes to the betting stakes I currently run a normal model at 3% of my capital on each bet, a newly implemented model runs on 0.3% instead.

Down below you see a picture of my bot in action tonight at around 22.00. In a later post I will guide you through the structure and features of my bot. It is really cool!



Evaluating the model

This is the forth post in the following series:

1.Introduction to Logistic Regression
2.Setting up a model
3.Testing and optimising the model
4.Evaluating the model

In the previous post we created a model as good as we could, we know that model is statistically “fair” but we are not sure if that means the model is long-term profitable (got an edge in the market). So we need to test it on out of sample data!

After the creation of my dataset I divided it onto two parts, one for modelling and one for out of sample simulation. I used 2014 and 2015 data to model, and selected 2016 as simulation set.

I have now implemented the new model in SAS, and start out with the rule: create a back signal IF the offered odds is at least my calculated odds plus 1%. I also have a suspicion that if my model deviates to much from the market price, then there is a flaw in my model and not in the market. I plot the accumulated ROI by different “edges” (or deviation from market price) together with the number of bets induced by the model in 2016.


I use linear regression to get a feel of how to model behaves, and conclude that a higher edge is probably just a sign of me missing some vital information in my model. This model is probably OK for small deviations from market, so I restrict the model to work in the range of edges below 4 %. I choose 4 % because at that point I should get about 1 % ROI (by looking at the graph above and seeing where the regression line crosses 1 % ROI).

Running the model through with this restriction to see how the 118 bets are distributed by month:


It is a small sample, but the model implies a 1.1% ROI when simulated on 2016 (out of sample) data – It might be a model worth going further with!

In “real life” I do much more extensive testing, I look for strange things by splitting my simulated data by different variables and try to find all possible areas where the model doesn’t work. It is not as easy as just finding unprofitable segments, they also need to be logical and explainable. Limiting the model on historical data might risk creating an over adapted model with poor predictive power.

In the next post I will write about the implementation phase.