Tools and infrastructure for analysing big data



It's been some time since the last update; I have been spending my (very limited) spare time recoding my whole analysis infrastructure. My bot uses a PostgreSQL database, but it just dumps raw data into it. I need to adjust, clean and process that data, something I previously did in SAS. SAS is a great tool for analysing data, but not that great when it comes to structuring millions and millions of rows. So I have rewritten my SAS code as SQL. I now create all the data tables I use for analysis with SQL directly in the database, which has made building the tables faster.
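To show the kind of step I mean, here is a minimal sketch of building a cleaned analysis table from a raw dump with plain SQL. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere. The table names (raw_records, analysis_records), the columns and the cleaning rule are all made up for illustration; they are not my real schema.

```python
import sqlite3


def build_analysis_table(conn):
    """Rebuild a cleaned analysis table from the raw dump using plain SQL."""
    conn.execute("DROP TABLE IF EXISTS analysis_records")
    conn.execute(
        """
        CREATE TABLE analysis_records AS
        SELECT match_id,
               recorded_at,
               CAST(price AS REAL) AS price   -- raw value arrives as text
        FROM raw_records
        WHERE price IS NOT NULL               -- drop incomplete rows
          AND CAST(price AS REAL) > 1.0       -- hypothetical sanity rule
        """
    )
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_records (match_id INTEGER, recorded_at TEXT, price TEXT)"
    )
    # Invented sample rows: two usable, one missing, one implausible.
    conn.executemany(
        "INSERT INTO raw_records VALUES (?, ?, ?)",
        [
            (1, "12:00:00", "2.10"),
            (1, "12:01:00", None),
            (2, "12:00:00", "0"),
            (2, "12:01:00", "1.85"),
        ],
    )
    build_analysis_table(conn)
    return conn.execute(
        "SELECT match_id, price FROM analysis_records ORDER BY match_id"
    ).fetchall()
```

The point is that the filtering and type conversion happen inside the database in one statement, instead of pulling every raw row out into another tool first.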

For analysis I will use the free edition of SAS (University Edition). It is basically made for training, so it has some limitations. One thing I stumbled into was connecting SAS to my database and reading directly from it. That is NOT a feature of the free edition, but it can be worked around with a simple trick. Download the PC Files Server for SAS, run it in the background and assign a libname with something like:
LIBNAME z PCFILES SERVER=computername port=9621 DSN=raw USER=username PASSWORD=xx;

This makes the free edition actually useful for analysing large volumes of data. As I get more and more data I also have to sharpen the infrastructure so that it stays manageable in size and time.

As a last improvement I am going to write a simulation module in VBA, custom-made for my needs. Sometimes it's just not possible to fit all my needs into off-the-shelf software.

I have now spent many hours creating analysis tables that are exactly the same as before. It is not as fun as improving models and looking for areas with an edge, but with the improved structure I will be in a better position when I start to analyse. Do the hard work and hopefully get the benefits later 🙂






Big data challenges


As a bot runner I have the wonderful possibility to collect a lot of data. As an analyst I cannot say no to data; it is the engine in all of my analyses, and a bigger engine is always better! In practice this means that I collect all available data.

What does that mean? Let's go through some figures:

1. I follow up to 500 matches simultaneously.
2. Each match is recorded at least once a minute; under some circumstances it is recorded once every 5 seconds.
3. Every record generates one row with 143 columns of different types of data.
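
To get a feel for what these figures add up to, here is a small back-of-the-envelope calculation. It assumes the two extreme cases (all 500 matches on the once-a-minute schedule, or all 500 on the 5-second schedule); the real load sits somewhere in between and I have not measured the mix.

```python
def rows_per_hour(matches, interval_seconds):
    """Rows written per hour when every match is recorded once per interval."""
    return matches * 3600 // interval_seconds


COLUMNS = 143  # columns per row, from the figures above

slow = rows_per_hour(500, 60)  # every match recorded once a minute
fast = rows_per_hour(500, 5)   # every match recorded every 5 seconds

print(slow, fast)                      # 30000 360000
print(slow * COLUMNS, fast * COLUMNS)  # 4290000 51480000
```

So even the quiet case is tens of thousands of rows an hour, and the busy case is more than ten times that.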


I load all of my main data into one big table (right now it is about 26 GB, with about 26 million rows).

With my current hardware setup I run into problems now and then. When I follow a lot of matches (200+) and many of them are in the 5-second update phase, writing data into the PostgreSQL database is not fast enough. At those times my bot cannot write every 5 seconds; sometimes the update interval stretches to 10-15 seconds instead! This is not good enough and I need to improve the performance.
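
Put as write rates, the gap looks roughly like this. This is a rough sketch that assumes exactly 200 matches, all in the 5-second phase at the same time, which is a simplification of what really happens.

```python
def rows_per_second(matches, interval_seconds):
    """Rows per second when every match is written once per interval."""
    return matches / interval_seconds


needed = rows_per_second(200, 5)          # what the bot should manage
achieved_10 = rows_per_second(200, 10)    # what a 10-second interval implies
achieved_15 = rows_per_second(200, 15)    # what a 15-second interval implies

print(needed, achieved_10, round(achieved_15, 1))  # 40.0 20.0 13.3
```

Under that assumption the bot needs about 40 rows per second but only manages somewhere between 13 and 20 when it falls behind.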

What are my options?

1. Improving my code – Add more advanced routines to my bot code that handle and write data in the background. I am not really a big fan of this, as my coding expertise is at amateur level, but I will try to make some improvements.
2. Rearranging my PostgreSQL database – This I will need to do. I will partly create a new structure, with more tables that are each smaller.
3. Upgrading hardware – This I will also do. I don't think the problem is the processor, but rather the hard disk. So I will replace my HDD with an SSD, which writes data much faster.
4. Moving to a VPS – Might be an option in the future. If my bot performs OK, there are several benefits to no longer using my basement-located home computer, such as speed and stability.
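
For option 1, one possible shape of "writing in the background" is a queue plus a worker thread that writes rows in batches, so the bot loop never waits on the database. The sketch below is only an illustration and not my bot's code: the class name BackgroundBatchWriter, the flush callback and the max_batch setting are all hypothetical. The flush function is where a real database write would go, for example a multi-row INSERT or COPY inside one transaction.

```python
import queue
import threading


class BackgroundBatchWriter:
    """Collect rows on a queue and pass them to `flush` in batches from a worker thread."""

    _STOP = object()  # sentinel that tells the worker to finish

    def __init__(self, flush, max_batch=500):
        self._flush = flush          # callable taking a list of rows (hypothetical hook)
        self._max_batch = max_batch  # upper bound on rows per flush
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, row):
        """Called from the bot loop; only enqueues, so it returns immediately."""
        self._queue.put(row)

    def close(self):
        """Flush everything still queued, then stop the worker."""
        self._queue.put(self._STOP)
        self._thread.join()

    def _run(self):
        stopping = False
        while not stopping:
            item = self._queue.get()  # block until there is work
            batch = []
            while True:
                if item is self._STOP:
                    stopping = True
                    break
                batch.append(item)
                if len(batch) >= self._max_batch:
                    break
                try:
                    item = self._queue.get_nowait()  # take whatever else is waiting
                except queue.Empty:
                    break
            if batch:
                self._flush(batch)


def demo():
    batches = []  # stands in for the database: each entry is one batch write
    writer = BackgroundBatchWriter(batches.append, max_batch=3)
    for i in range(7):
        writer.submit((i, "row"))
    writer.close()
    return batches
```

How the rows get split into batches depends on timing, but every row is written once, in order, and no batch is larger than max_batch. Whether this is enough on its own would have to be measured against the real database.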