Rebinning Tick Data for FX Algo Traders

If you work, or intend to work, with FX data in order to build and backtest your own FX models, the historical tick data from Pepperstone.com is probably the best place to kick off your algorithmic experience. As of now, they offer tick-data sets for the 15 most frequently traded currency pairs, reaching back to May 2009. Some of the unzipped files (one month of data) exceed 400 MB in size, i.e. they store 8.5+ million lines of tick-resolution bid and ask prices. The good news is that you can download them all free of charge and their quality is regarded as very high. The bad news is that the data become accessible with a 3-month delay.

Rebinning that tick data, however, is a different story, and it is the subject of this post. As an example, we will see how efficiently you can turn Pepperstone's tick-data set(s) into a 5-min time series. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone's historical tick data from their website, month by month, pair by pair. All the files follow the same inner structure, namely:

$ head AUDUSD-2014-09.csv 
AUD/USD,20140901 00:00:01.323,0.93289,0.93297
AUD/USD,20140901 00:00:02.138,0.9329,0.93297
AUD/USD,20140901 00:00:02.156,0.9329,0.93298
AUD/USD,20140901 00:00:02.264,0.9329,0.93297
AUD/USD,20140901 00:00:02.265,0.9329,0.93293
AUD/USD,20140901 00:00:02.265,0.93289,0.93293
AUD/USD,20140901 00:00:02.268,0.93289,0.93295
AUD/USD,20140901 00:00:02.277,0.93289,0.93296
AUD/USD,20140901 00:00:02.278,0.9329,0.93296
AUD/USD,20140901 00:00:02.297,0.93288,0.93296

The columns, from left to right, are: the pair name, the date and tick time, the bid price, and the ask price.
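For orientation, here is a minimal Python sketch that unpacks one such line (the variable names are illustrative; strptime's %f accepts the millisecond fraction):

from datetime import datetime

line = "AUD/USD,20140901 00:00:01.323,0.93289,0.93297"
pair, stamp, bid, ask = line.split(",")
tick_time = datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")
print(pair, tick_time, float(bid), float(ask))
# AUD/USD 2014-09-01 00:00:01.323000 0.93289 0.93297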

Pre-Processing

Here, for each .csv file, we aim to split the date into separate year, month, and day fields, and to remove commas and colons, so that the raw data are ready to be read in as a matrix (array) in any programming language (e.g. Matlab or Python). A matrix is a mathematically intuitive data structure, and direct access to any of its columns lets a backtesting engine run at full thrust.

Let’s play with AUDUSD-2014-09.csv data file. Working in the same directory where the file is located we begin with writing a bash script (pp.scr) that contains:

# pp.scr
# Rebinning Pepperstone.com Tick-Data for FX Algo Traders 
# (c) 2014 QuantAtRisk, by Pawel Lachowicz

clear
echo "..making a sorted list of .csv files"
for i in $1-*.csv; do echo ${i##$1-} $i ${i%.csv};
done | sort -n | awk '{print $2}' > $1.lst

python pp.py
head AUDUSD.pp

that you run in Terminal:

$ chmod +x pp.scr
$ ./pp.scr AUDUSD

where the first command makes the script executable (you need to perform this task only once). Lines #7-8 of our script look for all .csv data files in the local directory with the AUDUSD- prefix and store their sorted list in the AUDUSD.lst file. Since we work with the AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain a single entry, as expected:

$ cat AUDUSD.lst 
AUDUSD-2014-09.csv
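Incidentally, you can build the same sorted list in pure Python (a sketch using glob; because the filenames embed zero-padded YYYY-MM, a plain lexicographic sort is already chronological):

# make_list.py -- a hypothetical Python stand-in for lines #7-8 of pp.scr
import glob

pair = "AUDUSD"                             # plays the role of $1
files = sorted(glob.glob(pair + "-*.csv"))  # e.g. ['AUDUSD-2014-09.csv']
with open(pair + ".lst", "w") as f:
    f.write("\n".join(files) + "\n")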

Next, we utilise the power and flexibility of Python in the following way:

# pp.py
import csv

fnlst="AUDUSD.lst"
fnout="AUDUSD.pp"

with open(fnout,'w') as f:  # open the output once; opening with 'w' inside
    writer=csv.writer(f,delimiter=" ")  # the loop would overwrite it per input file

    for lstline in open(fnlst,'r').readlines():
        fncur=lstline[:-1]  # strip the trailing newline
        #print(fncur)

        i=1 # counts the number of lines with tick-data
        for line in open(fncur,'r'):  # iterate the file object: line by line, low memory
            if(i<=5200): # replace with (i>0) to process an entire file
                #print(line)
                year=line[8:12]
                month=line[12:14]
                day=line[14:16]
                hh=line[17:19]
                mm=line[20:22]
                ss=line[23:29]
                bidask=line[30:]
                writer.writerow([year,month,day,hh,mm,ss,bidask])
                i+=1

This is a memory-friendly way to open a really big file and process it line by line (iterating over the file object avoids loading the whole file at once). For display purposes only, the code processes just the first 5,200 lines. The output of lines #10-11 of pp.scr is the following:

2014 09 01 00 00 01.323 "0.93289,0.93297
"
2014 09 01 00 00 02.138 "0.9329,0.93297
"
2014 09 01 00 00 02.156 "0.9329,0.93298
"
2014 09 01 00 00 02.264 "0.9329,0.93297
"
2014 09 01 00 00 02.265 "0.9329,0.93293
"

since we allowed Python to save the bid and ask information as a single string (due to the variable number of decimal digits), together with each line's trailing newline. In order to clean up this mess we continue:

# pp.scr (continued)
echo "..removing token: comma"
sed 's/,/ /g' AUDUSD.pp > $1.tmp
rm AUDUSD.pp

echo "..removing token: double quotes"
sed 's/"/ /g' $1.tmp > $1.tmp2
rm $1.tmp

echo "..removing empty lines"
sed -i '/^[[:space:]]*$/d' $1.tmp2
mv $1.tmp2 AUDUSD.pp

echo "head..."
head AUDUSD.pp
echo "tail..."
tail AUDUSD.pp

which brings us to the pre-processed data:

..removing token: comma
..removing token: double quotes
..removing empty lines
head...
2014 09 01 00 00 01.323  0.93289 0.93297
2014 09 01 00 00 02.138  0.9329 0.93297
2014 09 01 00 00 02.156  0.9329 0.93298
2014 09 01 00 00 02.264  0.9329 0.93297
2014 09 01 00 00 02.265  0.9329 0.93293
2014 09 01 00 00 02.265  0.93289 0.93293
2014 09 01 00 00 02.268  0.93289 0.93295
2014 09 01 00 00 02.277  0.93289 0.93296
2014 09 01 00 00 02.278  0.9329 0.93296
2014 09 01 00 00 02.297  0.93288 0.93296
tail...
2014 09 02 00 54 39.324  0.93317 0.93321
2014 09 02 00 54 39.533  0.93319 0.93321
2014 09 02 00 54 39.543  0.93318 0.93321
2014 09 02 00 54 39.559  0.93321 0.93321
2014 09 02 00 54 39.784  0.9332 0.93321
2014 09 02 00 54 39.798  0.93319 0.93321
2014 09 02 00 54 39.885  0.93319 0.93325
2014 09 02 00 54 39.886  0.93319 0.93321
2014 09 02 00 54 40.802  0.9332 0.93321
2014 09 02 00 54 48.829  0.93319 0.93321
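At this point the file is exactly the matrix promised in the pre-processing section; with numpy (assuming you have it installed) a three-line sanity check confirms it:

# the pre-processed file loads directly as a (ticks x 8) matrix
import numpy as np

a = np.loadtxt("AUDUSD.pp")    # columns: Y M D hh mm ss bid ask
bid, ask = a[:, 6], a[:, 7]
print(a.shape, bid[0], ask[0])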

Personally, I love this part: you learn how to do simple but necessary text-file operations by typing single lines of Unix/Linux commands. Good luck to those who try to do the same in Microsoft Windows in under 30 seconds.
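(To be fair to non-Unix readers: the same cleanup can be done portably in a few lines of Python. A sketch, writing to a hypothetical AUDUSD.pp.clean file:)

# clean.py -- a portable stand-in for the three sed calls above
with open("AUDUSD.pp") as fin:
    text = fin.read().replace(",", " ").replace('"', " ")
with open("AUDUSD.pp.clean", "w") as fout:
    for line in text.splitlines():
        if line.strip():              # drop empty lines
            fout.write(line + "\n")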

Rebinning: 5-min Data

There are many schools of rebinning; for some people it is an art. We just want to get the job done. I opt for simplicity and for understanding the data we deal with. Imagine we have two adjacent 5-min bins with a tick history of trading:

[Figure: tick prices across two adjacent 5-min bins; a red marker denotes the price we want to assign to the bin boundary, and blue markers the last tick before and the first tick after it]
We want to derive the closest possible (or fairest) price estimate every 5 min, denoted in the figure above by the red marker. The old-school approach is to take the average over a number (larger than 5) of tick data points to the left and to the right of the boundary. That tends to under- or overestimate the mid-price.
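A sketch of that old-school estimate (the window size k is hypothetical; nothing in the data fixes it):

# old-school boundary price: mean of k tick mid-prices on each side of index j
import numpy as np

def old_school_price(bid, ask, j, k=8):
    # j is the index of the first tick after the boundary; assumes k <= j
    mid = (np.asarray(bid) + np.asarray(ask)) / 2.0
    return float(np.mean(mid[j-k:j+k]))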

If we trade live, every 5 min we receive the last tick before the minute hits a multiple of 5, and we wait for the first tick after it (the blue markers). Taking the average of their prices (the mid-price) makes the most sense. The precision we are after here is sometimes $10^{-5}$. That matters little if our position is small, but if it is not, the mid-price may start to play a crucial role.
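In symbols, if $b_1, a_1$ denote the bid and ask of the last tick before the 5-min boundary and $b_2, a_2$ those of the first tick after it, the code below reports the single price $p = \frac{1}{2}\left(\frac{b_1+b_2}{2}+\frac{a_1+a_2}{2}\right)$, i.e. the mid of the two mid-prices.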

The con of the old-school approach: it neglects the possibly high volatility among all the tick data within the last 5 minutes.

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

# pp2.py
import csv
import numpy as np

def convert(data):
    # turn a list of string tokens into a (1 x n) numpy array of floats
    tempDATA = []
    for i in data:
        tempDATA.append([float(j) for j in i.split()])
    return np.array(tempDATA).T

fname="AUDUSD.pp"

with open(fname) as f:
    data = f.read().splitlines()

#print(data)

i=1
for d in data:
    tokens=[s for s in d.split(' ')]
    #print(tokens)
    # remove empty elements from the list
    dd=[x for x in tokens if x]
    #print(dd)
    tmp=convert(dd)
    #print(tmp)
    if(i==1):
        a=tmp
        i+=1
    else:
        a = np.vstack([a, tmp])
        i+=1

N=i-1
#print("N = %d" % N)

# print the first line (anchored at row a[1]; seconds set to zero,
# the price taken as the mid of its bid and ask)
tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2])
print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
             (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6]))
m=tmp

# check the boundary conditions (5 min bins): rows i-1 and i straddle a
# boundary when the previous tick's minute is not a multiple of 5 while
# the current tick's minute is (range replaces Python 2's xrange)
for i in range(2,N-1):
    if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):

        # BLUE MARKER No. 1
        # (print for i-1)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i-1][0],a[i-1][1],a[i-1][2],a[i-1][3],a[i-1][4],a[i-1][5],a[i-1][6],a[i-1][7]))
        b1=a[i-1][6]
        b2=a[i][6]
        a1=a[i-1][7]
        a2=a[i][7]
        # mid-price, and new date for the 5 min bin
        bm=(b1+b2)/2
        am=(a1+a2)/2
        Ym=a[i][0]
        Mm=a[i][1]
        Dm=a[i][2]
        Hm=a[i][3]
        MMm=a[i][4]
        Sm=0.0        # set seconds to zero

        # RED MARKER
        print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
              (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2))
        tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2])
        m=np.vstack([m, tmp])

        # BLUE MARKER No. 2
        # (print for i)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i][0],a[i][1],a[i][2],a[i][3],a[i][4],a[i][5],a[i][6],a[i][7]))

which you run from the pp.scr file as:

# pp.scr (continued)

python pp2.py > AUDUSD.dat

in order to get the 5-min rebinned FX time series as follows:

$ head AUDUSD.dat
2014  9  1  0  0  0.000   0.932935
2014  9  1  0  5  0.000   0.933023
2014  9  1  0 10  0.000   0.932917
2014  9  1  0 15  0.000   0.932928
2014  9  1  0 20  0.000   0.932937
2014  9  1  0 25  0.000   0.933037
2014  9  1  0 30  0.000   0.933075
2014  9  1  0 35  0.000   0.933070
2014  9  1  0 40  0.000   0.933092
2014  9  1  0 45  0.000   0.933063

That concludes our efforts. Happy rebinning!
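As a footnote, if you prefer an off-the-shelf route, pandas (assuming you have it) gets you a comparable 5-min series in a few lines. Note the different convention, though: resample().last() takes the last tick of each bin rather than the boundary mid-price computed above.

# pandas_rebin.py -- an alternative sketch, not a reproduction of pp2.py
import pandas as pd

df = pd.read_csv("AUDUSD-2014-09.csv", header=None,
                 names=["pair", "time", "bid", "ask"])
df["time"] = pd.to_datetime(df["time"], format="%Y%m%d %H:%M:%S.%f")
df["mid"] = (df["bid"] + df["ask"]) / 2
five_min = df.set_index("time")["mid"].resample("5min").last()
print(five_min.head())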

4 comments
  1. Nice article, as always, Pawel!
    Are you using bash because it is faster than processing everything in Python, or only to show us what we can do with this tool? I'm trying to store FX data in a database (MySQL at the moment) and I don't know which layout would be best. I'm wondering whether I should create a "fxdata" database with one table per pair (EURUSD, GBPUSD…) and filter by time frame in the query, or create a separate table per pair and time frame, e.g. EURUSDM1, EURUSDM5, etc. Could you give some tips?

    Regards

    1. It's obvious you can do everything in Python, but I'm a Linux lover and bash is my second nature. From an educational point of view, doing many things in bash is really cool, which is why I brought up this couple of ready-to-use solutions for text-file processing.
