Covariance Matrix for N-Asset Portfolio fed by Nasdaq Data Link in Python

A construction of your quantitative workshop in Python requires a lot of coding or at least spending a considerable amount of time assembling different blocks together. There are many simple fragments of code reused many times. The calculation of covariance matrix is not a problem once NumPy is engaged but the meaning is derived once you add some background idea what you try to achieve.

Let’s see today how to use data provider in construction of any $N$-Asset Portfolio based on SEC securities and for return-series we may calculate a corresponding covariance matrix.




NASDAQ Data Link and SEC Stock List

NASDAQ Data Link is a great source of data. With their ambition to become the largest data provider on the planet free of charge, no doubt they do an amazing job. You can use their Python API to feed your code directly with Open, High, Low, Close for any SEC stock. In the beginning we will need a list of companies (tickers) and, unfortunately, the corresponding internal call-tickers as referred to by NASDAQ Data Link. The .csv file containing all information you can download from NASDAQ Data Link website. The file contains the following header:

Ticker       Name       Sector       Industry       Price       Collection

where we are interested in matching Ticker with Price field. The latter for AAPL stock displays “WIKI/AAPL” code. That’s all we need for now to grab.

Our Portfolio

Let’s say we have a freedom of choice to select any of 2277 stock data from our SEC universe (provided in secwiki_tickers.csv file). For the sake of simplicity I’ll select ony three and save them in a plain text file of portfolio.lst containing:

AAPL
IBM
TXN

I do it on purpose in order to show you how easily we can read in this list from a file in Python. As usual, we start our adventure from data pre-processing part:

# Covariance Matrix for N-Asset Portfolio fed by NASDAQ Data Link in Python
# (c) 2014 QuantAtRisk, by Pawel Lachowicz

import Quandl
import numpy as np
import pandas as pd

df=pd.read_csv('secwiki_tickers.csv')

dp=pd.read_csv('portfolio.lst',names=['pTicker'])
pTickers=dp.pTicker.values  # converts into a list

In order to install Quandl module in your Mac/Linux environment simply type pip3 install Quandl (for more information see here). In the code above, we employ pandas’ read_csv function both for reading in the data from .csv file as well as from a plain text file. For the latter, line #10, we add a name of the column, pTicker, to point at portfolio tickers, and next we convert pandas’ DataFrame object into a Python list.

Now, we gonna use the power of Python over Matlab in search for the corresponding Quandl (Price) ticker code:

tmpTickers=[]
for i in range(len(pTickers)):
    test=df[df.Ticker==pTickers[i]]
    if not(test.empty):
        tmp=test.Price.values+'.4'  # of <type 'numpy.ndarray'>
        tmp2=tmp.tolist()
        tmpTickers.append(tmp2)

print(tmpTickers)

This is executed in line #15 almost automatically. The result of the search is the DataFrame record (empty or containing the corresponding information on the stock as read out from (df) .csv data source. Simply, if the ticker of the security in our portfolio.lst file does not exist, it is skipped. Quandl allows you to retrieve information on stock’s Open, High, Low, Close, Volume by calling (Price) Quandl ticker in the following form, respectively:

'WIKI/AAPL.1'
'WIKI/AAPL.2'
'WIKI/AAPL.3'
'WIKI/AAPL.4'
'WIKI/AAPL.5'

Below we will stick to Close prices. That is why, in line #17, we add ‘.4’ to the string. Please note that we should get the same data by calling ‘GOOG/NASDAQ_AAPL.4’ and ‘WIKI/AAPL.4’ Quandl tickers. Because a variable tmp is of numpy.ndarray type, in line #18, we convert it into a list type. The final list of tmpTickers contains therefore all corresponding Quandl tickers in the form:

[['WIKI/AAPL.4'], ['WIKI/IBM.4'], ['WIKI/TXN.4']]

and before they can be used for return-series retrieval from the NASDAQ Data Link database, we need to scalp each item a bit in the following way:

tmp=[]
for i in range(len(tmpTickers)):
    tmp2=str(tmpTickers[i]).strip('[]')
    print(tmp)
    tmp.append(str(tmp2).strip('\'\''))

QuandlTickers=tmp
print(QuandlTickers)

what returns:

['WIKI/AAPL.4', 'WIKI/IBM.4', 'WIKI/TXN.4']

Return-Series Download

We fetch the data pretty easily:

data= Quandl.get(QuandlTickers, authtoken='YourAuthToken', \
                 trim_start='2014-10-01', trim_end='2014-11-04', \
                 transformation='rdiff')

d=data.values.T
print(d)

where rdiff enforces price-series transformation into return-series expressed as row-on-row % change, y'[t] = (y[t] – y[t-1])/y[t-1] (see more here: Using the Quandl API). By default we download daily returns:

[[-0.01558313  0.00725953 -0.0028028   0.         -0.00873319  0.02075949
   0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282
   0.01464783  0.02139859  0.0271652   0.00507466  0.01786581  0.00372031
  -0.00104543  0.01550756  0.00562114 -0.00335383  0.00953449  0.01296296
  -0.00731261]
 [-0.01401254 -0.00138911  0.0094163   0.0019611  -0.01761532  0.0196543
  -0.01552598 -0.00262847 -0.01296187  0.00152572 -0.01115343 -0.01050894
   0.0122887  -0.0711343  -0.03471319 -0.00882191  0.00241053 -0.0006166
  -0.00129566  0.01068759 -0.00085575  0.00544476  0.00030423 -0.00024331
  -0.01040399]
 [-0.01698469         nan         nan -0.00416489 -0.01319035  0.020213
  -0.01959949 -0.07127336 -0.0189518   0.00906272  0.01063578  0.01941066
   0.00183528  0.01694527  0.05314118 -0.00320718  0.00815101  0.01212766
   0.00798823  0.01147028 -0.00350515 -0.01655287  0.0448138   0.00845751
   0.00638978]]

You may notice some missing data in the form denoted by nan. It may happen for any dataset. Life is not perfect. Quandl is only an option. In the first approximation we may fill the missing returns with zeros using pandas:

for i in range(d.shape[0]):
    df=pd.DataFrame(d[i])
    df.fillna(0,inplace=True)
    d[i]=df.values.T

print(d)

what works quickly and efficiently:

[[-0.01558313  0.00725953 -0.0028028   0.         -0.00873319  0.02075949
   0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282
   0.01464783  0.02139859  0.0271652   0.00507466  0.01786581  0.00372031
  -0.00104543  0.01550756  0.00562114 -0.00335383  0.00953449  0.01296296
  -0.00731261]
 [-0.01401254 -0.00138911  0.0094163   0.0019611  -0.01761532  0.0196543
  -0.01552598 -0.00262847 -0.01296187  0.00152572 -0.01115343 -0.01050894
   0.0122887  -0.0711343  -0.03471319 -0.00882191  0.00241053 -0.0006166
  -0.00129566  0.01068759 -0.00085575  0.00544476  0.00030423 -0.00024331
  -0.01040399]
 [-0.01698469  0.          0.         -0.00416489 -0.01319035  0.020213
  -0.01959949 -0.07127336 -0.0189518   0.00906272  0.01063578  0.01941066
   0.00183528  0.01694527  0.05314118 -0.00320718  0.00815101  0.01212766
   0.00798823  0.01147028 -0.00350515 -0.01655287  0.0448138   0.00845751
   0.00638978]]

Having data is better than missing them. It is important to check how many nans every return-series has. In our case of TXN missing information constitutes 8% at 25 data points. A good indicator is to have less than 3%. Otherwise an impact on any further calculations may be significantly propagated. In this point it is good to recall Bruce Lee who said: “Empty your mind, be formless, shapeless, like water. If you put water into a cup, it becomes the cup. If you put water in a bottle, it becomes the bottle. If you put it in a teacup, it becomes the teacup. Now, water can flow, and it can crush. Be water, my friend.” Two added zeros contribute to the TXN distribution of returns in its central part, therefore the more we “fill in” with zeros, a fit of normal distribution flattens the tails (a distribution becomes more peaked). Therefore, in such case, I would agree with Bruce only on the “crush” part of his wisdom.

Covariance Matrix

Keeping above short note on some dirty tricks in mind, we obtain the desired covariance matrix simply and painfully,

covmat=np.cov(d)
print(covmat)

id est,

[[  1.44660397e-04  -2.13905277e-05   1.31742330e-04]
 [ -2.13905277e-05   3.12211511e-04  -5.62146677e-05]
 [  1.31742330e-04  -5.62146677e-05   5.44348868e-04]]

what accomplishes our efforts.

Related QaR Posts:
   2-Asset Portfolio Construction
   Create a Portfolio of Stocks based on Google Finance Data fed by Quandl




Leave a Reply

Your email address will not be published. Required fields are marked *