In quantitative finance we are used to measuring direct linear correlations or non-linear cross-bicorrelations among various time-series. For the former, by default, one adopts the Pearson product-moment correlation coefficient to quantify a linear relationship between two vectors. This is appropriate if the data follow a Gaussian distribution. Otherwise, rank correlation methods need to be applied (e.g. Spearman's or Kendall's). A good diversification of assets kept in an investment portfolio often relies on correlation measures: we want to limit the risk of losing too much due to highly correlated (co-moving in the same direction) assets. While correlation measures of any kind are powerful tools in finance, is there anything better than that?
Remarkably, it is quantitative biology that brings a new weapon to the table. Biologists love in-depth data analyses and have devoted a lot of time to the study of biological samples. The samples can differ widely in origin or composition, yet when it comes to counting, comparing and classifying, the underlying mathematics luckily shares a common denominator with the analysis of financial data samples.
In biology people are more inclined towards talking about similarities, thus similarity metrics. Similarity metrics — also referred to as correlation metrics — are applied to two or more objects (e.g. DNA sequences, etc.). The similarity metric will quantify an association the objects have with each other. This quantification could be a variety of measurements, such as how often the objects are involved in a similar process, how likely the objects are to appear in the same location, etc. The value representing the quantified correlation is often referred to as the similarity coefficient, or the correlation coefficient. This similarity coefficient is a real-valued number that describes to what extent the objects are related.
In this article we take a closer look at a similarity measure commonly used in biology for vector classification: the Czekanowski Index. We start from the theory of Pearson's and Spearman's correlation coefficients, their assumptions and limitations. Next, we introduce a similarity measure as an alternative tool for quantifying the relationship between two vectors (data samples) and show how to adopt it for financial N-Asset Portfolio analysis. We develop Python code for a clear and concise application and illustrate the functionality on a 39-Crypto-Asset Portfolio during a micro flash-crash in May 2021. We conclude that the proposed Average Rolling Similarity measure may be regarded as a strong alternative to classical correlation measures, beating their performance in certain market conditions.
1. Galton-Pearson Product-Moment Correlation
The history of what we know today under the common term of correlation dates back to 1885. In that year, Sir Francis Galton, the British biometrician, was the first to refer to regression. He published a bivariate scatterplot with normal isodensity lines, the first graph of correlation.
In 1888, Galton noted that $r$ measures the closeness of the “co-relation” and suggested that $r$ could not be greater than 1 (although he had not yet recognised the idea of negative correlation). Seven years later, Pearson (1895) developed the mathematical formula that is still most commonly used to measure correlation, the Pearson product-moment correlation coefficient. From a historical point of view, it seems more appropriate that $r$ be named the Galton-Pearson $r$:
$$
r = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{ \sqrt{ \sum_{i=1}^n (X_i - \bar{X})^2 \sum_{i=1}^n (Y_i - \bar{Y})^2 } }
$$ where $\bar{X}$ and $\bar{Y}$ are the sample means of $\mathbf{X}=\{X_1,…,X_n\}$ and $\mathbf{Y}=\{Y_1,…,Y_n\}$, respectively. As one can notice, the formula describes $r$ as the centred and standardised sum of the cross-products of the two variables. Lord and Novick (1968) give one of the first hints that the Cauchy-Schwarz inequality can be used to show that the absolute value of the numerator is less than or equal to the denominator, in other words that $−1 \le r \le 1$. Given that, the Galton-Pearson $r$ can be interpreted as the strength of the linear relationship between two variables.
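For completeness, the Cauchy-Schwarz step amounts to the following inequality for the centred samples:
$$
\left| \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y}) \right| \le \sqrt{ \sum_{i=1}^n (X_i - \bar{X})^2 } \sqrt{ \sum_{i=1}^n (Y_i - \bar{Y})^2 }
$$ so the absolute value of the numerator in the definition of $r$ indeed never exceeds the denominator.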
A perfect correlation (anti-correlation) is obtained for $r$ equal to $+1$ ($-1$), respectively, in which case the data points lie on a straight line. Both extreme values are approached as the relationship between the two variables tightens. Interestingly, $r$ can be computed as a measure of a linear relationship without any assumptions at all. In practice, however, everything depends on the data sample we analyse, which requires some minimum level of assumptions:
1. If $\mathbf{X}$ does not represent the whole population (due to random or non-uniform selection), then the derived $r$ might be biased or incorrect.
2. Both variables $\mathbf{X}$ and $\mathbf{Y}$ are continuous, jointly normally distributed, random variables. They follow a bivariate normal distribution in the population from which they were sampled.
That is why it is so important to check the normality of both samples before calculating the $r$ coefficient. Why? Because if there is a relationship between jointly normally distributed data, this relationship is always linear. Therefore, if what you observe in a scatter plot seems to lie close to some curve (i.e. not exactly a straight line), the assumption of a bivariate normal distribution is violated.
In the literature, various stratifications for the $r$ have been published on many occasions. It is safe to say that if $0.00\lt r \lt 0.10$ the linear correlation is negligible, if $0.10\le r \lt 0.40$ it is weak, if $0.40\le r \lt 0.70$ it is moderate, if $0.70\le r \lt 0.90$ it is strong, and if $0.90\le r \lt 1.00$ it is very strong (in terms of absolute magnitude, of course). However, due to the fact that samples can be inevitably affected by chance, the observed correlation may also not necessarily be a good estimate for the population correlation coefficient.
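If such a stratification is needed repeatedly, it can be wrapped in a small helper function. Below is a minimal sketch mirroring the thresholds quoted above; the function name and labels are our own choice:

def correlation_strength(r):
    # classify the absolute magnitude of a correlation coefficient
    # using the thresholds quoted in the paragraph above
    a = abs(r)
    if a < 0.10:
        return 'negligible'
    elif a < 0.40:
        return 'weak'
    elif a < 0.70:
        return 'moderate'
    elif a < 0.90:
        return 'strong'
    else:
        return 'very strong'

print(correlation_strength(-0.55))  # moderate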
Therefore, it is additionally advised that the derived coefficient of correlation $r$ should be always accompanied by a confidence interval, which provides the range of plausible values of the coefficient in the population from which the data were sampled. By specifying the confidence level at $1-\alpha$, e.g. $95\%$, where $\alpha$ denotes the significance level, in Python we can derive both Pearson’s $r$ and the corresponding confidence brackets as follows:
from scipy.stats import pearsonr, norm
import numpy as np

np.random.seed(1)

for n in [10, 100, 1000, 10000]:
    x = np.random.randn(n)
    y = np.random.randn(n)
    r1, r2 = np.corrcoef(x,y)[0,1], pearsonr(x,y)[0]
    if r1 == r1:  # guard against NaN (NaN != NaN)
        r = r1
    else:
        r = None
    r_z = np.arctanh(r)
    se = 1/np.sqrt(n-3)
    cl = 0.95  # confidence level
    alpha = 1 - cl
    z = norm.ppf(1-alpha/2)
    lo_z, hi_z = r_z-z*se, r_z+z*se
    lo, hi = np.tanh((lo_z, hi_z))
    print('n = %5g: r = %.10f [%13.10f,%13.10f]' % (n, r, lo, hi))
where $\mathbf{X}$ and $\mathbf{Y}$ samples, in this example, are drawn randomly from the standard Normal distribution and the results are tested in function of growing sample sizes:
n =    10: r = 0.6556177144 [ 0.0442628668, 0.9097178165]
n =   100: r = 0.1659786467 [-0.0314652772, 0.3509551991]
n =  1000: r = 0.0216530956 [-0.0403942097, 0.0835340468]
n = 10000: r = 0.0072949196 [-0.0123069100, 0.0268911447]
This brings us to a key observation we need to remember here. For samples of size $n=10$ and $n=10000$, respectively, the $95\%$ confidence intervals introduce different degrees of uncertainty:
$$
r(n=10) = 0.6556^{+0.2541}_{-0.6114} \ \ \ \ r(n=10000) = 0.0073^{+0.0196}_{-0.0196} \ \ .
$$ Therefore, the estimation of $r$ can be very misleading if given without the corresponding confidence brackets.
In the code, we used the fact that the 95% confidence interval is given by
$$
\tanh \left[\text{arctanh}(r) \pm \frac{1.96}{\sqrt{n-3}} \right]
$$ where $\text{arctanh}$ is the Fisher transformation. Once transformed, the sampling distribution of the estimate is approximately normal, so a 95% CI is found by taking the transformed estimate and adding and subtracting 1.96 times its standard error. The standard error is approximately equal to $(n-3)^{-1/2}$.
2. Spearman’s Rank-Order Correlation
In 1904 Charles Spearman developed a nonparametric version of the Pearson correlation, often referred to as Spearman’s $\rho$, which can be expressed as:
$$
\rho = \frac{\sum_{i=1}^n (R_i - \bar{R})(Q_i - \bar{Q})}{ \sqrt{ \sum_{i=1}^n (R_i - \bar{R})^2 \sum_{i=1}^n (Q_i - \bar{Q})^2 } }
$$ where the samples of $\mathbf{X}$ and $\mathbf{Y}$ have been replaced with the corresponding rank vectors of $\mathbf{R}$ and $\mathbf{Q}$ of length $n$. The trick here is that given $n$ elements of the vector $\mathbf{X}$, we rank them, say, from $1$ to $n$ where $1$ is the highest rank (score) and $n$ is the lowest one. In the case when we have two (or more) identical values, we need to take the average of the ranks that they would have otherwise occupied.
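A quick illustration of that tie handling, here with SciPy's rankdata function (note it assigns rank 1 to the smallest value rather than the largest; since both vectors are ranked consistently, $\rho$ is unaffected by the direction):

from scipy.stats import rankdata
import numpy as np

x = np.array([3.1, 5.7, 5.7, 1.2, 4.4])

# the two tied values (5.7) would occupy ranks 4 and 5,
# so each receives their average, i.e. 4.5
print(rankdata(x, method='average'))  # [2.  4.5 4.5 1.  3. ]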
A consideration of correlation $\rho$ based on rank vectors frees us from the examination of the actual values in the $\mathbf{X}$ and $\mathbf{Y}$ samples. In contrast to $r$, the coefficient $\rho$ quantifies strictly monotonic relationships between two variables. Ranking converts a nonlinear strictly monotonic relationship into a linear one, which also helps to reduce the influence of nasty outliers present in $\mathbf{X}$ or $\mathbf{Y}$.
According to the review of Bonett and Wright (2000), the ranks used in Spearman's method clearly do not follow a Normal distribution. The consequence is that the variance of the Fisher transformation $(\zeta)$ is no longer well approximated by $(n-3)^{-1}$. The following estimator of the variance has been proposed:
$$
\sigma_{\zeta}^2 = \frac{1+\rho^2/2}{n-3}
$$ from which the $(1-\alpha)$ confidence interval for Spearman's $\rho$ can be estimated as:
$$
\tanh \left[\text{arctanh}(\rho) \pm \sqrt{ \frac{1+\rho^2/2}{n-3} } z_{\alpha/2} \right]
$$ and $z_{\alpha/2}$ is the $\alpha/2$-quantile of the Standard Normal distribution.
Needless to say, Spearman's correlation coefficient satisfies $-1 \le \rho \le 1$, where $-1$, $0$ and $1$ represent a perfectly negative, absent, and perfectly positive monotonic relationship between the two vectors $\mathbf{X}$ and $\mathbf{Y}$, respectively.
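Putting the last two formulas to work, a minimal sketch of Spearman's $\rho$ together with its Bonett-Wright confidence interval might look as follows (the helper name spearman_ci is our own):

from scipy.stats import spearmanr, norm
import numpy as np

def spearman_ci(x, y, cl=0.95):
    # Spearman's rho with a confidence interval based on the Fisher
    # transformation and the Bonett-Wright variance estimator
    rho = spearmanr(x, y)[0]
    n = np.size(x)
    z = norm.ppf(1 - (1 - cl)/2)
    se = np.sqrt((1 + rho**2/2) / (n - 3))
    lo = np.tanh(np.arctanh(rho) - z*se)
    hi = np.tanh(np.arctanh(rho) + z*se)
    return rho, lo, hi

# usage on synthetic data
np.random.seed(2)
x = np.random.randn(100)
y = 0.5*x + np.random.randn(100)
print(spearman_ci(x, y))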
3. Czekanowski Index-based Similarity Measure
In order to quantify the likeness between two biological samples, Jan Czekanowski (1909, 1913) developed a metric used to quantify the amount of set intersection two (or more) vectors have with each other. This metric is known today as the Czekanowski Index, also referred to as the proportional similarity index, and is a quantitative version of a presence-absence similarity index (the Sørensen index).
Given two vectors $\mathbf{X}$ and $\mathbf{Y}$, the Czekanowski Index is defined as:
$$
c = \frac{2 \sum_{i=1}^n \min(X_i, Y_i) } { \sum_{i=1}^n (X_i+Y_i) }
$$ and takes values in the range $0 \le c \le 1$. The interpretation of $c$ close to unity is that both samples have a substantial overlap, while for $c\rightarrow 0$ there is very little overlap.
Now, what should we understand by a good or poor overlap between two samples $\mathbf{X}$ and $\mathbf{Y}$? Consider the following two Python examples. In the first one,
import numpy as np
import matplotlib.pyplot as plt

def czekanowski_index(x, y):
    if np.size(x) == np.size(y):
        return 2 * np.sum(np.minimum(x, y)) / np.sum(x + y)
    else:
        return None

par2 = 1
for par1 in [0, 0.1, 1, 10, 100, 1000]:
    x1 = np.arange(10)
    y1 = par2 * x1 + par1
    plt.plot(x1, y1, '.-', label='par$_1$ = ' + str(par1) + ', c = ' +
             str(np.round(czekanowski_index(x1, y1), 5)))

plt.xlabel('$x_1$'); plt.ylabel('$y_1$')
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
we examine the Czekanowski index-based similarity measure $c$ by making the $y_1$ variable linearly dependent on the $x_1$ variable and shifted by a constant (par$_1$). It is clearly visible that $c=1$ is obtained when there is a perfect overlap (i.e. an equality of all underlying values in both vectors). The farther we move from that ideal situation, the lower the observed similarity.
We can deduce that the Czekanowski index indeed measures a similarity feature between two vectors $\mathbf{X}$ and $\mathbf{Y}$. It does not assume anything about the distribution of either sample. It simply describes the degree to which they are alike, which is, in fact, a way of saying how much they are “correlated” (though not in the Pearson sense). Therefore, to perform a meaningful comparison of two time-series via $c$, we cannot use raw data (e.g. price time-series of different assets). Since the prices will be at different levels (they usually are), we need to apply a data transformation before calculating the index-based similarity measure:
$$
\mathbf{X}' = (\mathbf{X} - \bar{X}) / \sigma_{X}
$$ $$
\mathbf{Y}' = (\mathbf{Y} - \bar{Y}) / \sigma_{Y}
$$ where $\bar{X}$ and $\bar{Y}$ are the expected values of both vectors, and $\sigma_{X}, \sigma_{Y}$ denote their sample standard deviations (i.e. with one degree of freedom).
Lastly, a not-so-obvious feature of the Czekanowski index similarity measure is that it was designed solely for vectors with non-negative elements. If $\mathbf{X}'$ or $\mathbf{Y}'$ contain negative values, $c$ may fall outside the $[0,1]$ bracket. In order to avoid this situation, an additional data transformation is required in our case, i.e.
$$
\mathbf{X}'' = \mathbf{X}' - \min(\mathbf{X}')
$$ $$
\mathbf{Y}'' = \mathbf{Y}' - \min(\mathbf{Y}')
$$ which brings both samples to the same lower level.
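A brief numerical check of that point, reusing the czekanowski_index function defined earlier (the numbers are purely illustrative): with a negative element in one of the vectors the raw index falls outside $[0,1]$, while after subtracting the minima it stays inside the bracket.

x = np.array([-4.0, 1.0, 2.0])
y = np.array([ 4.0, 1.0, 2.0])

print(czekanowski_index(x, y))                      # -0.333..., outside [0, 1]
print(czekanowski_index(x - x.min(), y - y.min()))  #  0.133..., back inside [0, 1]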
Let’s analyse the following case study which illustrates that process best.
4. Correlation vs Similarity: 2-Crypto-Asset Portfolio Case Study
As usual, any theory remains only a nice theory until we frame it into its functional form. In this case study we will be working with pre-downloaded 1-min close price-series of 52 cryptocurrencies recorded between 2021-05-18 22:29 and 2021-05-23 22:29. The data are available in a file for your own inspection and experiments with the code.
In the pre-processing phase, we reduce the initial data sample down to 39 crypto time-series, i.e. to data that are sufficiently good for our further analysis. In other words, we can construct a 39-asset portfolio and examine, inter alia, the mutual correlation among all time-series. First things first:
# Similarity Index-Based Matrixes as the Alternative Correlation Measures
# in N-Asset Portfolio Analysis
#
# (c) 2021 QuantAtRisk.com

import numpy as np
import numpy.ma as ma
import pandas as pd
import pickle
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.dates import DateFormatter

# load time-series from database
with open('timeseries_1m_20210518_20210523.db', 'rb') as handle:
    database = pickle.load(handle)

# perform a handy data pre-processing
fi = True
for coin in database.keys():
    # construct a DataFrame using LEFT JOIN
    try:
        ts = database[coin]
        if ts.shape[0] > 1000:  # at least 1000 data points
            if fi:
                d = ts
                fi = False
            else:
                d = pd.merge(left=d, right=ts, how='left',
                             left_index=True, right_index=True)
    except:
        print(coin + ' ...skipped')

# extra data filtering; skipping five last coins
d = d.iloc[:, 0:d.columns.size-5]

print(d.columns.tolist())
where the last command prints all crypto-currency pairs at our disposal, i.e.
['BTCUSD', 'ETHUSD', 'ICPUSD', 'UNIUSD', 'BCHUSD', 'LTCUSD', 'LINKUSD', 'XLMUSD', 'ETCUSD', 'EOSUSD', 'FILUSD', 'AAVEUSD', 'MKRUSD', 'XTZUSD', 'ATOMUSD', 'COMPUSD', 'DASHUSD', 'YFIUSD', 'ZECUSD', 'SNXUSD', 'SUSHIUSD', 'GRTUSD', 'ENJUSD', 'BNTUSD', 'UMAUSD', 'OMGUSD', 'ZRXUSD', 'CRVUSD', 'LRCUSD', 'RENUSD', '1INCHUSD', 'KNCUSD', 'SKLUSD', 'STORJUSD', 'REPUSD', 'OXTUSD', 'CTSIUSD', 'BALUSD', 'NMRUSD']
Now, let’s select two pairs and store their data in a separate pandas’ DataFrame, e.g.:
# 2-Asset Portfolio case study

# time period selection
df = d[(d.index >= '2021-05-19 13:00') & (d.index <= '2021-05-19 14:00')]  # V-shape recovery

# pair selection
dfx = df[['BTCUSD', 'AAVEUSD']]

# data normalisation, column-wise
dfx_norm = (dfx - dfx.mean())/dfx.std(ddof=1)  # zero mean, unit standard deviation

# plotting
plt.figure(figsize=(7,7))
plt.plot(dfx_norm, '.-')
plt.grid()
plt.xlabel('Day of May 2021 and Time')
plt.ylabel('Normalized Price-Series')
plt.legend(dfx_norm.columns, bbox_to_anchor=(1.04,1), loc="upper left")
In this example, we limit the time-series to between 13:00 and 14:00 on May 19th because, in that window, nearly all USD pairs in cryptocurrency trading suffered a sudden drop in prices followed by a speedy recovery. For the BTCUSD and AAVEUSD pairs, the picture presents itself as follows:
The AAVEUSD time-series has three missing data points. Extracting both series as NumPy arrays,
x = dfx_norm['BTCUSD'].values
y = dfx_norm['AAVEUSD'].values

print(x); print(); print(y)
we can see the gaps clearly marked as nan in the ndarrays:
[ 1.18149577 0.93789057 0.89706829 1.04741679 1.05727853 1.24462026 1.08001835 1.08731697 1.06923452 1.05455913 1.26374985 1.11108829 1.03186619 1.09453745 1.1925453 1.05290248 0.96588164 0.87067135 0.49447189 0.41140512 0.58668147 0.29250165 0.25019464 0.15170231 0.1090671 0.00973081 -0.116862 -0.79042954 -0.68840511 -0.91691295 -1.73895354 -2.93000437 -3.47418155 -1.86453049 -1.22737792 -1.28353199 -1.11771103 -0.39464727 -0.43706367 -0.08794883 0.05750788 -0.19491194 -0.20466427 -0.12823972 -0.27702535 -0.77778589 -0.59150691 -0.74616894 -1.02228193 -0.92372708 -0.59011595 -0.38401972 -0.0125714 -0.10750038 0.24119249 0.20090159 0.33279255 0.22045315 0.19102423 0.19591602 0.04339512] [ 2.03296456 nan 1.73859267 1.69905018 1.53460363 1.67080554 1.36701878 1.27600829 1.10967876 1.10591281 1.19190203 1.02933846 0.87179615 0.72806233 0.70797726 0.5617328 nan 0.36402034 0.04391446 nan 0.12739305 -0.20714898 -0.20777664 -0.56554204 -0.62642492 -0.58186116 -0.48959535 -1.17688151 -1.13859434 -1.22897718 -1.73675298 -2.09012477 -2.850847 -2.03238018 -1.19571127 -0.78459489 -0.53604209 -0.07847897 0.13869091 0.35397781 0.17760574 -0.48896769 -0.26238294 -0.12367039 -0.1644682 -0.65153127 -0.22472342 -0.29878714 -0.47139325 -0.89757344 -0.74881835 -0.26991484 0.06713783 -0.05651092 0.27175453 0.36966927 0.51591372 0.18074403 0.30627575 0.35021185 0.29372258]
The problem with these gaps is that we are not able to use the standard Python functions for calculating the Pearson correlation coefficient, e.g.
# calculating the Pearson correlation coefficients
print(np.corrcoef(x,x)[0,1])  # prints 1.0
print(np.corrcoef(x,y)[0,1])  # prints nan
One solution here is to apply masked arrays provided by the numpy.ma module:
import numpy.ma as ma

xm = ma.masked_invalid(x)
ym = ma.masked_invalid(y)

print(xm); print(ym)
masked_array(data=[1.1814957742437018, 0.9378905666419012, 0.8970682915244457, 1.0474167932970262, 1.057278529743852, 1.2446202647488434, 1.0800183499184004, 1.087316972613595, 1.0692345176706763, 1.0545591285683429, 1.2637498454317555, 1.1110882897857455, 1.0318661946209682, 1.0945374515968562, 1.1925452951120192, 1.0529024818752182, 0.9658816442588193, 0.8706713456311101, 0.49447188760950095, 0.41140512181726485, 0.5866814676987163, 0.29250164972933834, 0.25019464408501574, 0.15170230955581698, 0.10906710032134771, 0.009730813703206812, -0.11686199963961426, -0.7904295390777554, -0.6884051087688648, -0.9169129512238609, -1.7389535433480152, -2.9300043719953113, -3.474181553203166, -1.864530488435614, -1.2273779187615235, -1.2835319901616173, -1.1177110334249118, -0.3946472668439693, -0.4370636736850074, -0.08794882622180944, 0.05750787918333031, -0.19491193912530289, -0.20466427437541332, -0.12823972409807832, -0.2770253516318699, -0.7777858864858553, -0.5915069059632487, -0.7461689406349324, -1.0222819324037717, -0.9237270829050195, -0.5901159478907174, -0.38401972202013435, -0.012571401684467895, -0.10750038294920389, 0.24119248846952307, 0.2009015905932542, 0.3327925476056072, 0.2204531473206486, 0.191024225404037, 0.1959160217714777, 0.04339512480696578], mask=[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False], fill_value=1e+20) masked_array(data=[2.032964555674623, --, 1.738592673190532, 1.699050181513565, 1.5346036288252256, 1.670805544601443, 1.3670187831466536, 1.276008286429822, 1.109678757947343, 1.1059128063590566, 1.1919020342915092, 1.0293384573973166, 0.871796149287496, 0.7280623303347062, 0.7079772551972032, 0.5617328018522273, --, 0.3640203434673936, 0.043914458463373315, --, 0.12739305200363962, -0.20714898075546498, -0.2077766393535115, -0.5655420402403589, -0.6264249242509218, -0.5818611637895829, -0.48959534987665804, -1.1768815147382308, -1.138594340257357, -1.2289771783761352, -1.7367529841962375, -2.090124774896752, -2.8508469957298326, -2.032380183876422, -1.1957112726796408, -0.7845948909587959, -0.5360420861321438, -0.0784789681558132, 0.1386909067684842, 0.35397780589864203, 0.1776057398474044, -0.48896769127861156, -0.2623829373836091, -0.12367038721519868, -0.1644681960882656, -0.6515312681728114, -0.22472342150078195, -0.29878713607034324, -0.47139325053329456, -0.8975734386072702, -0.7488183508701083, -0.26991484056017456, 0.06713782659111614, -0.05651091722416343, 0.27175452955446877, 0.36966927084981943, 0.5159137241947882, 0.18074403283763704, 0.3062757524470562, 0.35021185431035573, 0.29372258048611855], mask=[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False], fill_value=1e+20)
Then the correlation coefficient can be derived because all data pairs containing at least one invalid value are excluded from the calculation:
print(ma.corrcoef(ma.masked_invalid(x), ma.masked_invalid(x)).data[0,1])
print(ma.corrcoef(ma.masked_invalid(x), ma.masked_invalid(y)).data[0,1])
prints
1.0
0.9183340720533023
Python Tip: Using SciPy Correlation functions for gapped data
The developers of the SciPy library realised how serious the problem of missing data is and created a dedicated set of functions for masked arrays as well. You can find them all as part of scipy.stats.mstats and use them as follows:
from scipy.stats.mstats import pearsonr, spearmanr

r = pearsonr(ma.masked_invalid(x), ma.masked_invalid(y))[0]
rho = spearmanr(ma.masked_invalid(x), ma.masked_invalid(y))[0]
Since this all works but looks a bit too complex, in the further analysis we will make use of the masked arrays' output in order to create a common mask in the following way:
# a common mask
cmask = xm.mask + ym.mask
Given that, for the two crypto time-series samples of BTCUSD and AAVEUSD, we define a Czekanowski similarity function working on data transformed as described above, i.e.
def czekanowski_norm(x, y):
    if np.size(x) == np.size(y):
        x = (x - np.mean(x)) / np.std(x, ddof=1)
        y = (y - np.mean(y)) / np.std(y, ddof=1)
        x, y = x - np.min(x), y - np.min(y)
        return 2 * np.sum(np.minimum(x, y)) / np.sum(x + y)
    else:
        return None
which allows us to calculate the Pearson and Spearman correlation coefficients followed by the Czekanowski index-based similarity measure:
r = pearsonr(x[~cmask], y[~cmask])[0]
rho = spearmanr(x[~cmask], y[~cmask])[0]
c = czekanowski_norm(x[~cmask], y[~cmask])

print(r, rho, c)
0.9183340720533024 0.897136177673875 0.9067373639806205
These results show how close all three measures are to each other, though please keep in mind that $c \in [0,1]$. Both time-series display a very high degree of similarity, and they are also very highly correlated.
Interestingly, applying a Shapiro-Wilk test to verify or reject each sample's normality
# Shapiro-Wilk test
from scipy.stats import shapiro

def shapiro_wilk(z):
    if shapiro(z[~np.isnan(z)]).pvalue > 0.05:
        return 'normally distributed'
    else:
        return 'not normally distributed'

print('x: ' + shapiro_wilk(x[~cmask]))
print('y: ' + shapiro_wilk(y[~cmask]))
reveals that
x: not normally distributed
y: normally distributed
therefore, extra caution should be taken when quoting the Pearson correlation measure here (the normality assumption of Sect. 1 is violated).
5. N-Asset Portfolio Case Study: Rolling Average Correlation and Similarity Measures
By now, you should have a good understanding of the key properties of both correlation measures (Pearson's and Spearman's) in contrast to the Czekanowski index-based similarity measure, based on the analysed 2-asset portfolio case study. In this section we take one more step by extending these methods to an $N$-asset portfolio.
Earlier, we ended up with a selection of 39 assets for our portfolio. All relevant time-series for the discussed V-shaped crypto flash-crash and speedy recovery can be visualised as follows:
# N-Asset Portfolio Case Study

# data selection
df = d[(d.index >= '2021-05-19 13:00') & (d.index <= '2021-05-19 14:00')]  # v-shape
df_norm = (df-df.mean())/df.std(ddof=1)

plt.figure(figsize=(11,11))
plt.plot(df_norm)
plt.grid()
plt.xlabel('Day of May 2021 and Time')
plt.ylabel('Normalized Price-Series')
plt.legend(df_norm.columns, bbox_to_anchor=(1.04,1), loc="upper left")
In what follows, we perform the analysis of all time-series in a rolling window 15 minutes long with a 1-minute step. For each window selection, we normalise the time-series to have zero mean and unit standard deviation. Our goal is to fill pre-allocated empty matrixes of Pearson and Spearman correlations and the Czekanowski similarity measure with the corresponding derived coefficients. The double loop over all asset pairs does that job, where we use the approach of masked arrays as discussed in Sect. 4.
from scipy.stats import pearsonr, spearmanr

winsize = 15  # minutes

avg_pearson, avg_spearman, avg_czekanowski, timestamp = list(), list(), list(), list()

for k in range(df.index.size-winsize):
    # index slice
    idx = df.index[k:k+winsize]
    # extract a proper data slice
    ds = df.loc[idx,:]
    # normalise
    ds = (ds - ds.mean()) / ds.std(ddof=1)
    # size
    col_names = ds.columns
    n = col_names.size
    # allocate empty matrices
    cr = np.zeros((n, n))
    crho = np.zeros((n, n))
    cc = np.zeros((n, n))

    for i in range(n):
        for j in range(n):
            x = ds[col_names[i]].values
            y = ds[col_names[j]].values
            xm = ma.masked_invalid(x)
            ym = ma.masked_invalid(y)
            cmask = xm.mask + ym.mask  # a common mask
            x, y = x[~cmask], y[~cmask]
            # calculate correlation and similarity coefficients
            # if at least 7 data points are available for X and Y
            # at winsize set to 15 min
            if((x.size >= int(winsize/2)) and (y.size >= int(winsize/2))):
                cr[i,j] = np.corrcoef(x, y)[0,1]  # Pearson's r
                crho[i,j] = spearmanr(x, y)[0]    # Spearman's rho
                cc[i,j] = czekanowski_norm(x, y)  # Czekanowski's c
            else:
                cr[i,j] = np.nan
                crho[i,j] = np.nan
                cc[i,j] = np.nan

    # create correlation and similarity matrixes
    dcr = pd.DataFrame(cr, columns=col_names[0:n], index=col_names[0:n])
    dcrho = pd.DataFrame(crho, columns=col_names[0:n], index=col_names[0:n])
    dcc = pd.DataFrame(cc, columns=col_names[0:n], index=col_names[0:n])

    # compute the average correlation and similarity measures based on
    # the above matrixes
    #
    # for Pearson correlation matrix
    nd = dcr.values.shape[0]
    ne = nd * nd                        # number of matrix elements
    ni = np.sum(np.isnan(dcr.values))   # number of invalid elements
    nv = ne - ni - nd                   # number of valid elements
    avg_pearson.append((np.nansum(dcr.values) -
                        np.nansum(np.diagonal(dcr.values))) / nv)

    # for Spearman correlation matrix
    nd = dcrho.values.shape[0]
    ne = nd * nd                        # number of matrix elements
    ni = np.sum(np.isnan(dcrho.values)) # number of invalid elements
    nv = ne - ni - nd                   # number of valid elements
    avg_spearman.append((np.nansum(dcrho.values) -
                         np.nansum(np.diagonal(dcrho.values))) / nv)

    # for Czekanowski similarity matrix
    nd = dcc.values.shape[0]
    ne = nd * nd                        # number of matrix elements
    ni = np.sum(np.isnan(dcc.values))   # number of invalid elements
    nv = ne - ni - nd                   # number of valid elements
    avg_czekanowski.append((np.nansum(dcc.values) -
                            np.nansum(np.diagonal(dcc.values))) / nv)

    # create date&time index
    timestamp.append(str(idx[-1]))

    # store a copy of matrixes when the trough in prices is reached (precedent 15 min)
    if idx[-1] == '2021-05-19 13:32:00':
        dcr0 = dcr.copy()
        dcrho0 = dcrho.copy()
        dcc0 = dcc.copy()
After the inner double loop, we convert the correlation/similarity matrixes into pandas DataFrames and calculate the average correlation and similarity measures based on them. The average correlation (similarity) is computed as the sum of all non-NaN matrix elements minus the sum of the non-NaN diagonal elements, divided by the number of non-NaN elements excluding the diagonal. For every sliding window, the averages are obtained and stored in the form of new time-series (avg_pearson, avg_spearman, avg_czekanowski).
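The averaging step, which the listing above repeats for each of the three matrixes, can also be written as a small standalone helper. Below is a sketch of the intended quantity, the mean of all finite off-diagonal elements; the name avg_offdiagonal is our own:

import numpy as np

def avg_offdiagonal(m):
    # average of all finite (non-NaN) off-diagonal elements of a square matrix
    m = np.asarray(m, dtype=float)
    keep = ~np.eye(m.shape[0], dtype=bool) & ~np.isnan(m)
    return m[keep].mean()

# e.g. avg_pearson.append(avg_offdiagonal(dcr.values))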
Visualisation of the correlation/similarity matrixes over the last 15 min before the trough in prices was reached,
fig, ax = plt.subplots(figsize=(10, 10))
img_cm = ax.matshow(dcr0, cmap='RdBu')
plt.xticks(range(len(col_names)), col_names);
plt.yticks(range(len(col_names)), col_names);
plt.xticks(rotation=90)
plt.title('Pearson Correlation Matrix')
# create an axes on the right side of ax. The width of cax will be 5%
# of ax and the padding between cax and ax will be fixed at 0.2 inch
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
fig.colorbar(img_cm, cax=cax)
plt.show()

fig, ax = plt.subplots(figsize=(10, 10))
img_cm = ax.matshow(dcrho0, cmap='RdBu')
plt.xticks(range(len(col_names)), col_names);
plt.yticks(range(len(col_names)), col_names);
plt.xticks(rotation=90)
plt.title('Spearman Correlation Matrix')
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
fig.colorbar(img_cm, cax=cax)
plt.show()

fig, ax = plt.subplots(figsize=(10, 10))
img_cz = ax.matshow(dcc0, cmap='OrRd')
plt.xticks(range(len(col_names)), col_names);
plt.yticks(range(len(col_names)), col_names);
plt.xticks(rotation=90)
plt.title('Czekanowski Index-based Similarity Matrix')
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
fig.colorbar(img_cz, cax=cax)
plt.show()
reveals the picture of correlation on the steep way down towards the trough
whereas
# group all results into one DataFrame
results = pd.DataFrame(np.vstack((np.array(avg_pearson),
                                  np.array(avg_spearman),
                                  np.array(avg_czekanowski))).T,
                       index=pd.to_datetime(timestamp, format='%Y-%m-%d %H:%M:%S'),
                       columns=['AvgPearson', 'AvgSpearman', 'AvgCzekanowski'])

# plot
fig, ax = plt.subplots(figsize=(12,5))
ax.plot(df_norm)
ax.grid()
date_form = DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(date_form)
plt.title('Normalised 39 Time-Series')
_ = plt.xticks(rotation=30)

fig, ax = plt.subplots(figsize=(12,5))
ax.plot(results)
ax.grid()
plt.title('Rolling Average Czekanowski Similarity and Correlation Measures')
ax.legend(['Rolling Average %g-min Pearson Correlation' % winsize,
           'Rolling Average %g-min Spearman Correlation' % winsize,
           'Rolling Average %g-min Czekanowski Similarity' % winsize], loc=4)
date_form = DateFormatter('%H:%M')
ax.xaxis.set_major_formatter(date_form)
ax.set_xlim([df.index[0], df.index[-1]])
_ = plt.xticks(rotation=30)
allows us to plot the calculated rolling average Czekanowski similarity and correlation measures against the real (normalised) time-series patterns:
6. Conclusion from N-Asset Portfolio Case Study
These results bring us to some conclusions. First, the Czekanowski similarity measure as defined in this post works really well. The rolling average measure describes the similarity among 39 time-series in a way that allows us to consider it as an alternative correlation measure for N-asset portfolio analysis.
The amount of average overlap among cryptocurrencies in this peculiar market event was maintained at a very high level ($c$ reaching 0.925 in the trough), which independently supports the results of the Pearson and Spearman correlations. The co-movement in trading for the majority of the crypto-assets considered here followed the same dynamics and behaviour.
Interestingly, after 13:37 the average correlation dropped significantly while the similarity of trading patterns remained at a high level! This is quite surprising given our gut feelings built around the meaning of correlation in asset trading. Surprise, surprise! This opens up a new window of opportunities for the application of the Czekanowski index-based (average) similarity measure in the financial analysis of time-series and in trading.
DOWNLOAD
timeseries_1m_20210518_20210523.db
LITERATURE
Joseph Lee Rodgers; W. Alan Nicewander, 1988, Thirteen Ways to Look at the Correlation Coefficient, The American Statistician, Vol. 42, No. 1, pp. 59-66
Jesse Aaron Marks, 2017, A Review of Random Matrix Theory with an Application to Biological Data, MSc Thesis, Missouri University of Science and Technology
Patrick Schober; Christa Boer, 2018, Correlation Coefficients: Appropriate Use and Interpretation, Anesthesia & Analgesia
Bonett, Douglas G.; Thomas A. Wright, 2000, Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations, Psychometrika 65, no. 1: 23-28
Bishara, Anthony J.; James B. Hittner, 2017, Confidence Intervals for Correlations When Data Are Not Normal, Behavior Research Methods 49, no. 1: 294-309