When we talk about correlations in finance, by default we assume a linear relationship between two time-series “co-moving”. In other words, if one time-series changes its values over a given time period, we look for a tight, matching change reflected in the other time-series. If we find one, we say they are correlated. But wait a minute! Is the game always about looking for an immediate feedback between two time-series? Is “correlated” always a synonym for a direct, i.e. a linear, response?
Think for a second about oil prices. If they rise within a quarter, the effect is not the same for every company. Certain firms are tightly linked to oil/petroleum prices that affect their product value, distribution, sales, etc., while for other companies, e.g. online services, the change of the crude oil price has a negligible effect. Is that really so?
As a data scientist, quantitative (business) analyst, or investment researcher you need to know that mathematics and statistics deliver ample tools you may use to explore how two (or more) financial assets can be related, correlated, or connected. The wording does not need to be precise to reflect the goal of such an investigation: correlation, in general.
Correlation can be linear and non-linear. Non-linearity of correlation is somewhat counterintuitive. We are used to thinking in a linear way, which is why it is so hard to readjust our thinking to a non-linear domain. Changes of oil prices may have a non-negligible effect on airlines, causing air-ticket prices to rise or fall due to a recalculated oil/petroleum surcharge. The effect may not be immediate. A rise of oil prices today can be “detected” with a time delay of days, weeks, or even months, i.e. non-linearly correlated.
In this post we analyse both linear and non-linear correlation methods that have been applied (and not yet applied) in the research on oil price tectonics across the markets. In particular, we first discuss one-factor and multiple linear regression models, and then turn toward non-linearity that can be captured by cross-correlation and cross-bicorrelation methods. By developing an extensive but easy-to-follow Python code, we analyse a Quandl.com database storing over 620 financial factors (stock fundamentals). Finally, we show how both non-linear mathematical tools can be applied to detect non-linear correlations for different financial assets.
1. Linear Correlations
1.1. Oil Prices versus Stock Markets
Correlations between increasing/decreasing oil prices and stock markets have been a subject of investigation for a number of years. The main interest was initially in the search for linear correlations between raw price/return time-series of stocks and oil benchmarks (e.g. WTI Crude Oil). A justification standing behind such an approach to data analysis seemed to be driven by common sense: an impact of rising oil prices should affect companies dependent on petroleum-based products or services.
The usual presumption states that a decline in the oil price is good news for the economy, especially for net oil importers, e.g. the USA or China. An increase in oil prices usually raises the input costs for most businesses and forces consumers to spend more money on gasoline, thereby reducing the corporate earnings of other businesses. The opposite should be true when oil prices fall.
Oil prices are determined by supply and demand for petroleum-based products. During an economic expansion, prices might rise as a result of increased consumption; they might fall as a result of increased production. That pattern has been observed on many occasions so far.
A quantitative approach to measuring the strength of the linear response of the stock market to the oil price can be demonstrated using plain linear regression modelling. If we agree that the S&P 500 Index is a proxy for the US stock market, we should be able to capture the relationship. In the following example, we compare WTI Crude Oil prices with the S&P 500 Index for the 2-year period between Nov 2014 and Nov 2016. First we plot both time-series, and next we apply a one-factor linear regression model,
$$
y(t) = \beta x(t) + \epsilon(t) \ \,
$$ in order to fit the data best:
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
from pandas_datareader import data as web   # fixed import: 'web' is used below
%matplotlib inline
grey = 0.7, 0.7, 0.7

import warnings
warnings.filterwarnings('ignore')

# downloading S&P 500 Index from Yahoo! Finance
sp500 = web.DataReader("^GSPC", data_source='yahoo', start='2014-11-01',
                       end='2016-11-01')['Adj Close']

# WTI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True)

# combine both time-series
data = pd.concat([wpi, sp500], axis=1).dropna()  # and remove NaNs, if any
data.columns = ["WPI", "S&P500"]
print(data.head())
              WPI       S&P500
2014-11-03  78.77  2017.810059
2014-11-04  77.15  2012.099976
2014-11-05  78.71  2023.569946
2014-11-06  77.87  2031.209961
2014-11-07  78.71  2031.920044
plt.figure(figsize=(8,8))

# plot time-series
ax1 = plt.subplot(2,1,1)
plt.plot(data.WPI, 'k', label="WTI Crude Oil")
plt.legend(loc=3)
plt.ylabel("WTI Crude Oil [U$/barrel]")

# fit one-factor linear model
slope, intercept, r2_value, p_value, std_err = stats.linregress(data.WPI,
                                                                data["S&P500"])

# best fit: model
xline = np.linspace(np.min(data.WPI), np.max(data.WPI), 100)
line = slope*xline + intercept

# plot scatter plot and linear model
ax2 = ax1.twinx()
plt.plot(data["S&P500"], 'm', label="S&P500")
plt.ylabel("S&P 500 Index [U$]")
plt.legend(loc="best")

plt.subplot(2,1,2)
plt.plot(data.WPI, data["S&P500"], '.', color=grey)
plt.plot(xline, line, '--')
plt.xlabel("WTI Crude Oil [U$/barrel]")
plt.ylabel("S&P 500 Index [U$]")
plt.title("R$^2$ = %.2f" % r2_value, fontsize=11)
Our Python code generates:
where we can see that the WTI Crude Oil price vs. S&P 500 linear correlation is weak, with $R^2 = 0.36$ only. This is a static picture.
As one might suspect, correlation is time-dependent. Therefore, it is wiser to check a rolling linear correlation for a given data window size. Below, we assume its length to be 50 days:
rollcorr = pd.rolling_corr(wpi, sp500, 50).dropna()

plt.figure(figsize=(10,3))
plt.plot(rollcorr)
plt.grid()
plt.ylabel("Linear Correlation (50 day rolling window)", fontsize=8)
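Note that pd.rolling_corr has been removed in later versions of pandas. Assuming the date-aligned data DataFrame built in the first listing, a sketch of the equivalent call with the modern rolling API is:

[code lang="python" gutter="false"]
# the same 50-day rolling correlation written with the newer pandas API;
# using the combined 'data' DataFrame avoids index-matching issues
rollcorr = data["WPI"].rolling(window=50).corr(data["S&P500"]).dropna()

plt.figure(figsize=(10,3))
plt.plot(rollcorr)
plt.grid()
plt.ylabel("Linear Correlation (50 day rolling window)", fontsize=8)
[/code]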
In 2008 Andrea Pescatori measured changes in the S&P 500 and oil prices in the same way as demonstrated above. He noted that the variables occasionally moved in the same direction at the same time, but even then, the relationship remained weak. He found no correlation at the 95% confidence level.
The lack of correlation might be attributed to factors like wages, interest rates, industrial metals, plastic, or computer technology that can offset changes in energy costs. On the other hand, corporations might have become increasingly sophisticated at reading the futures markets and much better at anticipating shifts in factor prices (e.g. a company should be able to switch production processes to compensate for added fuel costs).
No matter what we analyse, either oil prices vs. market indexes or vs. individual stock trading records, the linear correlation method is only able to measure a 1-to-1 response of $y$ to $x$. The lower the correlation coefficient, the less valid the linear model is as a descriptor of the true events and mutual relationships under study.
1.2. Multiple Linear Regression Model for Oil Price Changes
In February 2016 Ben S. Bernanke noted that a positive correlation of stocks and oil might arise because both are responding to an underlying shift in global demand. Using a simple multiple linear regression model he attempted to explain the daily changes of oil prices and concluded that over 90% of them could be attributed to changes in commodity prices (copper), the US dollar spot price, and the 10-yr Treasury interest rate, and over 95% after extending the model with daily changes in VIX:
$$
\Delta p_{oil, t} = \beta_1 \Delta p_{copp, t} + \beta_2 \Delta p_{10yr, t} + \beta_3 \Delta p_{USD, t} + \beta_4 \Delta p_{VIX, t} + \beta_0
$$ The premise is that commodity prices, long-term interest rate, and USD are likely to respond to investors’ perceptions of global and US demand for oil. Additionally, the presence of VIX is in line with the following idea: if investors retreat from commodities and/or stocks during periods of high uncertainty/risk aversion, then shocks to volatility may be another reason for the observed tendency of stocks and oil prices moving together.
The model, inspired by James Hamilton’s (2014) work, aims at measuring the effect of demand shifts on the oil market. For the sake of elegance, we can code it in Python as follows. First, we download all the necessary price-series to files and process them within pandas DataFrames (more on quantitative aspects of pandas for quants you will find in my upcoming book Python for Quants. Volume II.).
import numpy as np
import pandas as pd
from scipy import stats
import datetime
from datetime import datetime
from matplotlib import pyplot as plt
import statsmodels.api as sm
%matplotlib inline
grey = 0.7, 0.7, 0.7

import warnings
warnings.filterwarnings('ignore')

# Download WTI Crude Oil prices
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)
wpi.index = wpi.DATE
wpi.drop('DATE', axis=1, inplace=True)

# Copper Futures
# src: http://www.investing.com/commodities/copper-historical-data
cop = pd.read_csv("COPPER.csv", index_col=["Date"])
cop.index = pd.to_datetime(pd.Series(cop.index), format='%b %d, %Y')

# Nominal interest rate on 10-year Treasury bonds
# src: https://fred.stlouisfed.org/series/DGS10/
tb = pd.read_csv('DGS10.csv')
tb = tb[tb.DGS10 != "."]
tb.DGS10 = tb.DGS10.astype(float)
tb['DATE'] = pd.to_datetime(tb['DATE'], format='%Y-%m-%d')
tb.index = tb.DATE
tb.drop('DATE', axis=1, inplace=True)

# Trade Weighted U.S. Dollar Index
# src: https://fred.stlouisfed.org/series/DTWEXM
usd = pd.read_csv('DTWEXM.csv')
usd = usd[usd.DTWEXM != "."]
usd.DTWEXM = usd.DTWEXM.astype(float)
usd['DATE'] = pd.to_datetime(usd['DATE'], format='%Y-%m-%d')
usd.index = usd.DATE
usd.drop('DATE', axis=1, inplace=True)

# CBOE Volatility Index: VIX (VIXCLS)
# src: https://fred.stlouisfed.org/series/VIXCLS/downloaddata
vix = pd.read_csv('VIXCLS.csv')
vix = vix[vix.VALUE != "."]
vix.VALUE = vix.VALUE.astype(float)
vix['DATE'] = pd.to_datetime(vix['DATE'], format='%Y-%m-%d')
vix.index = vix.DATE
vix.drop('DATE', axis=1, inplace=True)
A plain use of the pd.read_csv function does not always work the way we expect, therefore some extra steps need to be implemented in order to convert the .csv file content to a DataFrame: a correct parsing of dates, conversion of prices from string to float, and replacement of the index with the data from the ‘DATE’ column followed by its removal.
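If you prefer a more compact import, many of those steps can be folded into a single pd.read_csv call. The snippet below is only a sketch for the FRED file layout (a DATE column plus a VALUE column with “.” marking missing observations), not the code used to generate the results further down:

[code lang="python" gutter="false"]
# compact FRED csv import:
#  - parse_dates converts the DATE column to datetime,
#  - index_col makes it the index,
#  - na_values="." maps FRED's missing-value marker to NaN so that
#    VALUE is read directly as float
wpi = pd.read_csv("DCOILWTICO.csv",
                  parse_dates=["DATE"],
                  index_col="DATE",
                  na_values=".").dropna()
[/code]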
As a side note, in order to download the copper data, at the website we need to manually mark the whole table, paste it into Excel or Mac’s Numbers, trim it if required, and save it down to a .csv file. A problem might appear as the dates are given in the “Nov 26, 2016” format. For this case, the following trick in Python saves a lot of time:
from datetime import datetime

expr = 'Nov 26, 2016'
datetime.strptime(expr, '%b %d, %Y')
datetime.datetime(2016, 11, 26, 0, 0)
which justifies the use of the %b descriptor (see the Python documentation on datetime date formats and parsing for more).
pandas DataFrames are a fast and convenient way to concatenate multiple time-series indexed by calendar dates. The following step ensures that all five price-series have data points on the same dates (if given); if any data point (for any time-series) is missing (NaN), the whole row is removed from the resulting DataFrame df by the .dropna() function:
df = pd.concat([wpi, cop, tb, usd, vix], join='outer', axis=1).dropna()
df.columns = ["WPI", "Copp", "TrB", "USD", "VIX"]
print(df.head(10))
              WPI   Copp   TrB      USD    VIX
2007-11-15  93.37  3.082  4.17  72.7522  28.06
2007-11-16  94.81  3.156  4.15  72.5811  25.49
2007-11-19  95.75  2.981  4.07  72.7241  26.01
2007-11-20  99.16  3.026  4.06  72.4344  24.88
2007-11-21  98.57  2.887  4.00  72.3345  26.84
2007-11-23  98.24  2.985  4.01  72.2719  25.61
2007-11-26  97.66  3.019  3.83  72.1312  28.91
2007-11-27  94.39  2.958  3.95  72.5061  26.28
2007-11-28  90.71  3.001  4.03  72.7000  24.11
2007-11-29  90.98  3.063  3.94  72.6812  23.97
Given all price-series and our model in mind, we are interested in running the regression on the daily changes in the natural logarithm of the crude oil price, copper price, 10-yr interest rate, USD spot price, and VIX. The use of log returns is a common practice in finance and economics. In contrast to the daily percent change, we derive the log returns in Python in the following way:
np.random.seed(7)
df = pd.DataFrame(100 + np.random.randn(100).cumsum(), columns=['price'])
df['pct_change'] = df.price.pct_change()
df['log_ret'] = np.log(df.price) - np.log(df.price.shift(1))
print(df.head())
        price  pct_change   log_ret
0  101.690526         NaN       NaN
1  101.224588   -0.004582 -0.004592
2  101.257408    0.000324  0.000324
3  101.664925    0.004025  0.004016
4  100.876002   -0.007760 -0.007790
Having that, we continue coding our main model:
headers = df.columns
dlog = pd.DataFrame()   # log returns DataFrame
for j in range(df.shape[1]):
    price = df.iloc[:, j]        # .iloc replaces the deprecated .ix
    dl = np.log(price) - np.log(price.shift(1))
    dlog.loc[:, j] = dl
dlog.dropna(inplace=True)
dlog.columns = headers
print(dlog.head())
                 WPI      Copp       TrB       USD       VIX
2007-11-16  0.015305  0.023727 -0.004808 -0.002355 -0.096059
2007-11-19  0.009866 -0.057047 -0.019465  0.001968  0.020195
2007-11-20  0.034994  0.014983 -0.002460 -0.003992 -0.044417
2007-11-21 -0.005968 -0.047024 -0.014889 -0.001380  0.075829
2007-11-23 -0.003353  0.033382  0.002497 -0.000866 -0.046910
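The same dlog DataFrame can also be obtained without the explicit loop; a vectorised sketch operating on the df DataFrame defined earlier:

[code lang="python" gutter="false"]
# vectorised daily log returns for all five columns at once
dlog = np.log(df / df.shift(1)).dropna()
[/code]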
Next, we run the multiple linear regression model for the in-sample data, here selected as the 2-year period between Nov 2012 and Nov 2014:
dlog0 = dlog.copy()

# in-sample data selection
dlog = dlog[(dlog.index > "2012-11-01") & (dlog.index < "2014-11-01")]

# define input data
y = dlog.WPI.values
x = [dlog.Copp.values, dlog.TrB.values, dlog.USD.values, dlog.VIX.values]

# Multiple Linear Regression using the 'statsmodels' library
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    # note: the regressors end up stacked in reverse order,
    # i.e. x1 = VIX, x2 = USD, x3 = TrB, x4 = Copp
    results = sm.OLS(y, X).fit()
    return results

# display a summary of the regression
print(reg_m(y, x).summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.105
Model:                            OLS   Adj. R-squared:                  0.098
Method:                 Least Squares   F-statistic:                     14.49
Date:                Tue, 29 Nov 2016   Prob (F-statistic):           3.37e-11
Time:                        11:38:44   Log-Likelihood:                 1507.8
No. Observations:                 499   AIC:                            -3006.
Df Residuals:                     494   BIC:                            -2985.
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0303      0.008     -3.800      0.000        -0.046    -0.015
x2            -0.3888      0.178     -2.180      0.030        -0.739    -0.038
x3             0.0365      0.032      1.146      0.252        -0.026     0.099
x4             0.2261      0.052      4.357      0.000         0.124     0.328
const      -3.394e-05      0.001     -0.064      0.949        -0.001     0.001
==============================================================================
Omnibus:                       35.051   Durbin-Watson:                   2.153
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               75.984
Skew:                          -0.392   Prob(JB):                     3.16e-17
Kurtosis:                       4.743   Cond. No.                         338.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
A complete model is given by the following parameters:
par = reg_m(y, x).params
print(par)
[ -3.02699345e-02 -3.88784262e-01 3.64964420e-02 2.26134337e-01 -3.39418282e-05]
or, recalling that reg_m stacks the regressors in reverse order (x1 = VIX, x2 = USD, x3 = 10-yr rate, x4 = copper), more formally as:
$$
\Delta p_{oil, t} = 0.2261 \Delta p_{copp, t} + 0.0365 \Delta p_{10yr, t} -0.3888 \Delta p_{USD, t} -0.0303 \Delta p_{VIX, t}
$$ The adjusted $R^2$ is close to zero and the coefficient of the 10-yr interest rate ($\beta_2$) is not statistically significant, with a p-value above 5%.
In order to use our model for “future” oil price prediction based on price movements in the commodity (copper), the long-term interest rate, the USD spot price, and the VIX index, we generate an out-of-sample data set and construct the estimation in the following way:
# out-of-sample log return DataFrame
dlog_oos = dlog0[dlog0.index >= '2014-11-01']

# the last WTI Crude Oil price in the in-sample period
last = df[df.index == "2014-10-31"].values[0][0]  # 80.53 USD/barrel

oilreg = pd.DataFrame(columns=["Date", "OilPrice"])  # regressed oil prices
par = reg_m(y, x).params  # parameters from the regression

for i in range(dlog_oos.shape[0]):
    x1 = dlog_oos.iloc[i]["VIX"]
    x2 = dlog_oos.iloc[i]["USD"]
    x3 = dlog_oos.iloc[i]["TrB"]
    x4 = dlog_oos.iloc[i]["Copp"]
    w = par[0]*x1 + par[1]*x2 + par[2]*x3 + par[3]*x4 + par[4]
    if(i == 0):
        oilprice = np.exp(w) * last
    else:
        oilprice = np.exp(w) * oilprice
    oilreg.loc[len(oilreg)] = [dlog_oos.index[i], oilprice]

oilreg.index = oilreg.Date
oilreg.drop('Date', axis=1, inplace=True)
where the oilreg DataFrame stores the dates and the regressed oil prices. Visualisation of the results
plt.figure(figsize=(10,5))
plt.plot(df.WPI, color=grey, label="WTI Crude Oil")
plt.plot(oilreg, "r", label="Predicted Oil Price")
plt.xlim(['2012-01-01', '2016-11-01'])  # in-sample & out-of-sample
plt.ylim([20, 120])
plt.ylabel("USD/barrel")
plt.legend(loc=3)
plt.grid()
delivers
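For completeness, the same reconstruction can be written without the explicit loop. The sketch below assumes the dlog_oos, par, and last variables defined above (recall that par is ordered as VIX, USD, TrB, Copp, const) and compounds the modelled log returns from the last in-sample price:

[code lang="python" gutter="false"]
# modelled daily log returns for the out-of-sample period...
w = (par[0]*dlog_oos["VIX"] + par[1]*dlog_oos["USD"] +
     par[2]*dlog_oos["TrB"] + par[3]*dlog_oos["Copp"] + par[4])

# ...compounded from the last in-sample oil price (index = out-of-sample dates)
oilreg2 = pd.DataFrame({"OilPrice": last * np.exp(w.cumsum())})
[/code]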
Bernanke’s multiple regression model points at the following idea: when stock traders respond to a change in oil prices, they do so not necessarily because the oil movement is consequential in itself. Our finding (for different data sets than Bernanke’s) points at a lack of strong dependency on VIX, and thus on traders’ activity. Again, as in the case of the one-factor regression model discussed in Section 1.1, the strength of the underlying correlations with all four considered factors remains variable in time.
2. Non-Linear Correlations
So far we have understood how fragile the quantification and interpretation of the linear correlation between two time-series can be. We agreed on its time-dependent character and on how the data sample size may influence the derived results. Linear correlation is a tool that captures a 1-to-1 feedback between two data sets. What if the response is lagged? We need to look for other tools good enough for this challenge. In this Section we will analyse two of them.
2.1. Cross-Correlation and Cross-Bicorrelation
If a change of one quantity has a noticeable effect on another one, this can be observed with or without a time delay. In time-series analysis we talk about one time-series being lagged (or led) behind (before) the other one. Since an existing lag (lead) is not observed directly (as a linear feedback), it introduces a non-linear relationship between the two time-series, as the toy example below illustrates.
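Consider an illustrative case (a sketch only) in which y is simply x delayed by three periods plus some noise. The plain, zero-lag linear correlation is close to zero, while re-aligning the series at the right lag recovers a nearly perfect correlation:

[code lang="python" gutter="false"]
import numpy as np

np.random.seed(1)
x = np.random.randn(500)                        # driver time-series
y = np.roll(x, 3) + 0.2*np.random.randn(500)    # response delayed by 3 periods

print(np.corrcoef(x, y)[0, 1])           # zero-lag correlation: close to 0
print(np.corrcoef(x[:-3], y[3:])[0, 1])  # correlation at lag 3: close to 1
[/code]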
A degree of non-linear correlation can be measured by the application of two independent methods, namely, cross-correlations and cross-bicorrelations designed to examine the data dependence between any two (three) values led/lagged by $r$ ($s$) periods. By period one should understand the sampling frequency of the underlying time-series, for example a quarter, i.e. 3 month period.
Let $x(t) \equiv \{ x_t \} = \{ x_1, x_2, …, x_N \}$ for $t=1,…,N$ represent, e.g., the WTI Crude Oil price-series, and $\{ y_t \}$ the time-series of a selected quantity. For two time-series, $x(t)$ and $y(t)$, the cross-correlation is defined as:
$$
C_{xy}(r) = (N-r)^{-1} \sum_{i=1}^{N-r} x(t_i) y(t_i + r)
$$ whereas the cross-bicorrelation is usually assumed as:
$$
C_{xyy}(r,s) = (N-m)^{-1} \sum_{i=1}^{N-m} x(t_i) y(t_i + r) y(t_i + s)
$$ where $m=\max(r,s)$ (Brooks & Hinich 1999). Both formulae measure the inner product between the individual values of two time-series. In case of cross-correlation $C_{xy}(r)$ it is given by $\langle x(t_i) y(t_i + r) \rangle$ where the value of $y(t)$ is lagged by $r$ periods. The summation, similarly as in the concept of the Fourier transform, picks up the maximum power for a unique set of lags $r$ or $(r,s)$ for $C_{xy}(r)$ and $C_{xyy}(r,s)$, respectively. The use of cross-correlation method should return the greatest value of $C_{xy}(r)$ among all examined $r$’s. Here, $1 \le r < N$ in a first approximation. It reveals a direct non-linear correlation between $x(t_i)$ and $y(t_i + r)$ $\forall i$.
The concept of cross-bicorrelation goes one step further and examines the relationship between $x(t_i)$, $y(t_i + r)$, and $y(t_i + s)$, e.g. how the current oil price at time $t_i$, $x(t_i)$, affects $y$ lagged by $r$ and $s$ independent time periods. Note that cross-bicorrelation also allows for the $C_{xxy}(r,s)$ form, i.e. if we require to look for a relationship between the current and lagged (by $r$) values of the $x$ time-series and the $y$ time-series lagged by $s$ periods.
2.2. Statistical Tests for Cross-Correlation and Cross-Bicorrelation
The application of both tools to small data samples ($N < 50$) does not guarantee statistically significant non-linear correlations every single time the computations are done. The need for statistical testing of the results returned by both cross-correlation and cross-bicorrelation methods has been addressed by Hinich (1996) and Brooks & Hinich (1999).
For two time-series, with a weak stationarity, the null hypothesis $H_0$ for the test is that two series are independent pure white noise processes against the alternative hypothesis $H_1$ saying that some cross-covariances, $E[x(t_i) y(t_i + r)]$, or cross-bicovariances, $E[x(t_i) y(t_i + r) y(t_i + s)]$, are nonzero. The test statistics, respectively for cross-correlation and cross-bicorrelation, are given by:
$$
H_{xy}(N) = \sum_{r=1}^{L} (N-L) C_{xy}^2(r) = (N-L) \sum_{r=1}^{L} C_{xy}^2(r)
$$ and
$$
H_{xyy}(N) = \sum_{s=-L, \ s \neq -1,0,1}^{L} \ \sum_{r=1}^{L} (N-m)\, C_{xyy}^2(r,s)
$$ where it has been shown by Hinich (1996) that:
$$
H_{xy}(N) \sim \chi^2_L
$$ and
$$
H_{xyy}(N) \sim \chi^2_{(L-1)(L/2)}
$$ where $\sim$ means asymptotically distributed with the corresponding number of degrees of freedom as $N \rightarrow \infty$, and $m = \max(r, s)$ as previously. For both $H_{xy}(N)$ and $H_{xyy}(N)$ statistics, the number of lags $L$ is specified as $L = N^b$ where $0 < b < 0.5$. Based on the results of Monte-Carlo simulations, Hinich & Patterson (1995) recommended the use of $b=0.4$ in order to maximise the power of the test while ensuring a valid approximation to the asymptotic theory.
Further down, we compute $H_{xy}(N)$ and $H_{xyy}(N)$ statistics for every cross-correlation and cross-bicorrelation and determine the corresponding significance assuming the confidence level of 90% ($\alpha = 0.1$) if:
$$
\mbox{p-value} < \alpha
$$ where
$$
\mbox{p-value} = 1 - \mbox{Pr} \left[ \chi^2_{\mbox{dof}} < H_{x(y)y} \,|\, H_0 \right] \ .
$$ Given that, any resultant values of the cross-correlations and cross-bicorrelations can be regarded as statistically significant if the p-value $< \alpha$. A recommended number of lags to be investigated is $L = N^{0.4}$ where $N$ is the length of the $x(t)$ and $y(t)$ time-series. Employing floor rounding for $L$, we obtain the following relationship between $N$ and $L$ for sample sizes of $N \le 50$:
[code lang="python" gutter="false"]
L = []
for N in range(1, 51):
L.append(np.floor(N**0.4))
fig, ax = plt.subplots(figsize=(8,3))
plt.plot(np.arange(1, 51), L, '.')
plt.xlabel("N")
plt.ylabel("L")
ax.margins(x=0.025, y=0.05) # add extra padding
ax.tick_params(axis='y',which='minor',left='off') # without minor yticks
plt.grid(b=True, which='minor', color='k', linestyle=':', alpha=0.3)
plt.minorticks_on()
[/code]
2.3. Cross-Correlation and Cross-Bicorrelation in Python
Let’s translate cross-correlation into Python and run a simple test for some sample time-series:
import numpy as np
import scipy.stats

# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z

# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)
    return z
where we defined two functions deriving $C_{xy}(r)$ and $H_{xy}(N)$, respectively. Now, let’s consider some simplified time-series of $N=16$ points, both normalised to zero mean and unit variance:
# sample data
y = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
x = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)
As one can see, we expect a strong cross-correlation at the lag of $r = 2$ (the strongest), with $r = 1$ to be detected too. We verify the overall statistical significance of the cross-correlations for the given time-series in the following way:
N = len(x)
L = int(np.floor(N**0.4))

for r in range(1, L+1):
    print("r =%2g\t Cxy(r=%g) = %.4f" % (r, r, Cxy(x, y, r, N)))

dof = L
pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
alpha = 0.1

print("Hxy(N) = %.4f" % Hxy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % (100*(1-alpha), pvalue < alpha))
confirming both our expectations and significance of detected cross-correlations at 90% confidence level:
r = 1	 Cxy(r=1) = 0.0578
r = 2	 Cxy(r=2) = 0.8358
r = 3	 Cxy(r=3) = -0.3200
Hxy(N) = 10.4553
p-value = 0.0151
H0 rejected in favour of H1 at 90% c.l.: True
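As a quick sanity check of the asymptotic result $H_{xy}(N) \sim \chi^2_L$, we can simulate pairs of independent white-noise series and compare the empirical quantiles of the statistic with those of the $\chi^2_L$ distribution. This is only a sketch reusing the Cxy and Hxy functions defined above; the two printed numbers should be of a similar magnitude:

[code lang="python" gutter="false"]
import numpy as np
import scipy.stats

np.random.seed(5)
N = 50
L = int(np.floor(N**0.4))   # L = 4 for N = 50

# Monte-Carlo simulation of Hxy under the null hypothesis
# (two independent, standardised white-noise processes)
H = []
for k in range(2000):
    x = np.random.randn(N)
    y = np.random.randn(N)
    x = (x - np.mean(x))/np.std(x, ddof=1)
    y = (y - np.mean(y))/np.std(y, ddof=1)
    H.append(Hxy(x, y, L, N))

print("empirical 90th percentile of Hxy: %.2f" % np.percentile(H, 90))
print("chi2(dof=%d) 90th percentile:     %.2f" % (L, scipy.stats.chi2.ppf(0.90, L)))
[/code]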
For cross-bicorrelation we have:
# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z

# test statistic for cross-bicorrelation
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z

# sample data
y = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0])
x = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])

# normalise such that x, y ~ N(0,1)
x = (x - np.mean(x))/np.std(x, ddof=1)
y = (y - np.mean(y))/np.std(y, ddof=1)

N = len(x)  # len(x) should be equal to len(y)
L = int(np.floor(N**0.4))

for r in range(0, L+1):
    for s in range(r+1, L+1):
        print("r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f"
              % (r, s, r, s, Cxyy(x, y, r, s, N)))

dof = (L-1)*(L/2)
# the p-value of the cross-bicorrelation test is based on the Hxyy statistic
pvalue = 1 - scipy.stats.chi2.cdf(Hxyy(x, y, L, N), dof)
alpha = 0.1

print("Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
print("p-value = %.4f" % pvalue)
print("H0 rejected in favour of H1 at %g%% c.l.: %s" % (100*(1-alpha), pvalue < alpha))
where for the new sample data we expect to detect the two strongest cross-bicorrelations at the lag pairs $(r,s)$ equal to $(2,3)$ and $(1,2)$, respectively. By running the code we indeed obtain:
r = 0, s = 1	 Cxyy(r=0, s=1) = -0.18168
r = 0, s = 2	 Cxyy(r=0, s=2) = -0.23209
r = 0, s = 3	 Cxyy(r=0, s=3) = 0.05375
r = 1, s = 2	 Cxyy(r=1, s=2) = 0.14724
r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22576
r = 2, s = 3	 Cxyy(r=2, s=3) = 0.18276
Hxyy(N) = 5.3177
p-value = 0.0833
H0 rejected in favour of H1 at 90% c.l.: True
With these two examples we have shown both the cross-correlation and cross-bicorrelation methods in action. We have also pointed at the important dependency between the length of the time-series and the expected number of lags to be examined. The longer the time-series, the higher the number of degrees of freedom.
Given that state of play, there exist two independent and appealing ways of applying cross-correlation and cross-bicorrelation in finance: (1) the first one regards the examination of time-series with a small number of data points (e.g. $N \le 50$); (2) alternatively, if we deal with long ($N \ge 1000$) time-series, the common practice is to divide them into $k$ non-overlapping windows of length $N_i$ of at least $50$ to $100$ points, where $i=1,…,k$, and compute the cross-correlation and cross-bicorrelation metrics for every data window separately (a minimal sketch of this windowed approach follows below). In consequence, within this approach, we gain an opportunity to report on time-period dependent data segments with statistically significant (or not) non-linear correlations! (see e.g. Coronado et al. 2015).
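A minimal sketch of the second (windowed) approach could look as follows; it reuses the Cxy and Hxy functions defined above and assumes x and y are two long, equally sized numpy arrays of returns:

[code lang="python" gutter="false"]
# split two long time-series into non-overlapping windows of 100 points
# and test every window separately for (lagged) non-linear correlations
window = 100
alpha = 0.1
k = len(x) // window

for i in range(k):
    xw = x[i*window:(i+1)*window]
    yw = y[i*window:(i+1)*window]
    # standardise within the window
    xw = (xw - np.mean(xw)) / np.std(xw, ddof=1)
    yw = (yw - np.mean(yw)) / np.std(yw, ddof=1)
    N = window
    L = int(np.floor(N**0.4))
    h = Hxy(xw, yw, L, N)
    pvalue = 1 - scipy.stats.chi2.cdf(h, L)
    print("window %2d: Hxy = %7.3f, p-value = %.4f, significant: %s"
          % (i, h, pvalue, pvalue < alpha))
[/code]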
In the Section that follows, we solely focus on the application of the first approach. It is rewarding in many aspects if you are open-minded.
2.4. Detecting Non-Linearity between Oil Prices and Stock Fundamentals
Since looking for a direct correlation between crude oil prices and stocks can be regarded as the first approach to the examination of mutual relationships, it’s not sexy; not any more. With cross-correlation and cross-bicorrelation methods you can step up and break the boredom.
Stock fundamentals constitute a broad space of parameters that deserve to be cross-correlated with the oil prices. It is much more intuitive that any effect (if any) of rising or falling crude oil prices may affect different companies with various time lags. Moreover, for some businesses we might detect a non-linear relationship existing only between oil and, e.g., the company’s revenues, whereas for others, changing oil prices might affect more than one factor. If the pattern is indeed present, then with the application of the cross-correlation and cross-bicorrelation methods you gain a brand new tool for data analysis. Let’s have a look at a practical example.
As previously, we will need both the WTI Crude Oil price-series (see Section 1.1) and access to a large database containing companies’ fundamentals over a number of quarters or years. For the latter let me use a demo version of Quandl.com‘s Zacks Fundamentals Collection C (download the Zacks Fundamentals Expanded ZACKS-FE.csv file). It contains 852 rows of data for 30 selected companies traded on US stock markets (NYSE, NASDAQ, etc.). This sample collection covers data back to 2011, i.e. ca. 22 to 23 quarterly reported financials. There are over 620 fundamentals to select from; their full list is given in the ZFC Definitions csv file. Let’s put it all in a new Python listing:
# Non-Linear Cross-Bicorrelations between the Oil Prices and Stock Fundamentals
# (c) 2016 QuantAtRisk.com, by Pawel Lachowicz

import numpy as np
import pandas as pd
import scipy.stats
import datetime
from matplotlib import pyplot as plt
%matplotlib inline
grey = 0.7, 0.7, 0.7

import warnings
warnings.filterwarnings('ignore')

# WTI Crude Oil price-series
# src: https://fred.stlouisfed.org/series/DCOILWTICO/downloaddata
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
wpi = pd.read_csv("DCOILWTICO.csv", parse_dates=['DATE'], date_parser=dateparse)
wpi = wpi[wpi.VALUE != "."]
wpi.VALUE = wpi.VALUE.astype(float)

# Zacks Fundamentals Collection C (Fundamentals Expanded)
# src: https://www.quandl.com/databases/ZFC
df = pd.read_csv('ZACKS-FE.csv')

col = ['Ticker', 'per_end_date', 'per_code', 'Revenue', "EBIT", "GrossProfit", "CAPEX"]
data = pd.DataFrame(columns=col)  # create an empty DataFrame

tickers = ["XOM", "WMT", "BA", "AAPL", "JNJ"]

for i in range(df.shape[0]):
    ticker = df.iloc[i]["ticker"]
    pc = df.iloc[i]["per_code"]
    # extract only reported fundamentals for quarter "QR0" and all "QR-" quarters,
    # in total 22 to 23 rows of data per stock
    if((pc[0:3] == "QR-") | (pc[0:3] == "QR0")) and (ticker in tickers):
        data.loc[len(data)] = [df.iloc[i]["ticker"],
                               df.iloc[i]["per_end_date"],
                               df.iloc[i]["per_code"],
                               df.iloc[i]["tot_revnu"],
                               df.iloc[i]["ebit"],
                               df.iloc[i]["gross_profit"],
                               df.iloc[i]["cap_expense"]]

print(data.head(25))
where we created a new DataFrame, data, storing Total Revenue, EBIT, Gross Profit (revenue from operations less associated costs), and CAPEX (capital expenditure) for a sample of 5 stocks (XOM, WMT, BA, AAPL, and JNJ) out of the 30 available in that demo collection. Our goal here is just to show a path for a possible analysis of all those numbers, with an encouragement to build a bigger tool for stock screening based on the detection of non-linear correlations amongst selected fundamentals and oil prices.
The first 25 lines of the data DataFrame are:
   Ticker per_end_date per_code  Revenue     EBIT  GrossProfit    CAPEX
0    AAPL   2011-03-31    QR-22  24667.0   7874.0      10218.0  -1838.0
1    AAPL   2011-06-30    QR-21  28571.0   9379.0      11922.0  -2615.0
2    AAPL   2011-09-30    QR-20  28270.0   8710.0      11380.0  -4260.0
3    AAPL   2011-12-31    QR-19  46333.0  17340.0      20703.0  -1321.0
4    AAPL   2012-03-31    QR-18  39186.0  15384.0      18564.0  -2778.0
5    AAPL   2012-06-30    QR-17  35023.0  11573.0      14994.0  -4834.0
6    AAPL   2012-09-30    QR-16  35966.0  10944.0      14401.0  -8295.0
7    AAPL   2012-12-31    QR-15  54512.0  17210.0      21060.0  -2317.0
8    AAPL   2013-03-31    QR-14  43603.0  12558.0      16349.0  -4325.0
9    AAPL   2013-06-30    QR-13  35323.0   9201.0      13024.0  -6210.0
10   AAPL   2013-09-30    QR-12  37472.0  10030.0      13871.0  -8165.0
11   AAPL   2013-12-31    QR-11  57594.0  17463.0      21846.0  -1985.0
12   AAPL   2014-03-31    QR-10  45646.0  13593.0      17947.0  -3367.0
13   AAPL   2014-06-30     QR-9  37432.0  10282.0      14735.0  -5745.0
14   AAPL   2014-09-30     QR-8  42123.0  11165.0      16009.0  -9571.0
15   AAPL   2014-12-31     QR-7  74599.0  24246.0      29741.0  -3217.0
16   AAPL   2015-03-31     QR-6  58010.0  18278.0      23656.0  -5586.0
17   AAPL   2015-06-30     QR-5  49605.0  14083.0      19681.0  -7629.0
18   AAPL   2015-09-30     QR-4  51501.0  14623.0      20548.0 -11247.0
19   AAPL   2015-12-31     QR-3  75872.0  24171.0      30423.0  -3612.0
20   AAPL   2016-03-31     QR-2  50557.0  13987.0      19921.0  -5948.0
21   AAPL   2016-06-30     QR-1  42358.0  10105.0      16106.0  -8757.0
22   AAPL   2016-09-30      QR0  46852.0  11761.0      17813.0 -12734.0
23     BA   2011-03-31    QR-22  14910.0   1033.0       2894.0   -417.0
24     BA   2011-06-30    QR-21  16543.0   1563.0       3372.0   -762.0
An exemplary visualisation of both data sets for a selected company (here: Exxon Mobil Corporation, XOM, engaged in the exploration and production of crude oil and natural gas, the manufacturing of petroleum products, and the transportation and sale of crude oil, natural gas, and petroleum products) we obtain with:
fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WTI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")

selection = ["XOM"]

ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] = pd.to_datetime(tmp['per_end_date'], format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")

plt.legend(loc=3)
Every company has different dates (per_end_date) on which it reports its financials. That is why we need to make sure that, in the process of comparing the selected fundamentals with the oil prices, the oil price is picked up around the date defined by per_end_date. Let me use a $\pm 5$ day average around those dates:
# cross-correlation
def Cxy(x, y, r, N):
    z = 0
    for i in range(0, N-r-1):
        z += x[i] * y[i+r]
    z /= (N-r)
    return z

# test statistic for cross-correlation
def Hxy(x, y, L, N):
    z = 0
    for r in range(1, L+1):
        z += Cxy(x, y, r, N)**2
    z *= (N - L)
    return z

# cross-bicorrelation
def Cxyy(x, y, r, s, N):
    z = 0
    m = np.max([r, s])
    for i in range(0, N-m-1):
        z += x[i] * y[i+r] * y[i+s]
    z /= (N-m)
    return z

# test statistic for cross-bicorrelation
def Hxyy(x, y, L, N):
    z = 0
    for s in range(2, L+1):
        for r in range(1, L+1):
            m = np.max([r, s])
            z += (N-m) * Cxyy(x, y, r, s, N)**2
    return z

fig, ax1 = plt.subplots(figsize=(10,5))
plt.plot(wpi.DATE, wpi.VALUE, color=grey, label="WTI Crude Oil (daily)")
plt.grid(True)
plt.legend(loc=1)
plt.ylabel("USD per barrel")

selection = ["XOM"]

ax2 = ax1.twinx()
plt.ylabel("1e6 USD")
for ticker in selection:
    tmp = data[data.Ticker == ticker]
    lab = ticker
    tmp['per_end_date'] = pd.to_datetime(tmp['per_end_date'], format='%Y-%m-%d')
    plt.plot(tmp.per_end_date, tmp.Revenue, '.-', label=lab + ": Revenue")
    plt.plot(tmp.per_end_date, tmp.EBIT, '.-g', label=lab + ": EBIT")
    plt.plot(tmp.per_end_date, tmp.GrossProfit, '.-m', label=lab + ": Gross Profit")
    plt.plot(tmp.per_end_date, tmp.CAPEX, '.-r', label=lab + ": CAPEX")

plt.legend(loc=3)

# +/- 5 day average of the oil price around each reporting date
col = ['per_end_date', 'avgWPI']
wpi2 = pd.DataFrame(columns=col)  # create an empty DataFrame

for i in range(tmp.shape[0]):
    date = tmp.iloc[i]["per_end_date"]
    date0 = date
    date1 = date + datetime.timedelta(days=-5)
    date2 = date + datetime.timedelta(days=+5)
    wpiavg = wpi[(wpi.DATE >= date1) & (wpi.DATE <= date2)]
    avg = np.mean(wpiavg.VALUE)
    wpi2.loc[len(wpi2)] = [date0, avg]

# plot quarterly averaged oil price-series
plt.sca(ax1)  # set ax1 axis for plotting
plt.plot(wpi2.per_end_date, wpi2.avgWPI, '.-k')
where the black line denotes the quarterly changing WTI Crude Oil prices stored now in a new DataFrame, wpi2.
Let’s leave only XOM in the selection list variable. After the loop over selection, the DataFrame tmp stores XOM’s selected fundamentals data. In what follows we will use both DataFrames (tmp and wpi2) to look for non-linear correlations as the final goal of this investigation:
# select some factors
fundamentals = ["Revenue", "EBIT", "GrossProfit", "CAPEX"]

# compute for them cross-correlations and cross-bicorrelations
# and display the final results
#
for f in fundamentals:
    print("%s: %s\n" % (ticker, f))
    # input data
    x = wpi2.avgWPI.values
    y = tmp[f].values

    # normalised time-series
    x = (x - np.mean(x))/np.std(x, ddof=1)
    y = (y - np.mean(y))/np.std(y, ddof=1)

    N = len(x)  # len(x) should be equal to len(y)
    L = int(np.floor(N**0.4))

    print("Cross-Correlation")
    for r in range(1, L+1):
        print(" r =%2g\t Cxy(r=%g) = %.4f" % (r, r, Cxy(x, y, r, N)))
    dof = L
    pvalue = 1 - scipy.stats.chi2.cdf(Hxy(x, y, L, N), dof)
    alpha = 0.1
    print(" Hxy(N) = %.4f" % Hxy(x, y, L, N))
    print(" p-value = %.4f" % pvalue)
    print(" H0 rejected in favour of H1 at %g%% c.l.: %s"
          % (100*(1-alpha), pvalue < alpha))

    print("Cross-Bicorrelation")
    for r in range(0, L+1):
        for s in range(r+1, L+1):
            print(" r =%2g, s =%2g\t Cxyy(r=%g, s=%g) = %.5f"
                  % (r, s, r, s, Cxyy(x, y, r, s, N)))
    dof = (L-1)*(L/2)
    # the p-value of the cross-bicorrelation test is based on the Hxyy statistic
    pvalue = 1 - scipy.stats.chi2.cdf(Hxyy(x, y, L, N), dof)
    print(" Hxyy(N) = %.4f" % Hxyy(x, y, L, N))
    print(" p-value = %.4f" % pvalue)
    print(" H0 rejected in favour of H1 at %g%% c.l.: %s\n"
          % (100*(1-alpha), pvalue < alpha))
Our Python code generates for XOM the following output:
XOM: Revenue

Cross-Correlation
 r = 1	 Cxy(r=1) = 0.7869
 r = 2	 Cxy(r=2) = 0.6239
 r = 3	 Cxy(r=3) = 0.4545
 Hxy(N) = 24.3006
 p-value = 0.0000
 H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
 r = 0, s = 1	 Cxyy(r=0, s=1) = -0.44830
 r = 0, s = 2	 Cxyy(r=0, s=2) = -0.30186
 r = 0, s = 3	 Cxyy(r=0, s=3) = -0.19315
 r = 1, s = 2	 Cxyy(r=1, s=2) = -0.36678
 r = 1, s = 3	 Cxyy(r=1, s=3) = -0.22161
 r = 2, s = 3	 Cxyy(r=2, s=3) = -0.24446
 Hxyy(N) = 10.3489
 p-value = 0.0000
 H0 rejected in favour of H1 at 90% c.l.: True

XOM: EBIT

Cross-Correlation
 r = 1	 Cxy(r=1) = 0.7262
 r = 2	 Cxy(r=2) = 0.5881
 r = 3	 Cxy(r=3) = 0.3991
 Hxy(N) = 20.6504
 p-value = 0.0001
 H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
 r = 0, s = 1	 Cxyy(r=0, s=1) = -0.39049
 r = 0, s = 2	 Cxyy(r=0, s=2) = -0.25963
 r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17480
 r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28698
 r = 1, s = 3	 Cxyy(r=1, s=3) = -0.18030
 r = 2, s = 3	 Cxyy(r=2, s=3) = -0.21284
 Hxyy(N) = 6.8759
 p-value = 0.0001
 H0 rejected in favour of H1 at 90% c.l.: True

XOM: GrossProfit

Cross-Correlation
 r = 1	 Cxy(r=1) = 0.7296
 r = 2	 Cxy(r=2) = 0.5893
 r = 3	 Cxy(r=3) = 0.3946
 Hxy(N) = 20.7065
 p-value = 0.0001
 H0 rejected in favour of H1 at 90% c.l.: True
Cross-Bicorrelation
 r = 0, s = 1	 Cxyy(r=0, s=1) = -0.38552
 r = 0, s = 2	 Cxyy(r=0, s=2) = -0.24803
 r = 0, s = 3	 Cxyy(r=0, s=3) = -0.17301
 r = 1, s = 2	 Cxyy(r=1, s=2) = -0.28103
 r = 1, s = 3	 Cxyy(r=1, s=3) = -0.17393
 r = 2, s = 3	 Cxyy(r=2, s=3) = -0.19780
 Hxyy(N) = 6.2095
 p-value = 0.0001
 H0 rejected in favour of H1 at 90% c.l.: True

XOM: CAPEX

Cross-Correlation
 r = 1	 Cxy(r=1) = -0.2693
 r = 2	 Cxy(r=2) = -0.3024
 r = 3	 Cxy(r=3) = -0.2333
 Hxy(N) = 4.3682
 p-value = 0.2244
 H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
 r = 0, s = 1	 Cxyy(r=0, s=1) = 0.00844
 r = 0, s = 2	 Cxyy(r=0, s=2) = -0.10889
 r = 0, s = 3	 Cxyy(r=0, s=3) = -0.14787
 r = 1, s = 2	 Cxyy(r=1, s=2) = -0.10655
 r = 1, s = 3	 Cxyy(r=1, s=3) = -0.16628
 r = 2, s = 3	 Cxyy(r=2, s=3) = -0.08072
 Hxyy(N) = 6.4376
 p-value = 0.2244
 H0 rejected in favour of H1 at 90% c.l.: False
A quick analysis points at the detection of significant non-linear correlations (at the 90% confidence level) between WTI Crude Oil and Revenue, EBIT, and GrossProfit for lag $r=1$ (based on cross-correlations) and for lags $(r=0, s=1)$ (based on cross-bicorrelations). There is no significant non-linear relationship between XOM’s CAPEX and the oil prices.
Interestingly, rerunning the code for Boeing Company (NYSE ticker: BA) reveals no significant cross-(bi)correlations between the same factors and oil. For example, Boeing’s revenue improves year-over-year,
but it appears not to have any significant non-linear link to the oil prices whatsoever:
BA: Revenue

Cross-Correlation
 r = 1	 Cxy(r=1) = -0.3811
 r = 2	 Cxy(r=2) = -0.3216
 r = 3	 Cxy(r=3) = -0.1494
 Hxy(N) = 5.4200
 p-value = 0.1435
 H0 rejected in favour of H1 at 90% c.l.: False
Cross-Bicorrelation
 r = 0, s = 1	 Cxyy(r=0, s=1) = 0.11202
 r = 0, s = 2	 Cxyy(r=0, s=2) = 0.03588
 r = 0, s = 3	 Cxyy(r=0, s=3) = -0.08083
 r = 1, s = 2	 Cxyy(r=1, s=2) = 0.01526
 r = 1, s = 3	 Cxyy(r=1, s=3) = 0.00616
 r = 2, s = 3	 Cxyy(r=2, s=3) = -0.07252
 Hxyy(N) = 0.3232
 p-value = 0.1435
 H0 rejected in favour of H1 at 90% c.l.: False
The most interesting aspect of cross-correlation and cross-bicorrelation methods applied in oil vs. stock fundamentals research regards new opportunities to discover correlations for stocks that so far have been completely ignored (or not suspected) to have anything in common with oil price movements in the markets. Dare to go this path. I hope you will find new assets to invest in. The information is there. Now you are equipped with new tools. Use them wisely. Retire early.
REFERENCES
Bernanke B. S., 2016, The relationship between stocks and oil prices
Brooks C., Hinich M. J., 1999, Cross-correlations and cross-bicorrelations in Sterling exchange rates
Coronado et al., 2015, A study of co-movements between USA and Latin American stock markets: a cross-bicorrelations perspective
Hamilton J., 2014, Oil prices as an indicator of global economic conditions
Hinich M. J., 1996, Testing for dependence in the input to a linear time series model, Journal of Nonparametric Statistics, 6, 205–221
Hinich M. J., Patterson D. M., 1995, Detecting epochs of transient dependence in white noise, Mimeo, University of Texas at Austin
DOWNLOADS
COPPER.csv, DCOILWTICO.csv, DGS10.csv, DTWEXM.csv, VIXCLS.csv, ZACKS-FE.csv