When I lived in Singapore in 2010, my life path crossed with some great guys working in the field of High-Performance Computing: Łukasz Orłowski, Marek T. Michalewicz, and Iain Bell from Quadrant Capital. They pointed my attention towards GPU computations utilising the Nvidia CUDA architecture. There was one problem. Everything was wrapped up in C syntax, with a promise to do it more efficiently in C++. Soon.
So I waited, and studied C/C++ at least to the level that allowed me to understand some CUDA code. Years passed by until the day I discovered an article by Mark Harris, NumbaPro: High-Performance Python with CUDA Acceleration, delivering a Python-friendly CUDA solution to all my nightmares involving C/C++ coding. At this point you just need to understand one thing: not every quant or algo trader is a fan of C/C++, just as some people, myself included, prefer Volvo to Audi ;)
Let’s have a sincere look at what the game is about. It is more than tempting to get your hands on a piece of code that allows you to speed up some quantitative computations in Python, making use of a new library.
Accelerate your Python
I absolutely love Continuum Analytics for the mission they stand for: making the Python language easily accessible to, and used by, everyone worldwide! It is a great language with a great syntax, easy to pick up and easy to use while learning the fundamentals of programming. Thanks to them, you can now download and install their Anaconda distribution of Python for Windows, Mac OS X, or Linux in just a few minutes.
When you visit their webpage you can spot Anaconda’s Add-Ons, three additional software packages for their Python distribution. Among them, they offer the Accelerate module containing the NumbaPro library. Once you read the description, and I quote,
Accelerate is an add-on to Continuum’s free enterprise Python distribution, Anaconda. It opens up the full capabilities of your GPU or multi-core processor to Python. Accelerate includes two packages that can be added to your Python installation: NumbaPro and MKL Optimizations. MKL Optimizations makes linear algebra, random number generation, Fourier transforms, and many other operations run faster and in parallel. NumbaPro builds fast GPU and multi-core machine code from easy-to-read Python and NumPy code with a Python-to-GPU compiler.
NumbaPro Features
– NumbaPro compiler targets multi-core CPU and GPUs directly from simple Python syntax
– Easily move vectorized NumPy functions to the GPU
– Multiple CUDA device support
– Bindings for CUDA libraries, including cuBlas, cuRand, cuSparse, and cuFFT
– Support for array slicing and fast array math
– Use multiple threads without worrying about the GIL
– Supported on NVIDIA CUDA-enabled GPUs with compute capability 2.0 or above on Intel/AMD (x86) processors.
your blood pressure increases and your endorphin level skyrockets. Why? Simply because of the promise to perform some tasks faster by utilising the GPU in parallel! If you are new to GPUs or CUDA, I recommend reading some well-written posts on Mike’s website, for instance, Installing Nvidia CUDA on Mac OSX for GPU-based Parallel Computing or Monte Carlo Simulations in CUDA – Barrier Option Pricing. You will grasp the essence of what it is all about. In general, much of the ado about CUDA is still about making use of your GPU and proving that extra oomph in speed-up. If you have any quantitative problem in mind that can be executed in parallel, NumbaPro is a tool you need to look at, but not every engine sounds the same. Hold on till the end of this post. It will be worth it.
Selling the Speed
When you approach a new concept or a new product and someone tries to sell it to you, they need to impress you to win your attention and boost your curiosity. Imagine for a moment that you have no idea about GPUs or CUDA and you want to add two vectors. In Python you can do it as follows:
import numpy as np
from timeit import default_timer as timer

def VectorAdd(a, b, c):
    for i in xrange(a.size):
        c[i] = a[i] + b[i]

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    VectorAdd(A, B, C)
    totaltime = timer() - start

    print("\nCPU time: %g sec" % totaltime)

if __name__ == '__main__':
    main()
We all know that once you run the code, Python does not compile it; it goes line by line and interprets what it reads. So, we aim at adding two vectors, $A$ and $B$, each containing 32 million elements. On my MacBook Pro (2.6 GHz Intel Core i7, 16 GB 1600 MHz DDR3 RAM, NVIDIA GeForce GT 650M 1GB) I get the resulting vector $C=A+B$ after:
CPU time: 9.89753 sec
Can we do better? With NumbaPro, the required changes to the code itself are minor. All we need to add is a function decorator that tells how and where the function should be executed. In fact, what NumbaPro does is “compile” the VectorAdd function on the fly and deploy the computations to the GPU:
import numpy as np
from timeit import default_timer as timer
from numbapro import vectorize

@vectorize(["float32(float32,float32)"], target="gpu")
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    totaltime = timer() - start

    print("\nGPU time: %g sec" % totaltime)

if __name__ == '__main__':
    main()
We get
GPU time: 0.286101 sec
i.e. a 34.6x speed-up. Not bad, right?! Not bad if you’re a salesperson indeed! But, hey, what about this?:
import numpy as np
from timeit import default_timer as timer

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = A + B
    totaltime = timer() - start

    print("\nCPU time: %g sec" % totaltime)

if __name__ == '__main__':
    main()
Run it to discover that:
CPU time: 0.0592878 sec
i.e. 4.82x faster than using the GPU. Oh, boy! CUDA:NumPy (0:1).
Perfect Pitch
When Nvidia introduced CUDA, among the exemplary C codes utilising CUDA programming one could find the immortal Black-Scholes model for option pricing. In this Nobel-prize-winning solution, we derive the price of a call option on a non-dividend-paying stock:
$$
C(S,t) = N(d_1)S - N(d_2)Ke^{-r(T-t)}
$$
with
$$
d_1 = \frac{\ln(S/K) + \left(r + \frac{\sigma^2}{2}\right)(T-t)}{\sigma\sqrt{T-t}},
\qquad
d_2 = d_1 - \sigma\sqrt{T-t},
$$
where $(T-t)$ is the time to maturity (scalar), $r$ is the risk-free rate (scalar), $S$ is the spot price of the underlying asset, $K$ is the strike price, and $\sigma$ is the volatility of returns of the underlying asset. $N(\cdot)$ is the cumulative distribution function (cnd) of the standard normal distribution, evaluated in the code below with a closed-form polynomial approximation. A classical way to code it in Python is:
import numpy as np
import time

RISKFREE = 0.02
VOLATILITY = 0.30

def cnd(d):
    A1 = 0.31938153
    A2 = -0.356563782
    A3 = 1.781477937
    A4 = -1.821255978
    A5 = 1.330274429
    RSQRT2PI = 0.39894228040143267793994605993438
    K = 1.0 / (1.0 + 0.2316419 * np.abs(d))
    ret_val = (RSQRT2PI * np.exp(-0.5 * d * d) *
               (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
    return np.where(d > 0, 1.0 - ret_val, ret_val)

def black_scholes(callResult, putResult, stockPrice, optionStrike,
                  optionYears, Riskfree, Volatility):
    S = stockPrice
    X = optionStrike
    T = optionYears
    R = Riskfree
    V = Volatility
    sqrtT = np.sqrt(T)
    d1 = (np.log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT)
    d2 = d1 - V * sqrtT
    cndd1 = cnd(d1)
    cndd2 = cnd(d2)
    expRT = np.exp(- R * T)
    callResult[:] = (S * cndd1 - X * expRT * cndd2)

def randfloat(rand_var, low, high):
    return (1.0 - rand_var) * low + rand_var * high

def main(*args):
    OPT_N = 4000000
    iterations = 10
    if len(args) >= 2:
        iterations = int(args[0])

    callResult = np.zeros(OPT_N)
    putResult = -np.ones(OPT_N)   # placeholder; puts are not computed in this version
    stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0)
    optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0)
    optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0)

    time0 = time.time()
    for i in range(iterations):
        black_scholes(callResult, putResult, stockPrice, optionStrike,
                      optionYears, RISKFREE, VOLATILITY)
    time1 = time.time()
    print("Time: %f msec per option" % ((time1 - time0) / iterations / OPT_N * 1000))

if __name__ == "__main__":
    import sys
    main(*sys.argv[1:])
which returns
Time: 0.000192 msec per option
The essence of this code is to derive 4 million independent results by feeding the function with random stock prices, option strike prices, and times to maturity. They enter the game as row vectors with randomised values (see the three randfloat calls in main). Anaconda Accelerate’s CUDA solution for the same problem is:
import numpy as np
import math
import time
from numba import *
from numbapro import cuda
from blackscholes import black_scholes   # save the previous code as blackscholes.py

RISKFREE = 0.02
VOLATILITY = 0.30

A1 = 0.31938153
A2 = -0.356563782
A3 = 1.781477937
A4 = -1.821255978
A5 = 1.330274429
RSQRT2PI = 0.39894228040143267793994605993438

@cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)
def cnd_cuda(d):
    K = 1.0 / (1.0 + 0.2316419 * math.fabs(d))
    ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) *
               (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
    if d > 0:
        ret_val = 1.0 - ret_val
    return ret_val

@cuda.jit(argtypes=(double[:], double[:], double[:], double[:], double[:],
                    double, double))
def black_scholes_cuda(callResult, putResult, S, X, T, R, V):
    # S = stockPrice
    # X = optionStrike
    # T = optionYears
    # R = Riskfree
    # V = Volatility
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i >= S.shape[0]:
        return
    sqrtT = math.sqrt(T[i])
    d1 = (math.log(S[i] / X[i]) + (R + 0.5 * V * V) * T[i]) / (V * sqrtT)
    d2 = d1 - V * sqrtT
    cndd1 = cnd_cuda(d1)
    cndd2 = cnd_cuda(d2)
    expRT = math.exp((-1. * R) * T[i])
    callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2)

def randfloat(rand_var, low, high):
    return (1.0 - rand_var) * low + rand_var * high

def main(*args):
    OPT_N = 4000000
    iterations = 10

    callResultNumpy = np.zeros(OPT_N)
    putResultNumpy = -np.ones(OPT_N)
    stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0)
    optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0)
    optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0)
    callResultNumba = np.zeros(OPT_N)
    putResultNumba = -np.ones(OPT_N)
    callResultNumbapro = np.zeros(OPT_N)
    putResultNumbapro = -np.ones(OPT_N)

    # Numpy ------------------------------------------------------------------
    time0 = time.time()
    for i in range(iterations):
        black_scholes(callResultNumpy, putResultNumpy, stockPrice,
                      optionStrike, optionYears, RISKFREE, VOLATILITY)
    time1 = time.time()
    dtnumpy = ((1000 * (time1 - time0)) / iterations) / OPT_N
    print("\nNumpy Time %f msec per option" % dtnumpy)

    # CUDA -------------------------------------------------------------------
    time0 = time.time()
    blockdim = 1024, 1
    griddim = int(math.ceil(float(OPT_N) / blockdim[0])), 1
    stream = cuda.stream()
    d_callResult = cuda.to_device(callResultNumbapro, stream)
    d_putResult = cuda.to_device(putResultNumbapro, stream)
    d_stockPrice = cuda.to_device(stockPrice, stream)
    d_optionStrike = cuda.to_device(optionStrike, stream)
    d_optionYears = cuda.to_device(optionYears, stream)
    time2 = time.time()
    for i in range(iterations):
        black_scholes_cuda[griddim, blockdim, stream](
            d_callResult, d_putResult, d_stockPrice, d_optionStrike,
            d_optionYears, RISKFREE, VOLATILITY)
        d_callResult.to_host(stream)
        d_putResult.to_host(stream)
        stream.synchronize()
    time3 = time.time()
    dtcuda = ((1000 * (time3 - time2)) / iterations) / OPT_N
    print("Numbapro CUDA Time %f msec per option (speed-up %.1fx) \n"
          % (dtcuda, dtnumpy / dtcuda))
    # print(callResultNumbapro)

if __name__ == "__main__":
    import sys
    main(*sys.argv[1:])
returning
Numpy Time 0.000186 msec per option
Numbapro CUDA Time 0.000024 msec per option (speed-up 7.7x)
It is not so difficult to understand why CUDA wins over NumPy this time. First, we have a programmable analytical form of the problem. We deploy it to the GPU and perform exhaustive calculations involving the cnd_cuda function for the estimation of the cumulative distribution function of the standard normal distribution. Splitting the task into many concurrently running threads on the GPU reduces the time. Again, this is possible because all option prices can be computed independently. CUDA:NumPy (1:1).
Multiplied Promises
In finance, the concept of portfolio optimization is well established. The idea behind it is to find a vector of weights, $w$, over all assets such that the estimated portfolio risk ($\sigma_P$) and return ($\mu_P$) meet our needs or expectations.
An alternative (but not greatly recommended) approach would involve optimization through randomisation of $w$ vectors. We could generate a large number of them, say $N$, in order to obtain:
$$
\mu_{P,i} = m w_i^T \ \ \ \mbox{and} \ \ \ \sigma_{P,i}^2 = w_i M_2 w_i^T
$$ for $i=1,…,N$. Here, $m$ is a row vector holding the estimated expected returns of all assets in portfolio $P$, and $M_2$ is the $(M\times M)$ covariance matrix estimated from the return series, where $M$ is the number of assets in $P$. In the first case we multiply the row vector $m$ by a transposed row vector of weights, whereas in the latter we first multiply the row vector of weights by the square matrix. If $N$ is really big, say a couple of million, the computation could presumably be accelerated using the GPU. Therefore, we need a code for matrix multiplication on the GPU in Python.
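Before reaching for the GPU, a minimal NumPy sketch of that randomised-weights idea looks as follows (the asset count, synthetic return series, and weight normalisation are illustrative assumptions only, not the setup benchmarked later in this post):

import numpy as np

M = 10        # number of assets in P (illustrative)
N = 100000    # number of random weight vectors (illustrative)

# estimated expected returns (row vector m) and covariance matrix M2,
# built here from a synthetic return series just for the example
ret = np.random.randn(M, 250) * 0.01
m = ret.mean(axis=1)
M2 = np.cov(ret)

# N random weight vectors, each row normalised to sum to 1
w = np.random.random((N, M))
w /= w.sum(axis=1)[:, None]

mu_P = w.dot(m)                              # mu_{P,i}      = m w_i^T
var_P = np.einsum('ij,jk,ik->i', w, M2, w)   # sigma_{P,i}^2 = w_i M2 w_i^T
sig_P = np.sqrt(var_P)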
Let’s consider the more general problem of ($K\times K$)$\times$($K\times K$) matrix multiplication. If that works faster, then our random-portfolios problem should be even faster. Continuum Analytics provides a ready-to-use solution:
import numpy as np
from numbapro import cuda
import numba
from timeit import default_timer as timer
from numba import float32

bpg = 32
tpb = 32
n = bpg * tpb

shared_mem_size = (tpb, tpb)
griddim = bpg, bpg
blockdim = tpb, tpb

@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def naive_matrix_mult(A, B, C):
    x, y = cuda.grid(2)
    if x >= n or y >= n:
        return

    C[y, x] = 0
    for i in range(n):
        C[y, x] += A[y, i] * B[i, x]

@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def optimized_matrix_mult(A, B, C):
    # Declare shared memory
    sA = cuda.shared.array(shape=shared_mem_size, dtype=float32)
    sB = cuda.shared.array(shape=shared_mem_size, dtype=float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    x, y = cuda.grid(2)

    acc = 0
    for i in range(bpg):
        if x < n and y < n:
            # Prefill cache
            sA[ty, tx] = A[y, tx + i * tpb]
            sB[ty, tx] = B[ty + i * tpb, x]

        # Synchronize all threads in the block
        cuda.syncthreads()

        if x < n and y < n:
            # Compute product
            for j in range(tpb):
                acc += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish the computation
        cuda.syncthreads()

    if x < n and y < n:
        C[y, x] = acc

# Prepare data on the CPU
A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, n)), dtype=np.float32)

print "(%d x %d) x (%d x %d)" % (n, n, n, n)

# Prepare data on the GPU
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)

# Time the unoptimized version
s = timer()
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
unopt_ans = dC.copy_to_host()
tcuda_unopt = e - s

# Time the optimized version
s = timer()
optimized_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
opt_ans = dC.copy_to_host()
tcuda_opt = e - s

assert np.allclose(unopt_ans, opt_ans)

print "CUDA without shared memory:", "%.2f" % tcuda_unopt, "s"
print "CUDA with shared memory :", "%.2f" % tcuda_opt, "s"

s = timer()
np.dot(A, B)
e = timer()
npt = e - s

print "NumPy dot product :", "%.2f" % npt, "s"
which returns
(1024 x 1024) x (1024 x 1024)
CUDA without shared memory: 0.76 s
CUDA with shared memory : 0.25 s
NumPy dot product : 0.06 s
and leads to a CUDA:NumPy (1:2) score in the game. Natural questions arise. Is it about the matrix size? Maybe the problem is too simple and we are trying to solve it with an improper tool? Or is it the way we approach matrix allocation and deployment to the GPU itself?
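One way to probe the matrix-size question on the NumPy side alone is a quick scan over $K$ (the sizes below are just an illustration):

import numpy as np
from timeit import default_timer as timer

# time NumPy's dot product for a few square matrix sizes
for n in (256, 512, 1024, 2048, 4096):
    A = np.random.random((n, n)).astype(np.float32)
    B = np.random.random((n, n)).astype(np.float32)
    s = timer()
    np.dot(A, B)
    e = timer()
    print("(%4d x %4d) x (%4d x %4d): %.4f s" % (n, n, n, n, e - s))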
The last question made me dig deeper. In Python you can create one big array holding a number of smaller matrices. The following code tries to perform 4 million $(2\times 2)$ matrix multiplications where matrix $B$ is randomised every single time (cf. our random-portfolio problem). In Anaconda Accelerate we achieve it as follows:
import numbapro
import numba.cuda
import numpy as np
from timeit import default_timer as timer

# Use the builtin matrix_multiply in NumPy for CPU test
import numpy.core.umath_tests as ut

@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m, n),(n, p)->(m, p)', target='gpu')
def batch_matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                tmp += a[i, n] * b[n, j]
            c[i, j] = tmp

def main():
    n = 4000000
    dim = 2
    sK = 0
    KN = 10
    for K in range(KN):

        # Matrix Multiplication: c = a x b
        a = np.random.random(n * dim * dim).astype(np.float32).reshape(n, dim, dim)
        c = np.random.random(n * dim * dim).astype(np.float32).reshape(n, dim, dim)

        # NUMPY --------------------------------------------------------------
        start = timer()
        b = np.random.random(n * dim * dim).astype(np.float32).reshape(n, dim, dim)
        d = ut.matrix_multiply(a, b)
        np_time = timer() - start

        # CUDA ---------------------------------------------------------------
        dc = numba.cuda.device_array_like(c)
        da = numba.cuda.to_device(a)

        start = timer()
        b = np.random.random(n * dim * dim).astype(np.float32).reshape(n, dim, dim)
        db = numba.cuda.to_device(b)
        batch_matrix_mult(da, db, out=dc)
        numba.cuda.synchronize()
        dc.copy_to_host(c)
        cuda_time = timer() - start

        sK += np_time / cuda_time
        del da, db

    print("\nThe average CUDA speed-up: %.5fx" % (sK / KN))

if __name__ == '__main__':
    main()
leading us to
The average CUDA speed-up: 0.79003x
i.e. a deceleration. Playing with the sizes of the matrices and their number may also end in an error caused by insufficient GPU memory for the allocation of matrices $A$ and $B$. It seems we have CUDA:NumPy (1:3).
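A rough back-of-the-envelope check explains the memory errors: all three batched arrays have to fit on the card at once (the helper below is just an illustration):

def batch_footprint_mb(n, dim, itemsize=4, arrays=3):
    # n matrices of shape (dim x dim), float32 (4 bytes each), for a, b and c
    return arrays * n * dim * dim * itemsize / 1024.0 ** 2

print("%.0f MB" % batch_footprint_mb(4000000, 2))   # ~183 MB: fits a 1 GB GT 650M
print("%.0f MB" % batch_footprint_mb(4000000, 8))   # ~2930 MB: already far too much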
Black Magic in Black Box
I approached the NumbaPro solution as a complete rookie. I spent a considerable amount of time searching for Anaconda Accelerate’s GPU codes demonstrating the massive speed-ups as promised. I found different fragments in different places across the Web. With the best method known among all beginners, namely copy and paste, I re-ran what I found. Then modified and re-ran it again. And again, and again. I felt the need, the need for speed! But I failed to find my tail wind.
This post may demonstrate my lack of understanding of what is going on, or reveal the blurred picture standing behind it: a magic that works only if you know all the tricks. I hope that at least you enjoyed the show!
8 comments
Great post Pawel. I really like your writing style.
I appreciate your effort in testing the libraries and thank you for sharing the results.
I am a developer of NumbaPro and I wrote most of the examples that you referenced. I wanted to point out that the matrix-multiplication kernel was meant for demonstration. It is not fast. You should use the GEMM method from cuBLAS for performance-critical applications. See the example at https://gist.github.com/sklam/dc8fce2cf37b82f66a48#file-randomportfolios-py-L92. On the same MacBook Pro model that you have, the cuBLAS GEMM can be 2x faster than the MKL-accelerated NumPy dot:
(1024 x 1024) x (1024 x 1024)
CUDA without shared memory: 0.765 s
CUDA with shared memory : 0.251 s
CUDA BLAS GEMM : 0.010 s
NumPy dot product : 0.020 s
In your batch matrix-multiplication code, a small enhancement can improve the speed a lot. Instead of generating the random numbers on the CPU and copying them to the GPU, generate them with cuRAND on the GPU: https://gist.github.com/sklam/dc8fce2cf37b82f66a48#file-batch_matrix_mult-py-L56. The new script can achieve a 1.3x-2.5x speed-up. (The variation in speed is probably due to the dynamic GPU clock rate.)
I also wanted to point out that the GT 650M GPU is not powerful. On a Linux system with a better GPU (K20c), I get the following:
$ python batch_matrix_mult.py
The average CUDA speed-up: 9.61722x
$ python randomportfolios.py
(1024 x 1024) x (1024 x 1024)
CUDA without shared memory: 0.078 s
CUDA with shared memory : 0.028 s
CUDA BLAS GEMM : 0.001 s
NumPy dot product : 0.025 s
I understand that NumbaPro still requires too much CUDA knowledge to be useful. The documentation can also be improved. We are designing new features to reduce programmers’ effort in getting optimal results. If you have difficulty speeding up code, please feel free to post questions on the Numba mailing list.
Siu, many thanks! This is the answer and clarification I’ve been waiting for. Highly appreciated on behalf of all Pythonists :)
Your first example isn’t really a good one. Since NumPy operators work element-wise, you should not write a function to add up arrays element by element, because that makes it really slow. Let the built-in operators handle that:
import numpy as np
from timeit import default_timer as timer

def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)
    start = timer()
    C = A + B
    totaltime = timer() - start
    print("\nCPU time: %g sec" % totaltime)

if __name__ == '__main__':
    main()
On my machine, this gives the following result;
CPU time: 0.143835 sec
Your original program took about 18 seconds on my machine…
Hi Roland. You’re correct, but I wrote this paragraph the way the company (CA) promotes its GPU-accelerated version across the Web. I didn’t write that block of code. I quoted them, and that’s why the title is “Selling the Speed”, with the emphasis on selling in the text :-) Cheers!
Makes you wonder why they would use such a contrived example?
Another avenue that you could pursue is using multiprocessing, to make proper use of multiple cores if you don’t have access to the MKL libraries (a minimal sketch of that route follows below).
I think most people with a cursory knowledge of Python would understand that the first example will be very slow. That seems to be precisely why he did the third example…
Nevertheless, I can also think of a number of cases where you can’t use parallel processing to speed up loops. For instance, MCMC, or where you need to optimize a portfolio on several dates incorporating turnover. Neither of those cases allows the loops to be independent of each other.
Drawing an efficient frontier might be a better example though.
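Picking up the multiprocessing suggestion above, a minimal sketch of that route would price independent chunks of options in separate worker processes; the simplified bs_call below (using SciPy's erf) and the chunking are illustrative assumptions, not the code benchmarked in the post:

import numpy as np
from multiprocessing import Pool, cpu_count
from scipy.special import erf

def bs_call(args):
    # vectorised Black-Scholes call price for one chunk of options
    S, K, T, r, v = args
    cnd = lambda d: 0.5 * (1.0 + erf(d / np.sqrt(2.0)))
    d1 = (np.log(S / K) + (r + 0.5 * v * v) * T) / (v * np.sqrt(T))
    d2 = d1 - v * np.sqrt(T)
    return S * cnd(d1) - K * np.exp(-r * T) * cnd(d2)

if __name__ == '__main__':
    N = 4000000
    S = np.random.uniform(5.0, 30.0, N)
    K = np.random.uniform(1.0, 100.0, N)
    T = np.random.uniform(0.25, 10.0, N)

    # split the 4 million options into one chunk per CPU core
    ncpu = cpu_count()
    chunks = [(s, k, t, 0.02, 0.30) for s, k, t in
              zip(np.array_split(S, ncpu),
                  np.array_split(K, ncpu),
                  np.array_split(T, ncpu))]

    pool = Pool(ncpu)
    prices = np.concatenate(pool.map(bs_call, chunks))
    pool.close()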
Good post.