GPU-Accelerated Finance in Python with NumbaPro Library. Really?

When in 2010 I lived in Singapore I crossed my life path with some great guys working in the field of High-Performance Computing: Łukasz Orłowski, Marek T. Michalewicz, Iain Bell from Quadrant Capital. They both pointed my attention towards GPU computations utilizing Nvidia CUDA architecture. There was one problem. Everything was wrapped up with C syntax with a promise to do it more efficiently in C++. Soon.

So I waited and studied C/C++ at least at the level allowing me to understand some CUDA codes. Years were passing by until the day when I discovered an article of Mark Harris, NumbaPro: High-Performance Python with CUDA Acceleration, delivering Python-friendly CUDA solutions to all my nightmares involving C/C++ coding. At this point you just need to understand one thing: not every quant or algo trader is a fan of C/C++ the same as some people prefer Volvo to Audi, including myself ;)




Let’s have a sincere look at what the game is about. It is more than tempting to put your hands on a piece of code that allows you to speed-up some quantitative computations in Python making use of a new library.

Accelerate your Python

I absolutely love Continuum Analytics for the mission they stand for: making Python language easily accessible and used by everyone worldwide! It is a great language with a great syntax, easy to pick up, easy to be utilised in the learning process of the fundamentals of programming. Thanks to them, now you can download and install Python’s distribution of Anaconda for Windows, Mac OS X, or Linux just in few minutes.

When you visit their webpage you can spot Anaconda’s Add-Ons, three additional software packages to their Python distribution. Among them, they offer Accelerate module containing NumbaPro library. Once you read the description, and I quote,

Accelerate is an add-on to Continuum’s free enterprise Python distribution, Anaconda. It opens up the full capabilities of your GPU or multi-core processor to Python. Accelerate includes two packages that can be added to your Python installation: NumbaPro and MKL Optimizations. MKL Optimizations makes linear algebra, random number generation, Fourier transforms, and many other operations run faster and in parallel. NumbaPro builds fast GPU and multi-core machine code from easy-to-read Python and NumPy code with a Python-to-GPU compiler.

NumbaPro Features
– NumbaPro compiler targets multi-core CPU and GPUs directly from
    simple Python syntax
– Easily move vectorized NumPy functions to the GPU
– Multiple CUDA device support
– Bindings for CUDA libraries, including cuBlas, cuRand, cuSparse, and cuFFT
– Support for array slicing and fast array math
– Use multiple threads without worrying about the GIL
– Supported on NVIDIA CUDA-enabled GPUs with compute capability 2.0
    or above on Intel/AMD (x86) processors.

your blood pressure increases and the level of endorphins skyrockets. Why? Simply because of the promise to do some tasks faster utilising GPU in a parallel mode! If you are new to GPU or CUDA I recommend you to read some well written posts on Mike’s website, for instance, Installing Nvidia CUDA on Mac OSX for GPU-based Parallel Computing or Monte Carlo Simulations in CUDA – Barrier Option Pricing. You will grasp the essence of what is all about. In general, much ado about CUDA is still around making use of your GPU and proving this extra upmhhh in speedup. If you have any quantitative problem in mind and it can be executed in the parallel mode, NumbaPro is a tool you need to look at but – not every engine sounds the same. Hold on till the end of this post. It will be worth it.

Selling the Speed

When you approach a new concept or a new product and someone tries to sell it to you, he needs to impress you to win your attention and boost your curiosity. Imagine for a moment that you have no idea about GPU or CUDA and you want to add two vectors. In Python you can do it as follows:

import numpy as np
from timeit import default_timer as timer

def VectorAdd(a,b,c):
    for i in xrange(a.size):
        c[i]=a[i]+b[i]

def main():

    N=32000000

    A=np.ones(N, dtype=np.float32)
    B=np.ones(N, dtype=np.float32)
    C=np.zeros(N, dtype=np.float32)

    start=timer()
    VectorAdd(A,B,C)
    totaltime=timer()-start

    print("\nCPU time: %g sec" % totaltime)

if __name__ == '__main__':
    main()

We all know that once you run the code, Python does not compile it, it goes line by line and interprets what it reads. So, we aim at adding two vectors, $A$ and $B$, containing 32 millions of elements. I get $C=A+B$ matrix on my MacBook Pro (2.6 GHz Intel Core i7, 16 GB 1600 MHz DDR3 RAM, NVIDIA GeForce GT 650M 1GB) after:

CPU time: 9.89753 sec

Can we do it better? With NumbaPro the required changes to the code itself are minor. All we need to add is a function decorator that tells how and where the function should be executed. In fact, what NumbaPro does is that it “compiles” VectorAdd function on-the-go and deploys computations to the GPU unit:

import numpy as np
from timeit import default_timer as timer
from numbapro import vectorize

@vectorize(["float32(float32,float32)"], target="gpu")
def VectorAdd(a,b):
    return a+b

def main():

    N=32000000

    A=np.ones(N, dtype=np.float32)
    B=np.ones(N, dtype=np.float32)
    C=np.zeros(N, dtype=np.float32)

    start=timer()
    C=VectorAdd(A,B)
    totaltime=timer()-start

    print("\nGPU time: %g sec" % totaltime)

if __name__ == '__main__':
    main()

We get

GPU time: 0.286101 sec

i.e. 34.6x speed-up. Not bad, right?! Not bad if you’re a sale person indeed! But, hey, what’s that?:

import numpy as np
from timeit import default_timer as timer

def main():

    N=32000000

    A=np.ones(N, dtype=np.float32)
    B=np.ones(N, dtype=np.float32)
    C=np.zeros(N, dtype=np.float32)

    start=timer()
    C=A+B
    totaltime=timer()-start

    print("\nCPU time: %g sec" % totaltime)

if __name__ == '__main__':
    main()

Run it to discover that:

CPU time: 0.0592878 sec

i.e. 4.82x faster than using GPU. Oh, boy! CUDA:NumPy (0:1).

Perfect Pitch

When Nvidia introduced CUDA among some exemplary C codes utilising CUDA programming we could find an immortal Black-Scholes model for option pricing. In this Nobel-prize winning solution, we derive a call option price for non-dividend-paying underlying stock:
$$
C(S,t) = N(d_1)S – N(d_2)Ke^{-r(T-t)}
$$
where $(T-t)$ is the time to maturity (scalar), $r$ is the risk free rate (scalar), $S$ is the spot price of the underlying asset, $K$ is the strike price, and $\sigma$ is the volatility of returns of the underlying asset. $N(\dot)$ is the cumulative distribution function (cnd) of the standard normal distribution and has an analytical form. A classical way to code it in Python is:

import numpy as np
import time

RISKFREE = 0.02
VOLATILITY = 0.30

def cnd(d):
    A1 = 0.31938153
    A2 = -0.356563782
    A3 = 1.781477937
    A4 = -1.821255978
    A5 = 1.330274429
    RSQRT2PI = 0.39894228040143267793994605993438
    K = 1.0 / (1.0 + 0.2316419 * np.abs(d))
    ret_val = (RSQRT2PI * np.exp(-0.5 * d * d) *
               (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
    return np.where(d > 0, 1.0 - ret_val, ret_val)

def black_scholes(callResult, putResult, stockPrice, optionStrike, optionYears,
                  Riskfree, Volatility):
    S = stockPrice
    X = optionStrike
    T = optionYears
    R = Riskfree
    V = Volatility
    sqrtT = np.sqrt(T)
    d1 = (np.log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT)
    d2 = d1 - V * sqrtT
    cndd1 = cnd(d1)
    cndd2 = cnd(d2)

    expRT = np.exp(- R * T)
    callResult[:] = (S * cndd1 - X * expRT * cndd2)

def randfloat(rand_var, low, high):
    return (1.0 - rand_var) * low + rand_var * high

def main (*args):
    OPT_N = 4000000
    iterations = 10
    if len(args) >= 2:
        iterations = int(args[0])
    
    callResult = np.zeros(OPT_N)
    stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0)
    optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0)
    optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0)

    time0 = time.time()
    for i in range(iterations):
        black_scholes(callResult, putResult, stockPrice, optionStrike,
                      optionYears, RISKFREE, VOLATILITY)
    time1 = time.time()
    print("Time: %f msec per option" % ((time1-time0)/iterations/OPT_N*1000))

if __name__ == "__main__":
    import sys
    main(*sys.argv[1:])

what returns

Time: 0.000192 msec per option

The essence of this code is to derive 4 million independent results based on feeding the function with random stock prices, option strike prices, and times to maturity. They enter the game under cover as row vectors with randomised values (see lines #45-47). Anaconda Accelerate’s CUDA solution for the same code is:

import numpy as np
import math
import time
from numba import *
from numbapro import cuda
from blackscholes import black_scholes # save the previous code as
                                       # black_scholes.py

RISKFREE = 0.02
VOLATILITY = 0.30

A1 = 0.31938153
A2 = -0.356563782
A3 = 1.781477937
A4 = -1.821255978
A5 = 1.330274429
RSQRT2PI = 0.39894228040143267793994605993438

@cuda.jit(argtypes=(double,), restype=double, device=True, inline=True)
def cnd_cuda(d):
    K = 1.0 / (1.0 + 0.2316419 * math.fabs(d))
    ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) *
               (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))))
    if d > 0:
        ret_val = 1.0 - ret_val
    return ret_val

@cuda.jit(argtypes=(double[:], double[:], double[:], double[:], double[:],
                    double, double))
def black_scholes_cuda(callResult, putResult, S, X,
                       T, R, V):
#    S = stockPrice
#    X = optionStrike
#    T = optionYears
#    R = Riskfree
#    V = Volatility
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i >= S.shape[0]:
        return
    sqrtT = math.sqrt(T[i])
    d1 = (math.log(S[i] / X[i]) + (R + 0.5 * V * V) * T[i]) / (V * sqrtT)
    d2 = d1 - V * sqrtT
    cndd1 = cnd_cuda(d1)
    cndd2 = cnd_cuda(d2)

    expRT = math.exp((-1. * R) * T[i])
    callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2)

def randfloat(rand_var, low, high):
    return (1.0 - rand_var) * low + rand_var * high

def main (*args):
    OPT_N = 4000000
    iterations = 10

    callResultNumpy = np.zeros(OPT_N)
    putResultNumpy = -np.ones(OPT_N)
    stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0)
    optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0)
    optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0)
    callResultNumba = np.zeros(OPT_N)
    putResultNumba = -np.ones(OPT_N)
    callResultNumbapro = np.zeros(OPT_N)
    putResultNumbapro = -np.ones(OPT_N)

# Numpy ----------------------------------------------------------------
    time0 = time.time()
    for i in range(iterations):
        black_scholes(callResultNumpy, putResultNumpy, stockPrice,
                      optionStrike, optionYears, RISKFREE, VOLATILITY)
    time1 = time.time()
    dtnumpy = ((1000 * (time1 - time0)) / iterations)/OPT_N
    print("\nNumpy Time            %f msec per option") % (dtnumpy)

# CUDA -----------------------------------------------------------------
    time0 = time.time()
    blockdim = 1024, 1
    griddim = int(math.ceil(float(OPT_N)/blockdim[0])), 1
    stream = cuda.stream()
    d_callResult = cuda.to_device(callResultNumbapro, stream)
    d_putResult = cuda.to_device(putResultNumbapro, stream)
    d_stockPrice = cuda.to_device(stockPrice, stream)
    d_optionStrike = cuda.to_device(optionStrike, stream)
    d_optionYears = cuda.to_device(optionYears, stream)
    
    time2 = time.time()
    
    for i in range(iterations):
        black_scholes_cuda[griddim, blockdim, stream](
            d_callResult, d_putResult, d_stockPrice, d_optionStrike,
            d_optionYears, RISKFREE, VOLATILITY)
        d_callResult.to_host(stream)
        d_putResult.to_host(stream)
        stream.synchronize()
        
    time3 = time.time()
    dtcuda = ((1000 * (time3 - time2)) / iterations)/OPT_N

    print("Numbapro CUDA Time    %f msec per option (speed-up %.1fx)
             \n") % (dtcuda, dtnumpy/dtcuda)
#   print(callResultNumbapro)


if __name__ == "__main__":
    import sys
    main(*sys.argv[1:])

returning

Numpy Time            0.000186 msec per option
Numbapro CUDA Time    0.000024 msec per option (speed-up 7.7x) 

In order to understand why CUDA wins over NumPy this time is not so difficult. First we have a programmable analytical form of the problem. We deploy it to GPU and perform exhaustive calculations involving cnd_cuda function for the estimation of the cumulative distribution function of the standard normal distribution. Splitting the task into many concurrently running threats on GPU reduces the time. Again, it’s possible because all option prices can be computed independently. CUDA:NumPy (1:1).

Multiplied Promises

In finance, the concept of portfolio optimization is well established. The idea standing behind is to find such a vector of weights, $w$, for all assets that the derived estimated portfolio risk ($\sigma_P$) and return ($\mu_P$) meets our needs or expectations.

An alternative (but not greatly recommended) approach would involve optimization through randomisation of $w$ vectors. We could generate a big number of them, say $N$, in order to obtain:
$$
\mu_{P,i} = m w_i^T \ \ \ \mbox{and} \ \ \ \sigma_{P,i} = w_iM_2w_i^T
$$ for $i=1,…,N$. Here, $m$ is a row-vector holding estimated expected returns for all assets in portfolio $P$ and based on return-series we end up with $M_2$ covariance matrix $(MxM$ where $M$ is a number of assets in $P$). In the first case, we aim at the multiplication of row-vector with a transposed row-vector of weight whereas for the latter we perform the multiplication of the row-vector with the square matrix (as the first operation). If $N$ is really big, say a couple of millions, the computation could be somehow accelerated using GPU. Therefore, we need a code for matrix multiplication on GPU in Python.

Let’s consider a more advanced concept of ($K\times K$)$\times$($K\times K$) matrix multiplication. If that works faster, than our random portfolios problem should be even faster. Continuum Analytics provides with a ready-to-use solution:

import numpy as np
from numbapro import cuda
import numba
from timeit import default_timer as timer
from numba import float32

bpg = 32
tpb = 32

n = bpg * tpb

shared_mem_size = (tpb, tpb)
griddim = bpg, bpg
blockdim = tpb, tpb

@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def naive_matrix_mult(A, B, C):
    x, y = cuda.grid(2)
    if x >= n or y >= n:
        return

    C[y, x] = 0
    for i in range(n):
        C[y, x] += A[y, i] * B[i, x]


@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def optimized_matrix_mult(A, B, C):

    # Declare shared memory
    sA = cuda.shared.array(shape=shared_mem_size, dtype=float32)
    sB = cuda.shared.array(shape=shared_mem_size, dtype=float32)
    
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    x, y = cuda.grid(2)

    acc = 0
    for i in range(bpg):
        if x < n and y < n:
            # Prefill cache
            sA[ty, tx] = A[y, tx + i * tpb]
            sB[ty, tx] = B[ty + i * tpb, x]

        # Synchronize all threads in the block
        cuda.syncthreads()

        if x < n and y < n:
            # Compute product
            for j in range(tpb):
                acc += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish the computation
        cuda.syncthreads()

    if x < n and y < n:
        C[y, x] = acc
        
        
# Prepare data on the CPU
A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, n)), dtype=np.float32)

print "(%d x %d) x (%d x %d)" % (n, n, n, n)

# Prepare data on the GPU
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)

# Time the unoptimized version
s = timer()
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
unopt_ans = dC.copy_to_host()
tcuda_unopt = e - s

# Time the optimized version
s = timer()
optimized_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
opt_ans = dC.copy_to_host()
tcuda_opt = e - s

assert np.allclose(unopt_ans, opt_ans)
print "CUDA without shared memory:", "%.2f" % tcuda_unopt, "s"
print "CUDA with shared memory   :", "%.2f" % tcuda_opt, "s"

s = timer()
np.dot(A,B)
e = timer()
npt=e-s
print "NumPy dot product         :", "%.2f" % npt, "s"

what returns

(1024 x 1024) x (1024 x 1024)
CUDA without shared memory: 0.76 s
CUDA with shared memory   : 0.25 s
NumPy dot product         : 0.06 s

and leads to CUDA:NumPy (1:2) score of the game. The natural questions arise. Is it about the matrix size? Maybe it is too simple problem we try to solve it with an improper tool? Or the way how we approach matrix allocation and deployment to GPU itself?

The last question made me digging deeper. In Python you can create one big matrix holding a number of smaller matrices. The following code tries to perform 4 million $(2\times 2)$ matrix multiplications where matrix $B$ is randomised every single time (see our random portfolio problem). In Anaconda Accelerate we achieve it as follows:

import numbapro
import numba.cuda
import numpy as np
from timeit import default_timer as timer
# Use the builtin matrix_multiply in NumPy for CPU test
import numpy.core.umath_tests as ut


@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m, n),(n, p)->(m, p)', target='gpu')
def batch_matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                 tmp += a[i, n] * b[n, j]
            c[i, j] = tmp


def main():

    n   = 4000000
    dim = 2

    sK=0
    KN=10

    for K in range(KN):

        # Matrix Multiplication:   c = a x b

        a = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
        c = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)

        # NUMPY -------------------------------------------------------------
        start = timer()
        b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
        d=ut.matrix_multiply(a, b)
        np_time=timer()-start

        # CUDA --------------------------------------------------------------
        dc = numba.cuda.device_array_like(c)
        da = numba.cuda.to_device(a)

        start = timer()
        b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
        db = numba.cuda.to_device(b)
        batch_matrix_mult(da, db, out=dc)
        numba.cuda.synchronize()
        dc.copy_to_host(c)
        cuda_time=timer()-start

        sK += np_time/cuda_time

        del da, db

    print("\nThe average CUDA speed-up: %.5fx") % (sK/KN)


if __name__ == '__main__':
    main()

leading us to

The average CUDA speed-up: 0.79003x

i.e. deceleration. Playing with the sizes of matrices and their number may result in error caused by GPU memory required for matrix $A$ and $B$ allocation on GPU. It seems we have CUDA:NumPy (1:3).

Black Magic in Black Box

I approached NumbaPro solution as a complete rookie. I spent a considerable amount of time searching for Anaconda Accelerate’s GPU codes demonstrating massive speed-ups as promised. I found different fragments in different places across the Web. With the best method known among all beginners, namely, copy and paste, I re-ran what I found. Then modified and re-ran again. And again, and again. I felt the need, the need for speed! But failed finding my tail wind.

This post may demonstrate my lack of understanding of what is going on or reveal a blurred picture standing behind: a magic that works if you know all the tricks. I hope that at least you enjoyed the show!




8 comments
  1. Great post Pawel. I really like your writing style.
    Appreciate your effort of testing the libraries and thank you for shearing the results.

  2. I am a developer of NumbaPro and I wrote most of the examples that you referenced. I wanted to point out that the matrix-multiplication kernel was meant for demonstration. It is not fast. You should use the GEMM method from cuBLAS for performance critical application. See example https://gist.github.com/sklam/dc8fce2cf37b82f66a48#file-randomportfolios-py-L92. On a same Macbook Pro model that you have, the cuBLAS GEMM can be 2x faster than MKL-accelerated NumPy dot:

    (1024 x 1024) x (1024 x 1024)
    CUDA without shared memory: 0.765 s
    CUDA with shared memory : 0.251 s
    CUDA BLAS GEMM : 0.010 s
    NumPy dot product : 0.020 s

    In your batch matrix-multiplication code, a small enhancement can improve the speed a lot. Instead of generating the random number on the CPU and copy it to the GPU, generate it with cuRAND on the GPU https://gist.github.com/sklam/dc8fce2cf37b82f66a48#file-batch_matrix_mult-py-L56. The new script can achieve 1.3x-2.5x speedup. (The variation in speed is probably due to dynamic GPU clockrate.)

    I also wanted to point out that the GT 650M GPU is not powerful. On a Linux system with a better GPU (K20c) . I get the following:

    $ python batch_matrix_mult.py

    The average CUDA speed-up: 9.61722x

    $ python randomportfolios.py

    (1024 x 1024) x (1024 x 1024)
    CUDA without shared memory: 0.078 s
    CUDA with shared memory : 0.028 s
    CUDA BLAS GEMM : 0.001 s
    NumPy dot product : 0.025 s

    I understand that NumbaPro still requires too much CUDA knowledge to be useful. The documentation can also be improved. We are designing new features to reduce programmers’ effort in getting optimal results. If you have difficulty speeding-up code, please feel free to post questions on the Numba mailing list.

    1. Siu, many thanks! This is the answer and clarification I’ve been waiting for. Highly appreciated on behalf of all Pythonists :)

  3. Your first example isn’t really a good one. Since numpy operators work by element, you should not write a function to add up arrays, because that makes it really slow. Let the built-in operators handle that;


    import numpy as np
    from timeit import default_timer as timer

    def main():

    N=32000000

    A=np.ones(N, dtype=np.float32)
    B=np.ones(N, dtype=np.float32)
    C=np.zeros(N, dtype=np.float32)

    start=timer()
    C = A + B
    totaltime=timer()-start

    print("nCPU time: %g sec" % totaltime)

    if __name__ == '__main__':
    main()

    On my machine, this gives the following result;

    CPU time: 0.143835 sec

    Your original program took about 18 seconds on my machine…

    1. Hi Roland. You’re correct but I wrote this paragraph the way how the company (CA) promotes its GPU accelerated version across the Web. I haven’t written that block of code. I quoted them and that’s why the title is “Selling the Speed” with the emphasis on selling in the text :-) Cheers!

      1. Makes you wonder why they would use such a contrived example?

        Another avenue that you could pursue is using multiprocessing, to make proper use of multiple cores if you don’t have access to the MKL libraries.

    2. I think most people with a cursory knowledge of python would understand that the first example will be very slow. That seems to be precisely why he did the third example…

      Nevertheless, I can also think of a number of cases where you can’t use parallel processing to speed up loops. For instance, MCMC or where you need to optimize a portfolio on several dates incorporating turnover. Neither of those cases allow the loops to be independent of each other.

      Drawing an efficient frontier might be a better example though.

Leave a Reply

Your email address will not be published. Required fields are marked *