Web appendix for the Python, R, Julia and Matlab language comparison in 2022 on Jon Danielsson's ModelsandRisk.org

Web Appendix for numerical language comparison 2022

Web appendix for numerical language comparison 2022

Jon Danielsson
Yuyang Lin

28 July 2022

We compare four numerical languages, Julia, R, Python and MATLAB, in a Vox blog post.

The Computation Speed section of the blog post includes three experiments to evaluate the speed of the four languages. The versions of the languages used are Julia 1.7.3, R 4.2.0, Python 3.9.12 and MATLAB R2022a.

Post publication comment

See discussion on the Julia discourse and the ycombinator pages

After we published the article we received several comments, on these the two pages above and elsewhere, some arguing we could have optimised the code better, (and implicitly by not doing so disadvantaging one of the languages.) We disagree. The code below is written from the point of view of a typical user, like us, who sees a mathematical function and codes it up in the most obvious way. It is possible to make the code faster, as our postscript shows for Julia and quite possibly for the others. But doing so requires either specialist knowledge or considerable research. While that is interesting, it is outside of the scope of this appendix. Furthermore, it will make the code much more complex (as the example below shows).

Donald Knuth in Structured Programming With GoTo Statements, writes "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Code

All the code can be downloaded here.

Experiment 1: GARCH log-likelihood

The first experiment involved comparing the running times for the calculation of GARCH log-likelihood. This is an iterative and dynamic process that captures a large class of numerical problems encountered in practice. Because the values at any given time t depend on values at t-1, this loop is not vectorisable. The GARCH(1,1) specification is

$$\sigma_t^2= \omega + \alpha y^2_{t-1}+ \beta \sigma_{t-1}^2$$

with its log-likelihood given by

$$\ell=-\frac{T-1}{2} \log(2\pi)-\frac{1}{2}\sum_{t=2}^T \left[ \log \left(\omega+\alpha y_{t-1}^2+\beta\sigma_{t-1}^2\right)+\frac{y_t^2}{\omega+\alpha y_{t-1}^2+\beta\sigma_{t-1}^2} \right]$$

For the purposes of our comparison, we are only concerned with the iterative calculation involved for the second term. Hence, the constant term on the left is ignored.

We coded up the log-likelihood calculation in the four languages and in C, keeping the code as similar as possible between languages. We included the just-in-time compilers Numba for Python. We then simulated a sample of size 10,000, loaded that in, and timed the calculation of the log-likelihood 100 times (wall time), picking the lowest in each case.

C

The fastest calculation is likely to be in C (or FORTRAN)

double likelihood(double o, double a, double b, double h, double *y2, int N){	
    double lik=0;
	for (int j=1;j<N;j++){
		h = o+a*y2[j-1]+b*h;	
		lik += log(h)+y2[j]/h;
	}
    return(lik);   
}

R

R can be used independently, which is likely to be slow

likelihood =function(o,a,b,y2,h,N){
	lik=0
	for(i in 2:N){
		h = o + a * y2[i-1]+ b * h
		lik = lik + log(h) + y2[i]/hs
	}
	return(lik)
}

Python

Python is likely to be quite slow,

def likelihood(hh,o,a,b,y2,N):
    lik = 0.0
    h = hh
    for i in range(1,N):
        h=o+a*y2[i-1]+b*h
        lik += np.log(h) + y2[i]/h
    return(lik)

but will be significantly sped up by using Numba.

from numba import jit
@jit
def likelihood(hh,o,a,b,y2,N):
    lik = 0.0
    h = hh
    for i in range(1,N):
        h=o+a*y2[i-1]+b*h
        lik += np.log(h) + y2[i]/h
    return(lik)

MATLAB

function lik = likelihood(o,a,b,h,y2,N)
    lik=0;
    for i = 2:N
        h=o+a*y2(i-1)+b*h;
        lik=lik+log(h)+y2(i)/h;
    end
end

Julia

Julia was designed for speed so we expected it to be fast:

function likelihood(o, a, b, h, y2, N)
    local lik = 0.0
    for i in 2:N
        h = o+a*y2[i-1]+b*h
        lik += log(h)+y2[i]/h
    end
    return(lik)
end

but will be slightly sped up by using @inbound.

function likelihood(o, a, b, h, y2, N)
    local lik = 0.0
   for i in 2:N
    @inbounds h = o+a*y2[i-1]+b*h
     lik += log(h)+y2[i]/h
    end
    return(lik)
end

Results

All runtime results are presented relative to the fastest (C).

If we only look at the base version of the four languages, Julia comes out at the top. This is unsurprising since it is a compiled language while the other three are interpreted. The speed of base Python for this type of calculation is very slow, being two hundred times slower than our fastest alternative, C.

However, when we speed up Python using Numba, it performances increase significantly, beating Julia. In the case of Numba, we only have to import the package and use the JIT compiler without modifying our Python code, which is convenient.

Post publication p.s.

After our article came out, Jorge Vieyra suggested here some ways to speed up the Julia calculation. It makes use of the @turbo macro, which exploits SIMD.

While we think this proposal is interesting, it is not a fair comparison with the calculations above because they depend on specialist knowledge, and for a fair comparison we would need to provide the same SIMD benefit in the other languages.

Note how @turbo requires rewriting the likelihood function by summing the likelihood separately, which allows the SIMD vectorisation to work.

function likelihood(o, a, b, v, y2)
    N = length(y2)
    h = similar(y2)
    lik = 0.0
    h[begin] = v
    @inbounds for i in 2:N
        h[i] = o + a * y2[i-1] + b * h[i-1]
    end
    @turbo for i in 2:N
        lik += log(h[i]) + y2[i] / h[i]
    end
    return lik
end

Using @turbo speeds this Julia code up to 2.6 times.

Experiment 2: Loading a large database

Our second experiment consisted in loading a very large CSV data set, CRSP. This file is almost 3GB uncompressed, and over 600MB compressed in gzip format.

R

R can read both compressed and uncompressed file using the fread function from the data.table package: {% highlight r %}
require(data.table)

uncompressed <- fread("crsp_daily.csv")

compressed <- fread("crsp_daily.gz")


### Python
Python's _pandas_ is a convenient option to easily import csv files into DataFrames. We used it to read the uncompressed file, and in combination with the gzip package to read the compressed file:
```python 
import pandas as pd
import gzip

uncompressed = pd.read_csv("crsp_daily.csv")

f = gzip.open("crsp_daily.gz")
compressed = pd.read_csv(f)

MATLAB

To date, MATLAB can only read uncompressed files:

uncompressed = readtable('crsp_daily.csv');

Julia

We used Julia's CSV and CodecZlib packages to read both type of files:

using CSV
using CodecZlib

uncompressed = CSV.read("crsp_daily.csv", DataFrame);

compressed =  CSV.read(GzipDecompressorStream(open("crsp_daily.gz")), DataFrame)

Results

All runtime results are presented relative to the fastest, which was R's uncompressed reading time.

R's fread function was the fastest to load both the uncompressed and compressed file. In all applicable cases, reading the uncompressed file was faster than reading the compressed file. MATLAB came out at the bottom, with an uncompressed loading time over five times slower than R, and the inability to load the compressed file.

Experiment 3: Calculating annual mean and standard deviation by year and firm

Our last experiment involved performing group calculations on the large dataset imported in experiment 2. We wanted to evaluate the speed in computing the annual mean and standard deviation by year and firm. To compute the annual value, we first need to create a new line for year, this is not included in the computation.

R

Using R's data.table package:

library(data.table)

R <- data[,list(length(RET), mean(RET), sd(RET)), keyby = list(year, PERMNO)]

Python

Python's pandas allows groupby operations:

import pandas as pd

data.groupby(['PERMNO', 'year'])['RET'].agg(['mean', 'std', 'count'])

MATLAB

MATLAB can perform group by operations on table objects using the grpstats function from the Statistics and Machine Learning Toolbox:


statarray = grpstats(data, {'PERMNO', 'year'}, {'mean', 'std'}, 'DataVars', 'RET');

Julia

Julia's DataFrames package allows for this type of computations:

using Statistics
using DataFrames

R = combine(groupby(data, [:year, :PERMNO]),
   :RET => mean => :m, :RET => std => :s, nrow => :c)

Results

All runtime results are presented relative to the fastest (Julia).

Julia proved its superiority in speed due to being a compiled language and was the fastest of the four. It was closely followed by Python and R. MATLAB was by far the worst.

Conclusion

When comparing the four languages without compiler packages like Numba, Julia proved itself superior in computations, like the GARCH log-likelihood and the group by statistics, while R's fread function was unparalleled in loading large files.

When we introduce compiler packages, the picture changes.

Python's Numba package allows for efficient just-in-time compiling with minimal additional code — implementing Numba led to the calculation here being over 200 times faster. However, Numba can only be used in relatively simple cases and can not be considered a good general solution to the slow numerical speed of Python. We could have used Cython, but it is much more involved than Numba.