Web appendix for numerical language comparison 2020
Alvaro Aguirre
Jon Danielsson
20 August 2020
Updated 26 August 2020
We compared four numerical languages, Julia, R, Python and Matlab, in a Vox blog post.
The Execution Speed section of the blog post includes three experiments to evaluate the speed of the four languages. The versions of the languages used are Julia 1.5.0, R 4.0.2, Python 3.7.6 and Matlab R2020a. All code ran on a late-model MacBook Pro with an Intel i9 and 64 GB of RAM.
- GARCH log-likelihood
- Loading large dataset
- Calculating annual mean and standard deviation by year and PERMNO
- Julia improvements
- Python matrix multiplication
- Conclusions
Experiment 1: GARCH log-likelihood
The first experiment involved comparing the running times for the calculation of GARCH log-likelihood. This is an iterative and dynamic process that captures a large class of numerical problems encountered in practice. Because the values at any given time t depend on values at t-1, this loop is not vectorisable. The GARCH(1,1) specification is
$$\sigma_t^2= \omega + \alpha y^2_{t-1}+ \beta \sigma_{t-1}^2$$
with its log-likelihood given by
$$\ell=-\frac{T-1}{2} \log(2\pi)-\frac{1}{2}\sum_{t=2}^T \left[ \log \left(\omega+\alpha y_{t-1}^2+\beta\sigma_{t-1}^2\right)+\frac{y_t^2}{\omega+\alpha y_{t-1}^2+\beta\sigma_{t-1}^2} \right]$$
For the purposes of our comparison, we are only concerned with the iterative calculation involved in the second term. Hence, the constant term on the left is ignored.
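Concretely, the quantity accumulated in the code below is the sum in the second term (the factor of $-\frac{1}{2}$ is also dropped), with the conditional variance updated recursively:
$$\text{lik} = \sum_{t=2}^T \left[ \log(h_t) + \frac{y_t^2}{h_t} \right], \qquad h_t = \omega + \alpha y_{t-1}^2 + \beta h_{t-1}$$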
We coded up the log-likelihood calculation in the four languages and in C, keeping the code as similar as possible between languages. We included the just-in-time compilers Numba for Python, and the C++ integration Rcpp for R. We then simulated a sample of size 10,000, loaded that in, and timed the calculation of the log-likelihood 100 times (wall time), picking the lowest in each case.
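For reference, a minimal sketch of how such a sample could be generated in Julia; the parameter values below are illustrative and not necessarily those behind the reported results:
# Simulate T observations from a GARCH(1,1) process with Gaussian innovations.
# omega, alpha and beta are illustrative values, not the ones used in the article.
function simulate_garch(T; omega = 0.01, alpha = 0.1, beta = 0.85)
    y = zeros(T)
    h = omega / (1 - alpha - beta)   # start at the unconditional variance
    for t in 2:T
        h = omega + alpha * y[t-1]^2 + beta * h
        y[t] = sqrt(h) * randn()     # scale a standard normal draw by the conditional sd
    end
    return y
end

y = simulate_garch(10_000)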
The code can be downloaded here.
C
The fastest calculation is likely to be in C (or FORTRAN), compiled with:
gcc -march=native -ffast-math -Ofast run.c
double likelihood(double o, double a, double b, double h, double *y2, int N){
  double lik = 0;
  for (int j = 1; j < N; j++){
    h = o + a*y2[j-1] + b*h;
    lik += log(h) + y2[j]/h;
  }
  return(lik);
}
R
R can be used on its own, which is likely to be slow:
likelihood = function(o, a, b, y2, h, N){
  lik = 0
  for(i in 2:N){
    h = o + a*y2[i-1] + b*h
    lik = lik + log(h) + y2[i]/h
  }
  return(lik)
}
but using Rcpp will make it much faster by integrating R with C++ and compiling the likelihood function:
require(Rcpp)
a = "double likelihood(double o, double a, double b, double h, NumericVector y2, int N){
double lik=0;
for (int j=1;j<N;j++){
h = o+a*y2[j-1]+b*h;
lik += log(h)+y2[j]/h;
}
return(lik);
}"
cppFunction(a)
Python
Python is likely to be quite slow,
import numpy as np

def likelihood(hh, o, a, b, y2, N):
    lik = 0.0
    h = hh
    for i in range(1, N):
        h = o + a*y2[i-1] + b*h
        lik += np.log(h) + y2[i]/h
    return lik
but will be significantly sped up by using Numba.
import numpy as np
from numba import jit

@jit
def likelihood(hh, o, a, b, y2, N):
    lik = 0.0
    h = hh
    for i in range(1, N):
        h = o + a*y2[i-1] + b*h
        lik += np.log(h) + y2[i]/h
    return lik
MATLAB
function lik = likelihood(o, a, b, h, y2, N)
  lik = 0;
  for i = 2:N
    h = o + a*y2(i-1) + b*h;
    lik = lik + log(h) + y2(i)/h;
  end
end
Julia
Julia was designed for speed, so we expected it to be fast:
function likelihood(o, a, b, h, y2, N)
    local lik = 0.0
    for i in 2:N
        h = o + a*y2[i-1] + b*h
        lik += log(h) + y2[i]/h
    end
    return(lik)
end
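To illustrate the timing methodology described above (run the calculation 100 times and keep the fastest), here is a minimal sketch using the Julia version; the starting variance, parameter values and placeholder data are assumptions for illustration only:
using Statistics

y2 = randn(10_000) .^ 2     # placeholder squared returns; the article used simulated GARCH data
h0 = mean(y2)               # illustrative starting value for the conditional variance
times = zeros(100)
for i in 1:100
    # time one evaluation of the likelihood (wall time, in seconds)
    times[i] = @elapsed likelihood(0.01, 0.1, 0.85, h0, y2, length(y2))
end
minimum(times)              # keep the fastest of the 100 runs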
Results
All runtime results are presented relative to the fastest (C).
If we only look at the base versions of the four languages, Julia comes out on top. This is unsurprising since it is a compiled language while the other three are interpreted. Base Python is very slow for this type of calculation, almost three hundred times slower than our fastest alternative, C.
However, when we speed up R and Python using Rcpp and Numba respectively, their performance increases significantly, beating Julia. In the case of Rcpp this involved compiling a C++ function in R, whereas with Numba we only have to import the package and apply the JIT compiler without modifying our Python code, which is convenient.
Experiment 2: Loading a large dataset
Our second experiment consisted of loading a very large CSV dataset, CRSP. This file is almost 8 GB uncompressed, and over 1 GB compressed in gzip format. The code for this experiment and the next one can be downloaded here.
R
R can read both compressed and uncompressed files using the fread function from the data.table package:
require(data.table)
uncompressed <- fread("crsp_daily.csv")
compressed <- fread("crsp_daily.gz")
Python
Python's pandas is a convenient option for importing CSV files into DataFrames. We used it to read the uncompressed file, and in combination with the gzip package to read the compressed file:
import pandas as pd
import gzip
uncompressed = pd.read_csv("crsp_daily.csv")
f = gzip.open("crsp_daily.gz")
compressed = pd.read_csv(f)
MATLAB
To date, MATLAB can only read uncompressed files:
uncompressed = readtable('crsp_daily.csv');
Julia
We used Julia's CSV and GZip packages to read both types of files:
using CSV
using GZip
uncompressed() = CSV.read("crsp_daily.csv");
compressed() = GZip.open("crsp_daily.gz", "r") do io
    CSV.read(io)
end
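One caveat: the calls above use the CSV.jl API as it was at the time of writing. In more recent releases of CSV.jl, CSV.read requires an explicit sink type, so an equivalent sketch with current versions would look roughly like this:
using CSV
using DataFrames
using GZip

# Newer CSV.jl releases require a sink argument, here a DataFrame.
uncompressed() = CSV.read("crsp_daily.csv", DataFrame);
compressed() = GZip.open("crsp_daily.gz", "r") do io
    CSV.read(io, DataFrame)
end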
Results
All runtime results are presented relative to the fastest, which was R's uncompressed reading time.
R's fread function was the fastest to load both the uncompressed and compressed files. In all applicable cases, reading the uncompressed file was faster than reading the compressed one, although the difference was much larger for Julia. Matlab came out at the bottom, with an uncompressed loading time over five times slower than R, and it could not load the compressed file at all.
Experiment 3: Calculating annual mean and standard deviation by year and firm
Our last experiment involved performing group calculations on the large dataset imported in experiment 2. We wanted to evaluate the speed of computing the mean and standard deviation of returns by year and firm.
R
Using R's data.table package:
require(data.table)
R <- data[,list(length(RET), mean(RET), sd(RET)), keyby = list(y, PERMNO)]
Python
Python's pandas allows groupby operations:
import pandas as pd
R = data.groupby(['PERMNO', 'year'])['RET'].agg(['mean', 'std', 'count'])
MATLAB
MATLAB can perform group by operations on table objects using the grpstats function from the Statistics and Machine Learning Toolbox:
statarray = grpstats(data, {'PERMNO', 'year'}, {'mean', 'std'}, 'DataVars', 'RET');
Julia
Julia's DataFrames package allows for this type of computation:
using Statistics
using DataFrames
R = by(data, [:year, :PERMNO]) do data
    DataFrame(m = mean(data.RET), s = std(data.RET), c = length(data.RET))
end
Results
All runtime results are presented relative to the fastest (Julia).
Julia, being a compiled language, was the fastest of the four, closely followed by R and Python. Matlab was by far the slowest.
Julia improvements
After the article was published, Bogumił Kamiński, developer of Julia's DataFrames package, suggested a different approach for the Julia calculation, using combine() instead. Doing so required a bug fix, which was implemented in the v0.21.7 release of DataFrames on 25 August 2020.
We were able to do the calculation with:
using Statistics
using DataFrames
R = combine(groupby(data, [:year, :PERMNO]),
            :RET => mean => :m, :RET => std => :s, nrow => :c)
This almost halved the processing time, even though Julia was already the fastest language for this task, as shown in the plot below.
Python matrix multiplication
In the article we mention that if you want to multiply two matrices, X and Y, using Python, you would have to do:
import numpy as np
np.matmul(X,Y)
From Python 3.5 onwards, this can also be done with the @ operator:
import numpy as np
X @ Y
Conclusion
When comparing the four languages without compiler packages like Numba or Rcpp, Julia proved itself superior in computations such as the GARCH log-likelihood and the group-by statistics, while R's fread function was unparalleled in loading large files.
When we introduce compiler packages, the picture changes.
Python's Numba package allows for efficient just-in-time compiling with minimal additional code; implementing Numba made the calculation here over 200 times faster. However, Numba can only be used in relatively simple cases and cannot be considered a good general solution to the slow numerical speed of Python. We could have used Cython, but it is much more involved than Numba.
By coding the likelihood function in C++ and compiling it with R's Rcpp, we obtained the speed closest to C in the GARCH log-likelihood calculation. We could, of course, do the same for the other three languages.