Reputation: 277
When I execute this in the REPL of Julia v0.3.7 on my 64-bit Windows 8.1 PC with 8 logical CPU cores:
blas_set_num_threads(CPU_CORES)
const v=ones(Float64,100000)
@time for k=1:1000000;s=dot(v,v);end
I observe in the CPU meter of Task Manager or Process Explorer that only 12.5% of the CPU (1 logical CPU core) is being used. I observe the same thing with Julia v0.3.5, on both Windows 7 and Windows 8.1, and also when starting Julia with "julia -p 8" on the command line. Returning to the Julia REPL without the "-p 8" command-line option, I tried this test:
blas_set_num_threads(CPU_CORES)
@time peakflops(10000)
The CPU meter shows 100% CPU usage in this case.
Since dot() and peakflops() both use BLAS (OpenBLAS in my case), I expect both to use the number of threads specified by blas_set_num_threads(). However, only the latter function actually does. Is the behavior of dot() due to a bug, maybe in OpenBLAS?
I tried to work around Julia's shortcoming by using the matrix multiply function. However, I am iterating dot() operations over subvectors of GB-size 2D arrays, where the subvectors use contiguous memory. Matrix multiply forces me to transpose each vector, which creates a copy. This is costly within the inner loop. Therefore, the choice for me seems to be either to learn how to use Julia's parallel processing commands/macros or to go back to Python (where Intel's MKL BLAS operates as expected for ddot()). Since dot() is the function that consumes 99% of the CPU in the routine I am trying to code, I was hoping that OpenBLAS would give me a user-friendly optimum solution in Julia without having to worry about all the parallel processing complexity under the hood. But maybe it isn't so bad...
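To illustrate the copy (my own quick check, not a careful benchmark): v' materializes a full row-matrix copy of v (roughly 800 KB for 100000 Float64s) before the multiply, so the transposed version allocates on every iteration, while dot() does not:

@time for k=1:1000; s=v'*v; end    # allocates a copy of v each iteration
@time for k=1:1000; s=dot(v,v); end  # no per-iteration copy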
I could use some help creating a multi-threaded dot() function. Some example code would be the best help. Do I need to use SharedArray when all threads run on the same machine? If so, does converting from Array to SharedArray create a copy? Since my arrays are very large, I don't want two copies in memory simultaneously. Because my vectors are about 100,000 in length, and vectors from one array are used in an unpredictable order, the best multi-threaded solution for me would be a dot() function that splits the task among the available cores and sums the result from each core, as in the sketch below. How can I do that in Julia as efficiently as BLAS?
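Something like this untested sketch (pdot and the setup are my own naming) is what I mean by splitting the work and summing the per-core results; I don't know whether it can approach BLAS speed:

# Assumes Julia was started with "julia -p 8". The reduction form of @parallel
# splits 1:length(x) into contiguous chunks, one per worker, and sums the
# partial results on the master process.
function pdot(x::SharedArray{Float64,1}, y::SharedArray{Float64,1})
    @parallel (+) for i = 1:length(x)
        x[i] * y[i]
    end
end

# Filling a SharedArray from an existing Array copies the data into shared
# memory, so (I believe) both copies exist in memory at least briefly.
sv = SharedArray(Float64, (length(v),))
copy!(sv, v)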
I saw Tim Holy's answer here: BLAS v. parallel updates for Julia SharedArray objects. However, his example (I think) executes an entire dot() call on one core, and it doesn't answer my other questions.
Edit 1:
I tried running Julia with the "-p 8" command-line option and replacing dot() in my example above with innersimd() from the manual's performance tips:
http://docs.julialang.org/en/release-0.3/manual/performance-tips/
It still used only one core. I then modified innersimd() to type its arguments as ::Array{Float64, 1} and then as ::SharedArray{Float64, 1}, but this still used only one core. :(
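For reference, innersimd() from that manual page looks like this, as far as I can tell (in hindsight, @simd only vectorizes within a single core, which may explain why I still saw one core):

function innersimd(x, y)
    # SIMD-vectorized serial dot product from the manual's performance tips.
    s = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds s += x[i] * y[i]
    end
    s
end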
Edit 2: I tested Julia's matrix multiplication (BLAS' gemm!() function):
blas_set_num_threads(CPU_CORES)
const A=ones(Float64,(4,100000))
const B=ones(Float64,(100000,4))
@time for k=1:100000;s=A*B;end
Even with Julia started without the "-p" command line option, Julia uses only one core.
Edit 3: Here is comparable test code for Python:
import numpy as np
from scipy.linalg.blas import ddot
from timeit import default_timer as timer

v = np.ones(100000)
start = timer()
for k in range(1000000):
    s = ddot(v, v)
exec_time = timer() - start
print()
print("Execution took", str(round(exec_time, 3)), "seconds")
I got the same result on 64-bit Anaconda3 v2.1.0 and WinPython: 7.5 seconds. Compare this to the first example in my question, using Julia 0.3.7 with OpenBLAS, which executed in 28 seconds. That makes Python about 4 times faster than Julia, which is explained by OpenBLAS' single-threaded implementation of ddot().
Edit 4:
I did some testing in Python, effectively executing a (4xN)*(Nx2) matrix multiply in my inner loop (N=100000), and I found it to be more than twice as slow. I suspect that this is because the cache demands are now 8 times larger, so the cache hit rate is worse. To keep the cache demands in Julia the same as in Python, the question for Julia is: how do I break my 100000-length vectors into 4 chunks, execute OpenBLAS' single-threaded ddot() on them in parallel, and sum the results (see the sketch below)? The length may not be exactly a multiple of 4. Since OpenBLAS would be single-threaded, I can run Julia with the "-p 8" command-line argument and multi-thread other custom functions in the same session.
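To make this concrete, here is an untested sketch of the chunking I have in mind (chunkdot and pdotblas are my own names), assuming Julia started with "-p 8" and the data already in SharedArrays; the last chunk absorbs any remainder when the length is not a multiple of the worker count:

# Must be defined on all workers.
@everywhere function chunkdot(x::SharedArray{Float64,1}, y::SharedArray{Float64,1}, lo::Int, hi::Int)
    # Single-threaded BLAS ddot on one contiguous chunk of the shared data.
    a = sdata(x); b = sdata(y)
    Base.LinAlg.BLAS.dot(hi - lo + 1, pointer(a, lo), 1, pointer(b, lo), 1)
end

function pdotblas(x::SharedArray{Float64,1}, y::SharedArray{Float64,1})
    n = length(x)
    np = nworkers()
    len = div(n, np)                 # base chunk length
    @parallel (+) for i = 1:np
        lo = (i - 1) * len + 1
        hi = i == np ? n : i * len   # last chunk takes the remainder
        chunkdot(x, y, lo, hi)
    end
end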
Edit 5: Running Julia v0.3.7 with the "-p 8" command-line argument, I observe that OpenBLAS' gemm!() function does run multithreaded (for certain matrix dimensions):
blas_set_num_threads(CPU_CORES)
const a = rand(10000, 10000)
@time a * a
Upvotes: 4
Views: 2511
Reputation: 1380
The reason for this is that OpenBLAS' ddot doesn't seem to be multithreaded. I think they mainly implement multithreading for BLAS-3 functions like xgemm, so if possible, the best thing to do is to write your problem in terms of matrix multiplications.
It would help if you could give an example of how you tried it with matrix multiplication. Maybe the transposition can be avoided. How did you do this in Python? It is all just BLAS, so Julia should be able to do as well as Python here.
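For example, if several of the vectors can be laid out as columns of one matrix, a single dgemm call computes all their dot products at once, and At_mul_B passes the transpose flag to BLAS instead of materializing a transposed copy (a sketch of the general idea, not necessarily your data layout):

X = rand(100000, 8)   # 8 vectors of length 100000, one per column
G = At_mul_B(X, X)    # one dgemm call; G[i,j] == dot(X[:,i], X[:,j]),
                      # with no transposed copy of X created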
Running Julia with the -p 8 flag launches nine copies of Julia and turns off the multithreading in BLAS, so it would probably make the situation worse in your case.
Edit 1: I tried your example with OpenBLAS and MKL, both in single- and multithreaded mode, and if there is any difference, single-threaded is faster. I'm not sure whether OpenBLAS uses multiple threads for your example, though. It might be that a multithreading speedup is hard to achieve for thin problems, and therefore they only turn on multithreading for fatter problems. Is your example significantly faster in Python when run on the same machine? If so, please provide the Python example so that I can compare to it on my machine.
Edit 2:
I can confirm the speedup of ddot in MKL. To be precise, the comment in Edit 1 was about dgemm. There is active work on making Julia multithreaded, and when that happens, a threaded Julia implementation could maybe be faster than OpenBLAS' ddot. Your options right now are one of the following:
- If possible, rewrite your problem such that OpenBLAS' dgemm uses multiple threads, i.e. with fatter matrices. On my machine, multithreading seems to kick in at n=16. If this is possible, I don't think it will be slower than MKL. Usually, OpenBLAS' dgemm compares well with MKL's.
- Ask the OpenBLAS developers to make ddot multithreaded. It would be better if you could just supply a version to the project yourself, but high-performance BLAS code is hard to write.
- Buy MKL and start selling Revolution Julia, including MKL, to cover the costs of the license.
Edit 3: The comparisons were all made in Julia on an Ubuntu machine with Nehalem architecture and 80 cores (but I only used up to 16). I used two different builds of the same development version of Julia: one linked with OpenBLAS built from source and one linked with MKL. I also tried Julia with OpenBLAS on a recent Haswell system to make sure that threading wasn't implemented for ddot on that architecture.
Upvotes: 5