Herbert

Reputation: 5645

Are Ubuntu's BLAS implementations parallel?

Linux offers great parallelization tools such as pthreads and OpenMP, which make it relatively easy to parallelize calculations. Even though I am no expert in assembly-level optimization, it seems fairly straightforward to implement element-wise operations on large arrays in parallel, and the same goes for matrix multiplication, solving linear systems, etc.

However, if I install numpy via apt-get (sudo apt-get install python3 python3-dev python3-numpy ipython3), element-wise operations run on just one core. This is unfortunate, since Python itself cannot parallelize them because of the GIL. Parallelization at the lower BLAS level would let me use all cores and get a significant speedup (6-8x on my i7 with 8 threads), meaning less waiting and more experimenting while still working in a comfortable language (Python).

This simple element-wise computation

y = (lambda x: x ** x)(numpy.arange(1e8))

could be run in parallel, yet it runs on just one core with my default Ubuntu apt-get numpy installation.
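For comparison, a large matrix product, which as far as I understand is dispatched to the BLAS dgemm routine, is what I watch in htop to check whether the BLAS library itself is threaded (a minimal test, matrix size chosen arbitrarily):

import numpy
a = numpy.random.rand(3000, 3000)
b = numpy.random.rand(3000, 3000)
# numpy.dot on 2-D float arrays goes through BLAS; a threaded BLAS
# (e.g. OpenBLAS or a threaded ATLAS build) should load several cores here
c = a.dot(b)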

The same goes for Theano. I think Theano also uses BLAS under the hood, since one of its dependencies is numpy (which in turn uses BLAS). Whenever I train a Keras or Theano neural network, just one core is used.
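As far as I can tell from the Theano documentation, Theano can at least be pointed at a specific BLAS through its blas.ldflags option; this is roughly what I would try, with the option name and the OpenBLAS library being assumptions on my part:

import os
# THEANO_FLAGS must be set before theano is imported; blas.ldflags tells Theano
# which BLAS library to link against (OpenBLAS here, assuming it is installed)
os.environ['THEANO_FLAGS'] = 'blas.ldflags=-lopenblas'
import theano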

To the best of my understanding, BLAS is just an interface description, not an actual implementation. Nevertheless, Ubuntu has both a libblas3 package (the reference implementation, I believe) and a libatlas3-base package, which seem to be different implementations.
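To see which implementation my numpy actually uses at runtime, I print its build configuration and grep the process's memory map for BLAS-like libraries (Linux-specific; the paths differ per machine):

import numpy
numpy.show_config()  # lists the BLAS/LAPACK libraries numpy was built against
# the shared object that is actually loaded shows up in the process's memory map
with open('/proc/self/maps') as maps:
    for line in maps:
        if 'blas' in line or 'atlas' in line:
            print(line.strip())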

Furthermore, compiling ATLAS from source was not a success: the configure script kept complaining about CPU throttling. This seems to be inherent to my CPU, which always runs about 100 MHz below its maximum frequency; it has a 2.4 GHz maximum and can boost to 3.4 GHz for half a minute or so - I think this is called thermal throttling.

What are my options regarding multi-core usage of numpy operations on matrices and the like? Does it come out of the box? If not, why not? Why is it so hard to provide a parallel linear-algebra backend for numpy? With 4 or 8 CPU threads, I would expect that assembly-level single-threaded optimizations cannot beat that.

EDIT

I also tried the following commands:

OMP_NUM_THREADS=8 LD_PRELOAD=/usr/lib/atlas-base/libatlas.so python3 -c 'import numpy; x=numpy.arange(1e8); y=x**x'
OMP_NUM_THREADS=8 LD_PRELOAD=/usr/lib/libopenblas.so python3 -c 'import numpy; x=numpy.arange(1e8); y=x**x'

These also use just one core.
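For completeness, one workaround I am considering for the element-wise case is to chunk the array myself and use multiprocessing; a rough sketch (not a substitute for a threaded BLAS, and the chunk count of 8 simply matches my hardware threads):

import numpy
from multiprocessing import Pool

def elementwise_power(chunk):
    # each worker process computes the element-wise power on its own slice
    return chunk ** chunk

if __name__ == '__main__':
    x = numpy.arange(1e8)
    with Pool(8) as pool:
        parts = pool.map(elementwise_power, numpy.array_split(x, 8))
    y = numpy.concatenate(parts)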

Upvotes: 1

Views: 1058

Answers (0)
