user15964

Reputation: 2639

Weird bug in Pandas and Numpy regarding multithreading

Most of NumPy's functions enable multithreading by default.

For example, I work on an 8-core Intel CPU workstation. If I run this script:

import numpy as np    
x=np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)

Linux top shows 800% CPU usage while it runs [screenshot: top output showing 800% CPU]. This means numpy automatically detects that my workstation has 8 cores, and np.sqrt automatically uses all 8 cores to accelerate the computation.
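Presumably this comes from numpy being built against Intel MKL, which is also what the perf output in the PS below points at. A quick way to check which backend a given numpy build is linked against is np.show_config(); this only inspects the build configuration and does not by itself prove that any particular function is multithreaded:

import numpy as np

# Print the build configuration; an Anaconda/MKL build typically mentions
# "mkl" in the BLAS/LAPACK sections.
np.show_config()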

However, I found a weird bug. If I run this script:

import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.random((10,10)))
df+df
x=np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)

the CPU usage is only 100%! [screenshot: top output showing 100% CPU] This means that if you add two pandas DataFrames before running any numpy function, numpy's automatic multithreading is gone, without any warning. That makes no sense: why would a pandas DataFrame calculation affect numpy's threading setting? Is it a bug? How can I work around this?


PS:

I dug further using the Linux perf tool.

Running the first script, perf shows:

[screenshot: perf output while running the first script]

while running the second script it shows:

[screenshot: perf output while running the second script]

So both scripts involve libmkl_vml_avx2.so, while the first script additionally involves libiomp5.so, which seems to be related to OpenMP.

And since VML stands for Intel Vector Math Library, according to the VML documentation I guess that at least the functions below are all automatically multithreaded:

[screenshot: list of VML functions from the VML documentation]
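For a rough cross-check without perf, one can also list which MKL/OpenMP shared objects the process has mapped by reading /proc/self/maps (Linux only; this shows what is loaded, not what is actually doing the work):

import numpy as np

x = np.random.random(1000000)
np.sqrt(x)   # force the relevant MKL code paths to load

# Print the MKL / OpenMP shared objects currently mapped into this process.
with open("/proc/self/maps") as maps:
    libs = {line.split()[-1] for line in maps if "libmkl" in line or "libiomp" in line}
for lib in sorted(libs):
    print(lib)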

Upvotes: 27

Views: 3481

Answers (3)

ead

Reputation: 34316

Pandas uses numexpr under the hood to calculate some operations, and numexpr sets the maximal number of VML threads to 1 when it is imported:

# The default for VML is 1 thread (see #39)
set_vml_num_threads(1)

and it gets imported by pandas when df+df is evaluated in expressions.py:

from pandas.core.computation.check import _NUMEXPR_INSTALLED

if _NUMEXPR_INSTALLED:
   import numexpr as ne

However, the Anaconda distribution also uses VML functionality for functions such as sqrt, sin, cos and so on, and once numexpr has set the maximal number of VML threads to 1, the numpy functions no longer use parallelization.
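If that is the whole story, pandas is not even needed to trigger it; merely importing numexpr should already switch numpy's VML-backed functions to a single thread. A small sketch to check this (watch top while it runs; this pandas-free variant is my addition, not one of the original scripts):

import numpy as np
import numexpr            # importing numexpr runs set_vml_num_threads(1)

x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)            # expected to stay near 100% CPU, not 800%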

The problem can be easily seen in gdb (using your slow script):

>>> gdb --args python slow.py
(gdb) b mkl_serv_domain_set_num_threads
function "mkl_serv_domain_set_num_threads" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mkl_serv_domain_set_num_threads) pending.
(gdb) run
Thread 1 "python" hit Breakpoint 1, 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt 
#0  0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007fffe978026c in _set_vml_num_threads(_object*, _object*) () from /home/ed/anaconda37/lib/python3.7/site-packages/numexpr/interpreter.cpython-37m-x86_64-linux-gnu.so
#2  0x00005555556cd660 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:694
...
(gdb) print $rdi
$1 = 1

i.e. we can see that numexpr sets the number of threads to 1, which is later used when the VML sqrt function is called:

(gdb) b mkl_serv_domain_get_max_threads
Breakpoint 2 at 0x7fffee65a900
(gdb) c
Continuing.

Thread 1 "python" hit Breakpoint 2, 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#2  0x00007fffedf78563 in vdSqrt () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_lp64.so
#3  0x00007ffff5ac04ac in trivial_two_operand_loop () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so

So we can see that numpy uses VML's implementation of vdSqrt, which utilizes mkl_vml_serv_threader_d_1i_1o to decide whether the calculation should be done in parallel, and it looks up the number of threads:

(gdb) fin
Run till exit from #0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) print $rax
$2 = 1

The register %rax holds the maximal number of threads, and it is 1.

Now we can use numexpr to increase the number of VML threads, i.e.:

import numpy as np
import numexpr as ne
import pandas as pd
df=pd.DataFrame(np.random.random((10,10)))
df+df

#HERE: reset number of vml-threads
ne.set_vml_num_threads(8)

x=np.random.random(1000000)
for i in range(10000):
    np.sqrt(x)     # now in parallel

Now multiple cores are utilized!
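If you prefer not to hard-code the 8, a variant of the same workaround is to take the thread count from os.cpu_count() (which counts logical cores, i.e. it may include hyper-threads):

import os
import numexpr as ne

# Restore VML threading to one thread per logical core.
ne.set_vml_num_threads(os.cpu_count() or 1)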

Upvotes: 16

Andrew Atrens

Reputation: 509

Looking into this, it seems that under the hood there have been on/off issues with multithreading, and depending on which version you are using you may start to see crashes when you bump up ne.set_vml_num_threads():

http://numpy-discussion.10968.n7.nabble.com/ANN-NumExpr-2-7-0-Release-td47414.html

I need to get my head around how this is glued into the Python interpreter, given your code example where it seems to somehow allow multiple apparently synchronous, ordered calls to np.sqrt() to proceed in parallel. I guess that if the Python interpreter is just returning a reference to an object when it pops the stack, and your example simply discards those references rather than assigning or manipulating them, it would be fine. But if subsequent loop iterations depend on previous ones, it seems less clear how these could be safely parallelized. Arguably, silent failure / wrong results is a worse outcome than crashes.

Upvotes: 2

Andrew Atrens

Reputation: 509

I think that your initial premise may be incorrect -

You stated: "This means numpy automatically detects that my workstation has 8 cores, and np.sqrt automatically uses all 8 cores to accelerate the computation."

A single call to np.sqrt() cannot guess how it will next be invoked, or return before it has partially completed. There are parallelism mechanisms in Python, but none are automatic.
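For example, an explicit (and decidedly non-automatic) way to parallelize this would be to split the array into chunks and map np.sqrt over them with a thread pool; this is only an illustration of what explicit parallelism looks like, not what numpy does internally:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

x = np.random.random(1000000)
chunks = np.array_split(x, 8)     # 8 chunks purely for illustration

# numpy ufuncs generally release the GIL in their inner loops, so the
# worker threads can actually overlap here.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(np.sqrt, chunks))

y = np.concatenate(results)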

Now, having said that, the Python interpreter may be able to optimize the for loop for parallelism, which may be what you are seeing, but I strongly suspect that if you look at the wall-clock time for this loop it will be no different regardless of whether you are (apparently) using 8 cores or 1.

UPDATE: Having read a bit more of the comments, it seems the multi-core behaviour you are seeing is related to the Anaconda distribution of the Python interpreter. I took a look but was unable to find any source code for it, but it seems that the Python license permits entities (like anaconda.com) to compile and distribute derivatives of the interpreter without requiring their changes to be published.

I guess you could reach out to the Anaconda folks; the behaviour you are seeing will be difficult to figure out without knowing what, if anything, they've changed in the interpreter.

Also, do a quick check of the wall-clock time with and without the optimization to see if it is indeed 8x faster. Even if you've really got all 8 cores working instead of 1, it would be good to know whether the results are actually 8x faster, or whether there are spinlocks in use that still serialize on a single mutex.
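A rough sketch of that wall-clock check, assuming the MKL-backed Anaconda setup from the question and numexpr's set_vml_num_threads from the accepted answer:

import time
import numpy as np
import numexpr as ne

x = np.random.random(1000000)

def timed(label):
    start = time.perf_counter()
    for i in range(10000):
        np.sqrt(x)
    print(label, round(time.perf_counter() - start, 2), "seconds")

ne.set_vml_num_threads(1)
timed("1 VML thread: ")

ne.set_vml_num_threads(8)   # 8 cores, as in the question
timed("8 VML threads:")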

Upvotes: 0
