Arash

Reputation: 21

Multithreaded code with released GIL and complex per-thread work is slower in Python

I am working on an 8-core machine with 8 GB of RAM running Red Hat Linux 7, and I am using the PyCharm IDE.

I tried to use the Python threading module to take advantage of multi-core processing, but I ended up with much slower code. I released the GIL through Numba and made sure my threads are doing sufficiently complex calculations, so the problem is not the one discussed in, for example, How to make numba @jit use all cpu cores (parallelize numba @jit).

Here is the multithreaded code:

import threading
import numpy as np
import numba as nb

l=200

# Lennard-Jones force on particle i from all other particles; nogil=True
# releases the GIL so the jitted body can run in parallel with other threads.
@nb.jit('void(f8[:],f8,i4,f8[:])',nopython=True,nogil=True)
def force(r,ri,i,F):

   sum=0
   for j in range(12):
       if (j != i):
           fij=-4 * (12*1**12/(r[j]-ri)**13-6*1**6/(r[j]-ri)**7)
           sum=sum+fij

   F[i+12]=sum

# right-hand side for the ODE solver: spawns one thread per particle
def ODEfunction(r, t):

   f = np.zeros(2 * 12)

   lbound=-4* (12*1**12/(-0.5*l-r[0])**13-6*1**6/(-0.5*l-r[0])**7)
   rbound=-4* (12*1**12/(0.5*l-r[12-1])**13-6*1**6/(0.5*l-r[12-1])**7)

   f[0:12]=r[12:2*12]

   thlist=[threading.Thread(target=force, args=(r,r[i],i,f)) for i in range(12)]

   for thread in thlist:
       thread.start()
   for thread in thlist:
       thread.join()

   f[12]=f[12]+lbound
   f[2*12-1]=f[2*12-1]+rbound
   return f
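
For reference, the (r, t) signature suggests ODEfunction is fed to an ODE integrator; a minimal driver might look like this (an assumption - the question does not show how it is called, and the initial state below is made up):

from scipy.integrate import odeint
import numpy as np

# 12 positions spread inside the box plus 12 velocities (assumed zero)
r0 = np.concatenate([np.linspace(-0.4 * l, 0.4 * l, 12), np.zeros(12)])
t = np.linspace(0.0, 1.0, 101)
sol = odeint(ODEfunction, r0, t)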

And this is the sequential version:

import numpy as np
import numba as nb

l=200

@nb.autojit()  # lazy compilation; @nb.autojit is deprecated, current Numba versions use @nb.jit
def ODEfunction(r, t):

   f = np.zeros(2 * 12)
   lbound=-4* (12*1**12/(-0.5*l-r[0])**13-6*1**6/(-0.5*l-r[0])**7)
   rbound=-4* (12*1**12/(0.5*l-r[12-1])**13-6*1**6/(0.5*l-r[12-1])**7)

   f[0:12]=r[12:2*12]

   for i in range(12):
       fi = 0.0
       for j in range(12):
           if (j!=i):
               fij = -4 * (12*1**12/(r[j]-r[i])**13-6*1**6/(r[j]-r[i])**7)
               fi = fi + fij
       f[i+12]=fi
   f[12]=f[12]+lbound
   f[2*12-1]=f[2*12-1]+rbound
   return f

I have also attached an image of the system monitor during the run of the multithreaded and the sequential code:

System Monitor during the run of the multithreaded code

System Monitor during the run of the sequential code

Does anybody know what the reason for this inefficiency in the threaded code could be?

Upvotes: 1

Views: 713

Answers (1)

ead

Reputation: 34326

You must be aware that calling a function in Python is quite costly (compared, for example, to calling a function in C), and calling a numba-jitted function even more so: it must be checked that the parameters are right (i.e. that you are really passing float numpy arrays) - we will see that this can be a factor of 5 slower than a normal Python call.

Let's check the overhead of your jitted function compared to the work happening inside it:

import numba as nb
import numpy as np

# the force function from the question:
@nb.jit('void(f8[:],f8,i4,f8[:])',nopython=True,nogil=True)
def force(r,ri,i,F):

   sum=0
   for j in range(12):
       if (j != i):
           fij=-4 * (12*1**12/(r[j]-ri)**13-6*1**6/(r[j]-ri)**7)
           sum=sum+fij

   F[i+12]=sum

@nb.jit('void(f8[:],f8,i4,f8[:])',nopython=True,nogil=True)
def nothing(r,ri,i,F):
    # jitted no-op: measures the pure call overhead of a numba function
    pass

def no_jit(r,ri,i,F):
    # plain-Python no-op: the baseline cost of a normal Python call
    pass

F=np.zeros(24)
r=np.zeros(12)

And now:

>>>%timeit force(r,1.0,0,F)
706 ns ± 8.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit nothing(r,1.0,0,F)
645 ns ± 5.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>>%timeit no_jit(r,1.0,0,F) #to measure overhead of numba
120 ns ± 6.56 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

So the call is basically 90% overhead (645 ns / 706 ns ≈ 91%), which you don't have in the single-threaded mode because the function is effectively "inlined" there. This is not surprising: your for-loop has only 12 iterations - that is just not enough work. In the question you have linked, for example, the inner loop has 10^10 iterations!

Additionally, there is also some overhead from dispatching the work between threads; my gut says it is even more than the overhead of the jitted call - but to be sure one would have to profile the program. These odds are quite hard to beat even with 8 cores!
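
For a rough feel for that dispatch cost, one can time spawning and joining twelve no-op threads, mirroring a single ODEfunction call (a minimal sketch; dispatch_only is just an illustration name):

import threading
import timeit

def dispatch_only():
    # create, start and join 12 threads that do no work, so only
    # the thread-management cost of one ODEfunction call is measured
    thlist = [threading.Thread(target=lambda: None) for _ in range(12)]
    for thread in thlist:
        thread.start()
    for thread in thlist:
        thread.join()

print(timeit.timeit(dispatch_only, number=1000) / 1000, "s per call")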

The biggest roadblock right now is probably the large overhead of the force call compared to the time spent in the function itself. The analysis above is quite superficial, so I cannot guarantee there are no other significant issues - but giving force bigger chunks of work would be a step in the right direction.
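
As an illustration of what "bigger chunks" could mean - this is only a sketch, and force_range is a hypothetical variant of your force that handles a whole slice of particles per call, so each thread amortizes the call and dispatch overhead over several particles:

import threading
import numpy as np
import numba as nb

@nb.jit('void(f8[:],i4,i4,f8[:])',nopython=True,nogil=True)
def force_range(r, start, stop, F):
    # forces on all particles in [start, stop) in one jitted call
    n = 12
    for i in range(start, stop):
        s = 0.0
        for j in range(n):
            if j != i:
                s += -4 * (12*1**12/(r[j]-r[i])**13 - 6*1**6/(r[j]-r[i])**7)
        F[i + n] = s

With this, ODEfunction could use e.g. 4 threads with 3 particles each instead of 12 threads with one particle each:

thlist = [threading.Thread(target=force_range, args=(r, 3*k, 3*(k+1), f))
          for k in range(4)]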

Upvotes: 2
