Why isn't numpy using more than one thread for array creation/manipulation?

I know that questions along these lines have been asked before, but all of them seem to be outdated. If I run the code:

import numpy as np

a = np.random.randn(10000, 10000)
b = np.random.randn(10000, 10000)
c = a*b

htop shows that only one core is being used at more or less 100% capacity. So, why doesn't numpy speed up processing by using more than one thread? Is this caused by my installation of numpy, or by some dependent library?

I am running this on Fedora 36.

I tried to implement multithreading manually, by splitting the arrays and performing the operations asynchronously, but the results I got were simply incorrect. Reading through the aforementioned old questions, maybe this has something to do with the BLAS library or its environment variables, but that advice looks outdated to me: I couldn't find the library on my machine, nor the environment variables.
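In case it helps, this is how I checked which BLAS backend my numpy build is linked against (a quick diagnostic sketch using numpy's own `show_config`; the exact output format depends on the numpy version):

```python
import numpy as np

# Prints the build and linkage information, including which
# BLAS/LAPACK libraries (e.g. OpenBLAS, MKL) numpy was compiled against.
np.show_config()
```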

Any help would be sincerely appreciated.

Upvotes: 1

Views: 594

Answers (1)

Ahmed AEK

Reputation: 17496

It's not really beneficial. Multithreading the random generator is covered in the docs under Multithreaded Random Generation, and the correct way to multithread numpy elementwise multiplication is as follows:

import numpy as np
from multiprocessing.pool import ThreadPool
from multiprocessing import cpu_count
import time

a = np.random.randn(10000, 10000)
b = np.random.randn(10000, 10000)

threads = 8

def multiply_wrapper(args):
    # write each chunk product directly into the matching view of c
    np.multiply(args[0], args[1], out=args[2])

with ThreadPool(threads) as pool:
    splits_a = np.array_split(a, threads)
    splits_b = np.array_split(b, threads)
    c = np.empty_like(a)  # to hold the results
    splits_c = np.array_split(c, threads)

    t1 = time.perf_counter()
    pool.map(multiply_wrapper, zip(splits_a, splits_b, splits_c))
    t2 = time.perf_counter()
    c2 = a * b
    t3 = time.perf_counter()

    assert np.all(c == c2)  # both methods are equivalent

    print("threaded version", t2 - t1)
    print("serial version", t3 - t2)
Output:

threaded version 0.12851539999246597
serial version 0.23459540004841983

Note that even though your arrays are really big (10_000, 10_000), multithreading this operation yields only a modest speedup, and that's without accounting for the overhead of spawning the threads.

On smaller arrays the overhead of spawning threads will exceed the time taken by the calculation itself, so sprinkling threads everywhere is generally not the right approach and can hurt performance in many cases.
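For completeness, the multithreaded random generation pattern from the docs can be sketched like this: spawn independent child generators from one `SeedSequence` and let each thread fill its own slice of the output via `standard_normal(out=...)`. The array size, thread count, and seed below are arbitrary choices for illustration:

```python
import numpy as np
import concurrent.futures

n = 4096
threads = 8

# one independent generator per thread, all derived from a single seed
seed_seq = np.random.SeedSequence(12345)
rngs = [np.random.default_rng(s) for s in seed_seq.spawn(threads)]

out = np.empty((n, n))
chunks = np.array_split(out, threads)  # views into `out`, one per thread

def fill(rng, chunk):
    # Generator.standard_normal can write into an existing array via `out=`
    rng.standard_normal(out=chunk)

with concurrent.futures.ThreadPoolExecutor(threads) as ex:
    list(ex.map(fill, rngs, chunks))
```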

Upvotes: 2
