Reputation: 5635
I have a Matlab background, and when I bought a laptop a year ago I carefully selected one with a lot of compute power: the machine has 4 cores and offers me 8 hardware threads at 2.4 GHz. The machine proved very powerful, and using simple parfor loops I could utilize all the processor threads, with which I got a speedup near 8 for many problems and experiments.
This nice Sunday I was experimenting with numpy. People often tell me that the core business of numpy is implemented efficiently using libblas, possibly even using multiple cores and libraries like OpenMP (with OpenMP you can create parfor-like loops using C-style pragmas).
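For example, a parallel sum with OpenMP looks roughly like this (a minimal sketch, compiled with g++ -fopenmp; not something I benchmarked here):
#include <iostream>

int main() {
    double sum = 0.0;
    // OpenMP splits the iterations over the available threads and
    // combines the per-thread partial sums at the end, much like parfor.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100000000; i++)
        sum += static_cast<double>(i) * i;
    std::cout << sum << '\n';
}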
This is the general approach for many numerical and machine-learning algorithms: you express them using expensive high-level operations like matrix multiplications, which run in fast compiled code, but write the program itself in an expressive high-level language like Matlab or Python for comfort. Moreover, C(++) allows us to bypass the GIL.
So the cool part is that linear algebra should run really fast in Python whenever you use numpy. You only pay the overhead of some function calls, and if the calculation behind them is large, that overhead is negligible.
So, setting aside the fact that not everything can be expressed in linear algebra or other numpy operations, I gave it a spin:
t = time.time(); numpy.dot(range(100000000), range(100000000)); print(time.time() - t)
40.37656021118164
During these 40 seconds I saw ONE of the 8 threads on my machine working at 100%, while the others stayed near 0%. I didn't like this, but even with one thread working I'd expect this to run in approximately 0.something seconds: the dot product does 100M +'es and *'es, and at 2.4 GHz one second gives 2400M / 100M = 24 clock ticks per element for one +, one * and whatever overhead.
Nevertheless, the run actually needs about 40 * 24 ≈ 1000 ticks (!!!!!) per element for the +, the * and the overhead. Let's do this in C++:
#include <iostream>

int main() {
    unsigned long long result = 0;
    // 100M multiply-adds; the true sum of i*i exceeds 2^64,
    // so the result wraps around, which is fine for a timing test.
    for (unsigned long long i = 0; i < 100000000; i++)
        result += i * i;
    std::cout << result << '\n';
}
BLITZ:
herbert@machine:~$ g++ -std=c++11 dot100M.cc
herbert@machine:~$ time ./a.out
662921401752298880
real 0m0.254s
user 0m0.254s
sys 0m0.000s
0.254 seconds, roughly 160 times faster than numpy.dot.
I thought that maybe the python3 range generator was the slow part, so I handicapped my C++11 implementation by storing all 100M numbers in a std::vector first (using iterative push_backs) and then iterating over it. This was a lot slower: it took a little under 4 seconds, which is still 10 times faster than numpy.
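The handicapped version looked roughly like this (a sketch, not the exact code):
#include <iostream>
#include <vector>

int main() {
    // Materialize all 100M numbers first, as numpy has to do
    // with the two range objects, then take the dot product.
    std::vector<unsigned long long> v;
    for (unsigned long long i = 0; i < 100000000; i++)
        v.push_back(i);

    unsigned long long result = 0;
    for (unsigned long long x : v)
        result += x * x;
    std::cout << result << '\n';
}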
I installed my numpy using 'pip3 install numpy' on Ubuntu, and it compiled for some time, using both gcc and gfortran; moreover, I saw mentions of BLAS header files passing through the compiler output.
Why is numpy.dot so extremely slow?
Upvotes: 0
Views: 1676
Reputation: 41
range generates a list based on your given parameters (in Python 2; in Python 3 it is a lazy range object that numpy still has to convert element by element), whereas your for loop in C merely increments a number.
I agree that it seems fairly costly, computationally, to spend so much time on generating one list -- then again, it is a big list, and you're requesting two of them ;-)
EDIT: As mentioned in the comments, range generates lists, not numpy arrays. Try substituting your range call with an incrementing while loop or similar and see if you get more tolerable results.
Upvotes: 1
Reputation: 42748
So your comparison is unfair. In your Python example, you first generate two range objects, convert them to numpy arrays, and only then do the scalar product; the calculation itself takes the least part of the time. Here are the numbers for my computer:
>>> t=time.time();x=numpy.arange(100000000);numpy.dot(x,x);print(time.time()-t)
1.28280997276
And without the generation of the array:
>>> t=time.time();numpy.dot(x,x);print(time.time()-t)
0.124325990677
For completeness, the C++ version takes roughly the same time:
real 0m0.108s
user 0m0.100s
sys 0m0.007s
Upvotes: 12