Reputation: 5635
I have a Matlab background, and when I bought a laptop a year ago I carefully selected one with a lot of compute power: the machine has 4 cores and offers me 8 hardware threads at 2.4 GHz. The machine proved very powerful, and using simple parfor loops I could utilize all the processor threads, with which I got a speedup near 8 for many problems and experiments.
This nice Sunday I was experimenting with numpy. People often tell me that the core business of numpy is implemented efficiently using libblas, possibly even using multiple cores and libraries like OpenMP (with OpenMP you can create parfor-like loops using C-style pragmas).
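For example, a parallel sum with OpenMP looks roughly like this (a minimal sketch, compiled with g++ -fopenmp; not something I benchmarked here):
#include <iostream>

int main() {
    double sum = 0.0;
    // OpenMP splits the iterations over the available threads and
    // combines the per-thread partial sums at the end, much like parfor.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100000000; i++)
        sum += static_cast<double>(i) * i;
    std::cout << sum << '\n';
}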
This is the general approach for many numerical and machine-learning algorithms: you express them using expensive high-level operations like matrix multiplications, which run in fast compiled code, but write the program itself in an expressive high-level language like Matlab or Python for comfort. Moreover, C(++) allows us to bypass the GIL.
So the cool part is that linear algebra should run really fast in Python whenever you use numpy. You only pay the overhead of some function calls, and if the calculation behind them is large, that overhead is negligible.
So, setting aside the fact that not everything can be expressed in linear algebra or other numpy operations, I gave it a spin:
t = time.time(); numpy.dot(range(100000000), range(100000000)); print(time.time() - t)
40.37656021118164
During these 40 seconds I saw ONE of the 8 threads on my machine working at 100%, while the others stayed near 0%. I didn't like this, but even with one thread working I'd expect this to run in approximately 0.something seconds: the dot product does 100M +'es and *'es, and at 2.4 GHz one second gives 2400M / 100M = 24 clock ticks per element for one +, one * and whatever overhead.
Nevertheless, the run actually needs about 40 * 24 ≈ 1000 ticks (!!!!!) per element for the +, the * and the overhead. Let's do this in C++:
#include <iostream>

int main() {
    unsigned long long result = 0;
    // 100M multiply-adds; the true sum of i*i exceeds 2^64,
    // so the result wraps around, which is fine for a timing test.
    for (unsigned long long i = 0; i < 100000000; i++)
        result += i * i;
    std::cout << result << '\n';
}
BLITZ:
herbert@machine:~$ g++ -std=c++11 dot100M.cc
herbert@machine:~$ time ./a.out
662921401752298880
real 0m0.254s
user 0m0.254s
sys 0m0.000s
0.254 seconds, roughly 160 times faster than numpy.dot.
I thought that maybe the python3 range generator was the slow part, so I handicapped my C++11 implementation by storing all 100M numbers in a std::vector first (using iterative push_backs) and then iterating over it. This was a lot slower: it took a little under 4 seconds, which is still 10 times faster than numpy.
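The handicapped version looked roughly like this (a sketch, not the exact code):
#include <iostream>
#include <vector>

int main() {
    // Materialize all 100M numbers first, as numpy has to do
    // with the two range objects, then take the dot product.
    std::vector<unsigned long long> v;
    for (unsigned long long i = 0; i < 100000000; i++)
        v.push_back(i);

    unsigned long long result = 0;
    for (unsigned long long x : v)
        result += x * x;
    std::cout << result << '\n';
}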
I installed my numpy using 'pip3 install numpy' on Ubuntu, and it compiled for some time, using both gcc and gfortran; moreover, I saw mentions of BLAS header files passing through the compiler output.
Why is numpy.dot so extremely slow?
Upvotes: 0
Views: 1676
Reputation: 41
range generates a list based on your given parameters (in Python 2; in Python 3 it is a lazy range object that numpy still has to convert element by element), whereas your for loop in C merely increments a number.
I agree that it seems fairly costly, computationally, to spend so much time on generating one list -- then again, it is a big list, and you're requesting two of them ;-)
EDIT: As mentioned in the comments, range generates lists, not numpy arrays. Try substituting your range call with an incrementing while loop or similar and see if you get more tolerable results.
Upvotes: 1
Reputation: 42748
So your comparison is unfair. In your Python example, you first generate two range objects, convert them to numpy arrays, and only then do the scalar product; the calculation itself takes the least part of the time. Here are the numbers for my computer:
>>> t=time.time();x=numpy.arange(100000000);numpy.dot(x,x);print(time.time()-t)
1.28280997276
And without the generation of the array:
>>> t=time.time();numpy.dot(x,x);print(time.time()-t)
0.124325990677
For completeness, the C++ version takes roughly the same time:
real 0m0.108s
user 0m0.100s
sys 0m0.007s
Upvotes: 12