Reputation: 450
Why is this Python NumPy code,
import numpy as np
import time
k_max = 40000
N = 10000
data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)
t1 = time.time()
for k in xrange(1,k_max+1):
cos_k = np.cos(k*data[0,:])
sin_k = np.sin(k*data[0,:])
coefs[k-1,0] = (data[1,-1]-data[1,0]) + np.sum(data[1,:-1]*(cos_k[:-1] - cos_k[1:]))
coefs[k-1,1] = np.sum(data[1,:-1]*(sin_k[:-1] - sin_k[1:]))
t2 = time.time()
print('Time:')
print(t2-t1)
faster than the following C++ code?
#include <cstdio>
#include <iostream>
#include <cmath>
#include <time.h>
using namespace std;
// consts
const unsigned int k_max = 40000;
const unsigned int N = 10000;
int main()
{
time_t start, stop;
double diff;
// table with data
double data1[ N ];
double data2[ N ];
// table of results
double coefs1[ k_max ];
double coefs2[ k_max ];
// main loop
time( & start );
for( unsigned int j = 1; j<N; j++ )
{
for( unsigned int i = 0; i<k_max; i++ )
{
coefs1[ i ] += data2[ j-1 ]*(cos((i+1)*data1[ j-1 ]) - cos((i+1)*data1[ j ]));
coefs2[ i ] += data2[ j-1 ]*(sin((i+1)*data1[ j-1 ]) - sin((i+1)*data1[ j ]));
}
}
// end of main loop
time( & stop );
// speed result
diff = difftime( stop, start );
cout << "Time: " << diff << " seconds";
return 0;
}
The first one shows: "Time: 8 seconds" while the second: "Time: 11 seconds"
I know that NumPy is written in C, but I would still think that C++ example would be faster. Am I missing something? Is there a way to improve the C++ code (or the Python one)?
I have changed the C++ code (dynamical tables to static tables) as suggested in one of the comments. The C++ code is faster now, but still much slower than the Python version.
I have changed from debug to release mode and increased 'k' from 4000 to 40000. Now NumPy is just slightly faster (8 seconds to 11 seconds).
Upvotes: 21
Views: 31671
Reputation: 1583
The Python code can't be faster than properly-coded C++ code since Numpy is coded in C, which is about as fast as C++. There are some cases when C++ outperforms C due to the language offering optimizations that are impossible in C. Any other speed differences between C and C++ will come down to the compiler you use and the nature of the code being compiled. Comparing languages for speed is a complex topic.
When someone uses NumPy, they set the problem up by writing Python code. That Python code executes at a crawl compared to tuned C or C++ code. However, since the majority of the execution time takes place crunching the numbers in the C-written NumPy libraries, the extra time to set the problem up in Python is irrelevant. If you wrote the same thing entirely in C/C++, your total execution time might be 1% faster as a guess, because whether you wrote that code that sets up the problem in C/C++ or Python, the C-written library will account for the majority of the execution time as it performs the requested calculations. For this reason, people enjoy the niceties of Python while benefiting from the speed of C where it matters.
As an example, let's say you setup a calculation where you want to multiply two square matrices with 1 million entries each. The Python code will execute slowly but in a tiny amount of total time. Then, the C libraries will perform that tremendous matrix multiplication, accounting for the majority of the execution time.
If you are comparing pure Python to tuned C or C++, expect Python to be anywhere between 10 to 1000 times slower than C or C++, depending on what task your program is doing. Pure Python will also tend to use more memory as well. So say you wrote that huge matrix multiplication in only Python rather than using NumPy. The execution time would be tremendous compared to using C/C++ or using NumPy.
look at the Benchmark Game where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. And if I'm not mistaken, the top Python solutions end up calling compiled C or C++ libraries to do the majority of the heavy lifting, so the test isn't really comparing pure Python to C++. This same thing happens in other languages e.g. the top Java solutions tend to call code written in C/C++ in some of the problems.
There are many reasons why C/C++ are faster than pure Python.
int
is a tiny chunk of 1s and 0s directly representing the number. It is much faster to pass around and use this barebones representation of an integer rather than deal with objects that contain that representation. Let's say you want to check if two integers are equal. In C/C++, machine code does a quick, single assembly instruction to compare two integers located here and there. In Python, a method from the Integer
class is called. This entails Python needs to look up the type of the object you're actually referencing, look up the name of the property on that object, check that property actually is a method, compile the method if it hasn't been jit'd, then finally it can jump to the code in the comparison method. if jit isn't enabled then the program starts executing that method's code line by line. This is also related to Python being a dynamic while C/C++ is static. Dynamic roughly means things are looked up at runtime whereas static means things are figured out at compile time and the answer is written directly into the machine code.n
integers in C/C++ is about as fast as you can iterate over n
things as possible whereas you will only iterate quickly over the n
pointers quickly in Python. You will then access random chunks of memory in the heap that hold the integer in question, and the n
integers may not even be stored next to each other in the heap.Upvotes: 2
Reputation: 2289
I am actually surprised that no one mentioned Linear Algebra libraries like BLAS LAPACK MKL and all...
Numpy is using complex Linear Algebra libraries ! Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code. These complex libraries are hardly matched by naive implementation of classic linear algebra computations. The simplest first example of improvement is the blocking trick.
I took the following image from the CSE lab of ETH, where they compare matrix vector multiplication for different implementation. The y-axis represents the intensity of computations (in GFLOPs); long story short, it is how fast the computations are done. The x-axis is the dimension of the matrix.
C and C++ are fast languages, but actually if you want to mimic the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (that are perhaps the closest to assembly code you can do in C++).
Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy), where the very good answer from @Jfs, and we observe: "There is no difference between C++ and numpy on my machine."
Some more reference:
Why is a naïve C++ matrix multiplication 100 times slower than BLAS?
Upvotes: 13
Reputation: 1035
I found this question interesting, because every time I encountered similar topic about the speed of NumPy (compared to C/C++) there was always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with additional layer (even a thin one).
The answer is: your C++ code is not slower than your Python code when properly compiled.
I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC.
I've computed everything again and also compared results with a pure C version of your code. I am using GCC version 4.9.2, and Python 2.7.9 (compiled from the source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main
, to compile my C code I used gcc -O3 main.c -lm -o main
. In all examples I filled data
variables with some numbers (0.1, 0.4), as it changes results. I also changed np.arrays to use doubles (dtype=np.float64
), because there are doubles in C++ example. My pure C version of your code (it's similar):
#include <math.h>
#include <stdio.h>
#include <time.h>
const int k_max = 100000;
const int N = 10000;
int main(void)
{
clock_t t_start, t_end;
double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
int z;
for( z = 0; z < N; z++ )
{
data1[z] = 0.1;
data2[z] = 0.4;
}
int i, j;
t_start = clock();
for( i = 0; i < k_max; i++ )
{
for( j = 0; j < N-1; j++ )
{
coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
}
}
t_end = clock();
seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("Time: %f s\n", seconds);
return coefs1[0];
}
For k_max = 100000, N = 10000
results where following:
Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower compared to C/C++ one. And it is.
For k_max = 1000000, N = 1000
we have:
For k_max = 1000000, N = 100
:
So the difference increases with fraction k_max/N
, but python is not faster even for N
much bigger than k_max
, e. g. k_max = 100, N = 100000
:
Obviously, the main speed difference between C/C++ and Python is in the for
loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. Advantages of using NumPy in your code consists of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.
Python script:
import numpy as np
from time import time
N = 10000
x_len = 100000
def main():
x = np.ones(x_len, dtype=np.float64) * 1.2345
start = time()
for i in xrange(N):
y1 = np.cos(x, dtype=np.float64)
end = time()
print('cos: {} s'.format(end-start))
start = time()
for i in xrange(N):
y2 = x * 7.9463
end = time()
print('multi: {} s'.format(end-start))
start = time()
for i in xrange(N):
res = np.sum(x, dtype=np.float64)
end = time()
print('sum: {} s'.format(end-start))
return y1, y2, res
if __name__ == '__main__':
main()
# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s
C script:
#include <math.h>
#include <stdio.h>
#include <time.h>
const int N = 10000;
const int x_len = 100000;
int main()
{
clock_t t_start, t_end;
double x[x_len], y1[x_len], y2[x_len], res, time;
int i, j;
for( i = 0; i < x_len; i++ )
{
x[i] = 1.2345;
}
t_start = clock();
for( j = 0; j < N; j++ )
{
for( i = 0; i < x_len; i++ )
{
y1[i] = cos(x[i]);
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("cos: %f s\n", time);
t_start = clock();
for( j = 0; j < N; j++ )
{
for( i = 0; i < x_len; i++ )
{
y2[i] = x[i] * 7.9463;
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("multi: %f s\n", time);
t_start = clock();
for( j = 0; j < N; j++ )
{
res = 0.0;
for( i = 0; i < x_len; i++ )
{
res += x[i];
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("sum: %f s\n", time);
return y1[0], y2[0], res;
}
// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s
Python results:
C results:
As you can see NumPy is incredibly fast, but always a bit slower than pure C.
Upvotes: 64
Reputation: 490808
On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).
I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++. That gave me this:
#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>
const unsigned int k_max = 40000;
const unsigned int N = 10000;
template <class T>
class matrix2 {
std::vector<T> data;
size_t cols;
size_t rows;
public:
matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
T &operator()(size_t y, size_t x) {
assert(x <= cols);
assert(y <= rows);
return data[y*cols + x];
}
T operator()(size_t y, size_t x) const {
assert(x <= cols);
assert(y <= rows);
return data[y*cols + x];
}
};
int main() {
matrix2<double> data(N, 2);
matrix2<double> coeffs(k_max, 2);
using namespace std::chrono;
auto start = high_resolution_clock::now();
for (int k = 0; k < k_max; k++) {
for (int j = 0; j < N - 1; j++) {
coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
}
}
auto end = high_resolution_clock::now();
std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}
This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.
The next obvious step would be to use multiple cores. To do that in C++, we can add this line:
#pragma omp parallel for
...before the outer for
loop:
#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
for (int j = 0; j < N - 1; j++) {
coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
}
}
With -openmp
added to the compiler's command line (though the exact flag depends on the compiler you're using, of course), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that though (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common that fewer).
Upvotes: 7
Reputation: 39
I tried to understand your Python code and reproduce it in C++. I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs, hence should switch your for-loops. If this is the case, you should have the following:
#include <iostream>
#include <cmath>
#include <time.h>
const int k_max = 40000;
const int N = 10000;
double cos_k, sin_k;
int main(int argc, char const *argv[])
{
time_t start, stop;
double data[2][N];
double coefs[k_max][2];
time(&start);
for(int i=0; i<k_max; ++i)
{
for(int j=0; j<N; ++j)
{
coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
}
}
// End of main loop
time(&stop);
// Speed result
double diff = difftime(stop, start);
std::cout << "Time: " << diff << " seconds" << std::endl;
return 0;
}
Switching the for-loops gives me: 3 seconds for C++ code, optimized with -O3, while Python code runs at 7.816 seconds.
Upvotes: 3