Reputation: 450

Why is my Python NumPy code faster than C++?

Why is this Python NumPy code,

import numpy as np
import time

k_max = 40000
N = 10000

data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)

t1 = time.time()
for k in xrange(1,k_max+1):
    cos_k = np.cos(k*data[0,:])
    sin_k = np.sin(k*data[0,:])
    coefs[k-1,0] = (data[1,-1]-data[1,0]) + np.sum(data[1,:-1]*(cos_k[:-1] - cos_k[1:]))
    coefs[k-1,1] = np.sum(data[1,:-1]*(sin_k[:-1] - sin_k[1:]))
t2 = time.time()

print('Time:')
print(t2-t1)

faster than the following C++ code?

#include <cstdio>
#include <iostream>
#include <cmath>
#include <time.h>

using namespace std;

// consts
const unsigned int k_max = 40000;
const unsigned int N = 10000;

int main()
{
    time_t start, stop;
    double diff;
    // table with data
    double data1[ N ];
    double data2[ N ];
    // table of results
    double coefs1[ k_max ];
    double coefs2[ k_max ];
    // main loop
    time( & start );
    for( unsigned int j = 1; j<N; j++ )
    {
        for( unsigned int i = 0; i<k_max; i++ )
        {
            coefs1[ i ] += data2[ j-1 ]*(cos((i+1)*data1[ j-1 ]) - cos((i+1)*data1[ j ]));
            coefs2[ i ] += data2[ j-1 ]*(sin((i+1)*data1[ j-1 ]) - sin((i+1)*data1[ j ]));
        }
    }
    // end of main loop
    time( & stop );
    // speed result
    diff = difftime( stop, start );
    cout << "Time: " << diff << " seconds";
    return 0;
}

The first one shows: "Time: 8 seconds" while the second: "Time: 11 seconds"

I know that NumPy is written in C, but I would still think that C++ example would be faster. Am I missing something? Is there a way to improve the C++ code (or the Python one)?

Version 2 of the code

I have changed the C++ code (dynamical tables to static tables) as suggested in one of the comments. The C++ code is faster now, but still much slower than the Python version.

Version 3 of the code

I have changed from debug to release mode and increased 'k' from 4000 to 40000. Now NumPy is just slightly faster (8 seconds to 11 seconds).

Upvotes: 21

Answers (5)

user904963

Reputation: 1583

The Python code can't be faster than properly-coded C++ code since Numpy is coded in C, which is about as fast as C++. There are some cases when C++ outperforms C due to the language offering optimizations that are impossible in C. Any other speed differences between C and C++ will come down to the compiler you use and the nature of the code being compiled. Comparing languages for speed is a complex topic.

When someone uses NumPy, they set the problem up by writing Python code. That Python code executes at a crawl compared to tuned C or C++ code. However, since the majority of the execution time takes place crunching the numbers in the C-written NumPy libraries, the extra time to set the problem up in Python is irrelevant. If you wrote the same thing entirely in C/C++, your total execution time might be 1% faster as a guess, because whether you wrote that code that sets up the problem in C/C++ or Python, the C-written library will account for the majority of the execution time as it performs the requested calculations. For this reason, people enjoy the niceties of Python while benefiting from the speed of C where it matters.

As an example, let's say you setup a calculation where you want to multiply two square matrices with 1 million entries each. The Python code will execute slowly but in a tiny amount of total time. Then, the C libraries will perform that tremendous matrix multiplication, accounting for the majority of the execution time.

If you are comparing pure Python to tuned C or C++, expect Python to be anywhere between 10 to 1000 times slower than C or C++, depending on what task your program is doing. Pure Python will also tend to use more memory as well. So say you wrote that huge matrix multiplication in only Python rather than using NumPy. The execution time would be tremendous compared to using C/C++ or using NumPy.

look at the Benchmark Game where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. And if I'm not mistaken, the top Python solutions end up calling compiled C or C++ libraries to do the majority of the heavy lifting, so the test isn't really comparing pure Python to C++. This same thing happens in other languages e.g. the top Java solutions tend to call code written in C/C++ in some of the problems.

There are many reasons why C/C++ are faster than pure Python.

Python is an interpreted language whereas C/C++ compiles into machine code. This extra step of interpretation adds to the execution time. Additionally, the code executes on a virtual machine, which must take some CPU and RAM resources as well. C/C++ code executes directly on the hardware.
Python has a garbage collector that must run an algorithm to analyze the state of the program at intervals to prove a chunk of allocated memory no longer needs to exist, and only then will the program release that memory. C/C++ is different. All the allocations and deallocations are done at certain points in the machine code, avoiding running a garbage collector algorithm altogether. This will impact the performance of programs that allocate and deallocate a lot of memory more than programs that don't. To gain an idea of why a garbage collector is slower, every object has to have a reference count. The algorithm does bookkeeping to understand where an object could still be used to keep it around in case it is used. Additionally, the algorithm has to handle cycles like if I had objects A, B, and C with B holding a reference to C, C holding a reference to B, and A holding a reference to B. If everything in the program holds no reference to A, the garbage collector has to figure out that it can deallocate the memory for A, B, and C.
There are no primitives in Python. Everything, including the lowly integer, is a full object with metadata and methods. In C/C++, an int is a tiny chunk of 1s and 0s directly representing the number. It is much faster to pass around and use this barebones representation of an integer rather than deal with objects that contain that representation. Let's say you want to check if two integers are equal. In C/C++, machine code does a quick, single assembly instruction to compare two integers located here and there. In Python, a method from the Integer class is called. This entails Python needs to look up the type of the object you're actually referencing, look up the name of the property on that object, check that property actually is a method, compile the method if it hasn't been jit'd, then finally it can jump to the code in the comparison method. if jit isn't enabled then the program starts executing that method's code line by line. This is also related to Python being a dynamic while C/C++ is static. Dynamic roughly means things are looked up at runtime whereas static means things are figured out at compile time and the answer is written directly into the machine code.
Everything isn't just an object but also a pointer to an object. In C/C++, you can have a chunk of memory directly housing primitives or objects, so you could have the bits of 10 integers side by side in a solid chunk of memory. In Python, you have 10 pointers side by side in memory, and you have one layer of indirection to access each integer, jumping to a random part of the heap to figure out the value of each integer. This matters for two reasons: First, indirection is slower than having none. Secondly, most computer architectures use spatial locality, meaning when you access a part of memory that is RAM, chunks of nearby memory are loaded up into the incredibly fast CPU cache alongside it. We are talking about a read to RAM being 100 times slower than a read from level 1 cache. The designers of the hardware do that, because often, when you access location 0 in memory, you access location 1 and then 2 and then 3 in memory next. So something like iterating over n integers in C/C++ is about as fast as you can iterate over n things as possible whereas you will only iterate quickly over the n pointers quickly in Python. You will then access random chunks of memory in the heap that hold the integer in question, and the n integers may not even be stored next to each other in the heap.
C/C++ has many features that map more closely to the typical operations found in an assembly language. This proximity to the actual hardware enables a programmer to write tuned code to an absurd degree. Python, on the other hand, is pretty much completely divorced from the hardware, meaning a programmer can do no such manual optimizations even if they want speed. This is a good point to say that you don't use Python for speed. You use it for quick development and hopefully fewer bugs.
In the case where C++ has higher-level abstractions, you get a boatload of options in the exact way the compiler should implement those abstractions. This control allows for writing faster code if desired. Python instead tries to make the use of its abstractions as convenient and simple as possible.
With C/C++, the compiler can perform advanced analysis and optimization when producing the machine code, because waiting a while for your code to compile is an accepted cost to the programmer's time when dealing with these languages. Since Python interprets the code line by line, it simply cannot perform similar optimizations, because "compilation" time adds to execution time! If you were to bring in a ton of heavy analysis and optimizations into the Python interpreter, your program would freeze for minutes or more before it began executing the tuned Python bytecode.
C/C++ has a ton of inconvenient undefined behavior in the name of speed. If you mess up, that is on you. The program will blissfully continue executing its machine code, and its output will become undefined. It will likely soon crash, but it could just as easily continue running for days, writing incorrect results to your database. Python is different. It is much more convenient, doing all sorts of checks to report an error as soon as one comes up. Checking things is slower than not checking things.

Upvotes: 2

Marine Galantin

Reputation: 2289

I am actually surprised that no one mentioned Linear Algebra libraries like BLAS LAPACK MKL and all...

Numpy is using complex Linear Algebra libraries ! Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code. These complex libraries are hardly matched by naive implementation of classic linear algebra computations. The simplest first example of improvement is the blocking trick.

I took the following image from the CSE lab of ETH, where they compare matrix vector multiplication for different implementation. The y-axis represents the intensity of computations (in GFLOPs); long story short, it is how fast the computations are done. The x-axis is the dimension of the matrix.

C and C++ are fast languages, but actually if you want to mimic the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (that are perhaps the closest to assembly code you can do in C++).

Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy), where the very good answer from @Jfs, and we observe: "There is no difference between C++ and numpy on my machine."

Some more reference:

Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

Upvotes: 13

Gwidryj

Reputation: 1035

I found this question interesting, because every time I encountered similar topic about the speed of NumPy (compared to C/C++) there was always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with additional layer (even a thin one).

The answer is: your C++ code is not slower than your Python code when properly compiled.

I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC.

I've computed everything again and also compared results with a pure C version of your code. I am using GCC version 4.9.2, and Python 2.7.9 (compiled from the source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main, to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled data variables with some numbers (0.1, 0.4), as it changes results. I also changed np.arrays to use doubles (dtype=np.float64), because there are doubles in C++ example. My pure C version of your code (it's similar):

#include <math.h>
#include <stdio.h>
#include <time.h>

const int k_max = 100000;
const int N = 10000;

int main(void)
{
    clock_t t_start, t_end;
    double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
    int z;
    for( z = 0; z < N; z++ )
    {
        data1[z] = 0.1;
        data2[z] = 0.4;
    }

    int i, j;
    t_start = clock();
    for( i = 0; i < k_max; i++ )
    {
        for( j = 0; j < N-1; j++ )
        {
            coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
            coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
        }
    }
    t_end = clock();

    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("Time: %f s\n", seconds);
    return coefs1[0];
}

For k_max = 100000, N = 10000 results where following:

Python 70.284362 s
C++ 69.133199 s
C 61.638186 s

Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower compared to C/C++ one. And it is.

For k_max = 1000000, N = 1000 we have:

Python 115.42766 s
C++ 70.781380 s

For k_max = 1000000, N = 100:

Python 52.86826 s
C++ 7.050597 s

So the difference increases with fraction k_max/N, but python is not faster even for N much bigger than k_max, e. g. k_max = 100, N = 100000:

Python 0.651587 s
C++ 0.568518 s

Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. Advantages of using NumPy in your code consists of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.

Python script:

import numpy as np
from time import time

N = 10000
x_len = 100000

def main():
    x = np.ones(x_len, dtype=np.float64) * 1.2345

    start = time()
    for i in xrange(N):
        y1 = np.cos(x, dtype=np.float64)
    end = time()
    print('cos: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        y2 = x * 7.9463
    end = time()
    print('multi: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        res = np.sum(x, dtype=np.float64)
    end = time()
    print('sum: {} s'.format(end-start))

    return y1, y2, res

if __name__ == '__main__':
    main()

# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s

C script:

#include <math.h>
#include <stdio.h>
#include <time.h>

const int N = 10000;
const int x_len = 100000;

int main()
{
    clock_t t_start, t_end;
    double x[x_len], y1[x_len], y2[x_len], res, time;
    int i, j;
    for( i = 0; i < x_len; i++ )
    {
        x[i] = 1.2345;
    }

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y1[i] = cos(x[i]);
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("cos: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y2[i] = x[i] * 7.9463;
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("multi: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        res = 0.0;
        for( i = 0; i < x_len; i++ )
        {
            res += x[i];
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("sum: %f s\n", time);

    return y1[0], y2[0], res;
}

// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s

Python results:

cos: 22.7199969292 s
multi: 0.841291189194 s
sum: 1.15971088409 s

C results:

cos: 20.910590 s
multi: 0.633281 s
sum: 1.153001 s

As you can see NumPy is incredibly fast, but always a bit slower than pure C.

Upvotes: 64

Jerry Coffin

Reputation: 490808

On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).

I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++. That gave me this:

#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>

const unsigned int k_max = 40000;
const unsigned int N = 10000;

template <class T>
class matrix2 {
    std::vector<T> data;
    size_t cols;
    size_t rows;
public:
    matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
    T &operator()(size_t y, size_t x) {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }

    T operator()(size_t y, size_t x) const {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }
};

int main() {
    matrix2<double> data(N, 2);
    matrix2<double> coeffs(k_max, 2);

    using namespace std::chrono;

    auto start = high_resolution_clock::now();

    for (int k = 0; k < k_max; k++) {
        for (int j = 0; j < N - 1; j++) {
            coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
            coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
        }
    }

    auto end = high_resolution_clock::now();
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}

This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.

The next obvious step would be to use multiple cores. To do that in C++, we can add this line:

#pragma omp parallel for

...before the outer for loop:

#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
    for (int j = 0; j < N - 1; j++) {
        coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
        coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
    }
}

With -openmp added to the compiler's command line (though the exact flag depends on the compiler you're using, of course), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that though (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common that fewer).

Upvotes: 7

quark

Reputation: 39

I tried to understand your Python code and reproduce it in C++. I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs, hence should switch your for-loops. If this is the case, you should have the following:

#include <iostream>
#include <cmath>
#include <time.h>

const int k_max = 40000;
const int N = 10000;

double cos_k, sin_k;

int main(int argc, char const *argv[])
{
    time_t start, stop;
    double data[2][N];
    double coefs[k_max][2];

    time(&start);

    for(int i=0; i<k_max; ++i)
    {
        for(int j=0; j<N; ++j)
        {
            coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
            coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
        }
    }
    // End of main loop

    time(&stop);

    // Speed result
    double diff = difftime(stop, start);
    std::cout << "Time: " << diff << " seconds" << std::endl;

    return 0;
}

Switching the for-loops gives me: 3 seconds for C++ code, optimized with -O3, while Python code runs at 7.816 seconds.

Upvotes: 3

Why is my Python NumPy code faster than C++?

Version 2 of the code

Version 3 of the code

Answers (5)

Related Questions