samis13

Reputation: 31

Running C extension in Python faster than plain C

I have implemented a Python extension in C and found that executing the C function from Python is about 2x faster than executing the same C code from a C main.

But why is it faster? I would expect the plain C to perform exactly the same whether it is called from Python or from C.

Here is my experiment.

Here are my results:

Pure C - 85us

Python Extension - 36us


Here's my code:


--mmult.cpp----------

#include "mmult.h"

void mmult(int32_t a[1024],int32_t b[1024],int32_t c[1024]) {

  struct timeval t1, t2;
  gettimeofday(&t1, NULL);

  for (int i = 0; i < 32; i++) {
    for (int j = 0; j < 32; j++) {
      int32_t result = 0;
      for (int k = 0; k < 32; k++) {
        result += a[i*32+k] * b[k*32+j];
      }
      c[i*32+j] = result;
    }
  }

  gettimeofday(&t2, NULL);

  double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000;
  printf("elapsed time: %fus\n",elapsedTime);

}

--mmult.h-------

#include <stdint.h>

void mmult(int32_t a[1024],int32_t b[1024],int32_t c[1024]);

--main.cpp------

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mmult.h"

int main() {
  int32_t* a = (int32_t*)malloc(sizeof(int32_t)*1024);
  int32_t* b = (int32_t*)malloc(sizeof(int32_t)*1024);
  int32_t* c = (int32_t*)malloc(sizeof(int32_t)*1024);

  for(int i=0; i<1024; i++) {
    a[i]=i+1;
    b[i]=i+1;
    c[i]=0;
  }

  struct timeval t1, t2;
  gettimeofday(&t1, NULL);
  mmult(a,b,c);
  gettimeofday(&t2, NULL);

  double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000;
  printf("elapsed time: %fus\n",elapsedTime);
  free(a);
  free(b);
  free(c);

  return 0;
}

Here's how I compile main:

gcc -o main main.cpp mmult.cpp -O3

--wrapper.cpp-----

#include <Python.h>
#include <numpy/arrayobject.h>
#include "mmult.h"

static PyObject* mmult_wrapper(PyObject* self, PyObject* args) {
   int32_t* a;
   PyArrayObject* a_obj = NULL;
   int32_t* b;
   PyArrayObject* b_obj = NULL;
   int32_t* c;
   PyArrayObject* c_obj = NULL;

   /* "OOO" does no type checking; the arguments are assumed to be
      C-contiguous int32 numpy arrays. */
   int res = PyArg_ParseTuple(args, "OOO", &a_obj, &b_obj, &c_obj);

   if (!res)
      return NULL;

   /* Grab the raw data pointers (no copy is made). */
   a = (int32_t*) PyArray_DATA(a_obj);
   b = (int32_t*) PyArray_DATA(b_obj);
   c = (int32_t*) PyArray_DATA(c_obj);

   /* call function */
   mmult(a,b,c);

   Py_RETURN_NONE;
}

/*  define functions in module */
static PyMethodDef TheMethods[] = {
   {"mmult_wrapper", mmult_wrapper, METH_VARARGS, "your c function"},
   {NULL, NULL, 0, NULL}
};

static struct PyModuleDef cModPyDem = {
   PyModuleDef_HEAD_INIT,
   "mmult", "Some documentation",
   -1,
   TheMethods
};

PyMODINIT_FUNC
PyInit_c_module(void) {
   PyObject* retval = PyModule_Create(&cModPyDem);
   import_array();
   return retval;
}

--setup.py-----

import os
import numpy
from distutils.core import setup, Extension
cur = os.path.dirname(os.path.realpath(__file__))
c_module = Extension("c_module",
                     sources=["wrapper.cpp", "mmult.cpp"],
                     include_dirs=[cur, numpy.get_include()])
setup(ext_modules=[c_module])

--code.py-----

import c_module
import time
import numpy as np
if __name__ == "__main__":
    a = np.linspace(1, 1024, 1024, dtype='int32').reshape(32, 32)
    b = np.linspace(1, 1024, 1024, dtype='int32').reshape(32, 32)
    c = np.zeros((32, 32), dtype='int32')

    c_module.mmult_wrapper(a,b,c)

Here's how I compile the Python extension:

python3.6 setup.py build_ext --inplace

UPDATE

I've updated the mmult.cpp code to run the triple nested loop for 1,000,000 iterations internally (see the sketch below). This resulted in very similar times:

Pure C - 27us

Python Extension - 27us
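A minimal sketch of that change (the exact updated code is assumed; the repeat loop wraps the three nested loops and the total is divided by the iteration count):

void mmult_bench(int32_t a[1024], int32_t b[1024], int32_t c[1024]) {
  struct timeval t1, t2;
  gettimeofday(&t1, NULL);

  /* Repeat the whole multiplication many times so the run is long
     enough to measure. Note: with -O3 the optimizer could in principle
     move work out of the repeat loop, although the possible aliasing of
     a, b and c makes that unlikely here. */
  for (int iter = 0; iter < 1000000; iter++) {
    for (int i = 0; i < 32; i++) {
      for (int j = 0; j < 32; j++) {
        int32_t result = 0;
        for (int k = 0; k < 32; k++) {
          result += a[i*32+k] * b[k*32+j];
        }
        c[i*32+j] = result;
      }
    }
  }

  gettimeofday(&t2, NULL);
  double total = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000;
  printf("elapsed time per iteration: %fus\n", total / 1000000.0);
}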

Upvotes: 3

Views: 184

Answers (1)

85 microseconds is too small an interval to measure reliably and repeatably. For example, CPU cache effects (or context switches, or paging) may dominate the computation time and make the timing meaningless.

(I guess you are on Linux/x86-64)

As a rule of thumb, try to have a run lasting at least about half a second, and repeat the benchmark several times. You could also use time(1) for measurements.

See also time(7). There are several notions of time (elapsed "real" time, monotonic time, process CPU time, thread CPU time, etc.). You could consider using clock(3) or clock_gettime(2) to measure time.
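For instance, a standalone benchmark using clock_gettime with a monotonic clock might look like this (a sketch; it assumes a variant of mmult without the internal gettimeofday calls, and REPS is picked so the whole run lasts a few seconds):

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include "mmult.h"

int main(void) {
  static int32_t a[1024], b[1024], c[1024];
  for (int i = 0; i < 1024; i++) { a[i] = i + 1; b[i] = i + 1; }

  enum { REPS = 100000 };
  struct timespec t1, t2;
  clock_gettime(CLOCK_MONOTONIC, &t1);
  for (int r = 0; r < REPS; r++)
    mmult(a, b, c);   /* external call: the optimizer cannot drop it */
  clock_gettime(CLOCK_MONOTONIC, &t2);

  double us = (t2.tv_sec - t1.tv_sec) * 1e6 + (t2.tv_nsec - t1.tv_nsec) / 1e3;
  printf("average per call: %f us\n", us / REPS);
  return 0;
}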

BTW, you might compile with a more recent version of GCC (as of November 2017, GCC 7, with GCC 8 due shortly), and you want to compile with gcc -march=native -O3 for benchmarking purposes. Try other optimization options and tunings as well. You could also try another compiler, e.g. Clang/LLVM.

Look also at this answer (regarding parallelization) to a relevant question. The numpy package probably uses similar techniques internally (outside of the Python GIL), so it could be faster than your naive sequential matrix multiplication code in C.
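For illustration, the question's loop parallelized with OpenMP might look roughly like this (a sketch, assuming compilation with gcc -fopenmp; for matrices as small as 32x32 the threading overhead would probably outweigh any gain):

#include <stdint.h>

void mmult_omp(const int32_t a[1024], const int32_t b[1024], int32_t c[1024]) {
  /* The output rows are independent, so distribute them across threads. */
  #pragma omp parallel for
  for (int i = 0; i < 32; i++) {
    for (int j = 0; j < 32; j++) {
      int32_t result = 0;
      for (int k = 0; k < 32; k++) {
        result += a[i*32+k] * b[k*32+j];
      }
      c[i*32+j] = result;
    }
  }
}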

Upvotes: 7
