Reputation: 765
I was curious about the performance of insertion sort in C versus Python, but the results I got made me wonder whether I've done something wrong. I expected C to be faster, but not by that much.
I've profiled both programs, and the insertion-sort function is where most of the time is spent.
Here is the C function:
void
insert_sort (vec_t * vec)
{
    int j;
    for (j = 1; j < vec->n; j++){
        int key = vec->v[j];
        int i = j - 1;
        while (i >= 0 && vec->v[i] > key){
            vec->v[i+1] = vec->v[i];
            i--;
        }
        vec->v[i+1] = key;
    }
}
Here is the python function:
def insert_sort(ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j - 1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
            i -= 1
        ln[i+1] = key
The test used 10000 integers, each randomly generated between 0 and 10000.
The results for the time spent in each function were:
Am I doing something wrong here? Like I said, I expected the C code to be faster, but not by that much.
I don't want to use built-in functions or anything like that; I want to implement the algorithm myself. Is there a pythonic way of doing things that I could use in the insertion sort?
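(For reference, one common "pythonic" take on the same algorithm, a sketch that is not part of the original post and whose name `insert_sort_bisect` is mine: the algorithm is still insertion sort, but the standard `bisect` module does the position search at C speed.)

```python
# A pythonic sketch of insertion sort: bisect.insort finds the
# insertion point by binary search and inserts at C level, while the
# overall procedure remains "insert each element into a sorted prefix".
import bisect

def insert_sort_bisect(ln):
    result = []
    for x in ln:
        bisect.insort(result, x)  # binary search + C-level list insert
    return result

print(insert_sort_bisect([3, 1, 2]))  # [1, 2, 3]
```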
Upvotes: 0
Views: 1834
Reputation: 304147
What method did you use to measure the time?
Doing this sort of thing, I find Python is at least 30 times slower than C.
The C compiler may be able to use some optimisations that Python doesn't even attempt.
It might be interesting to try psyco; that type of code is well suited to it.
Building on Alex's answer, I tried Cython. In this case Cython turns the for loop and everything into pure C, so now I can compare C, Python and psyco.
Now I have this insort.py:
import psyco
import random
li = [random.randrange(10000) for _ in xrange(10000)]
def insort(ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j - 1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
            i -= 1
        ln[i+1] = key
#psyco.bind(insort)
import pyximport; pyximport.install()
import pyxinsort
def pyx_setup():
    pyxinsort.setup(li)
def pyx_insort():
    pyxinsort.insort(li)
and this pyxinsort.pyx:
cdef int ln[10000]

def insort(li):
    cdef int i,j,key
    for j in range(1, len(li)):
        key = ln[j]
        i = j-1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
            i -= 1
        ln[i+1] = key

def setup(li):
    cdef int i
    for i in range(len(li)):  # copy every element, including index 0
        ln[i] = li[i]
The code for insort is virtually identical. li is passed in only for its length. ln is the array that is actually sorted; it is prepopulated by setup, so I can isolate building the list from the sort.
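(The same kind of measurement can also be driven from a script instead of the -mtimeit command line; a sketch, with the insort body copied from insort.py above and range used instead of xrange so it also runs on Python 3:)

```python
# Script-driven version of the benchmark: time insort on a fresh copy
# of the list each run, so the input stays disordered between runs.
import random
import timeit

def insort(ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j - 1
        while i >= 0 and ln[i] > key:
            ln[i + 1] = ln[i]
            i -= 1
        ln[i + 1] = key

li = [random.randrange(10000) for _ in range(1000)]  # smaller than the post's 10000

elapsed = timeit.timeit(lambda: insort(list(li)), number=3)
print("3 runs: %.3f s" % elapsed)
```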
$ python2.5 -mtimeit -s'import inso' 'list(inso.li)'
10000 loops, best of 3: 84.5 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 21.9 sec per loop

$ python2.5 -mtimeit -s'import inso' 'list(inso.li)'
10000 loops, best of 3: 85.6 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 578 msec per loop

$ python2.5 -mtimeit -s'import inso' 'inso.pyx_setup()'
10000 loops, best of 3: 141 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.pyx_setup();inso.pyx_insort()'
10 loops, best of 3: 46.6 msec per loop
Cython beats psyco by a factor of 16 and Python by a factor of 470!
For completeness, I've included the corresponding piece of C code generated by Cython:
for (__pyx_v_j = 1; __pyx_v_j < __pyx_1; __pyx_v_j+=1) {
    __pyx_v_key = (__pyx_v_9pyxinsort_ln[__pyx_v_j]);
    __pyx_v_i = (__pyx_v_j - 1);
    while (1) {
        __pyx_2 = (__pyx_v_i >= 0);
        if (__pyx_2) {
            __pyx_2 = ((__pyx_v_9pyxinsort_ln[__pyx_v_i]) > __pyx_v_key);
        }
        if (!__pyx_2) break;
        (__pyx_v_9pyxinsort_ln[(__pyx_v_i + 1)]) = (__pyx_v_9pyxinsort_ln[__pyx_v_i]);
        __pyx_v_i -= 1;
    }
    (__pyx_v_9pyxinsort_ln[(__pyx_v_i + 1)]) = __pyx_v_key;
}
Upvotes: 2
Reputation: 881595
My first thought was that the laptop I have at hand right now, a Macbook Pro, must be comparable to but slightly better than your machine -- I don't have enough of your surrounding code to try your C example (what's a vec_t, etc, etc), but running the Python you coded gives me:
$ python -mtimeit -s'import inso' 'inso.insort(inso.li)'
10 loops, best of 3: 7.21 msec per loop
vs your 8.1 seconds. That's with your code put in insort.py, preceded by:
import random
li = [random.randrange(10000) for _ in xrange(10000)]
array doesn't help -- actually, it slows things down a bit. Then I installed psyco, the Python JIT helper (x86-only, 32-bit only), and further added:
import psyco
psyco.full()
and got:
$ python -mtimeit -s'import inso' 'inso.insort(inso.li)'
10 loops, best of 3: 207 usec per loop
so a speedup of about 7.21 / 0.000207 = 34830 times -- vs the 8.04 / 0.13 = 62 times that surprised you so much;-).
Of course, the problem is that after the first time, the list is already sorted, so insort becomes much faster. You didn't give us enough of the surrounding test harness to know exactly what you measured. A more realistic example (where the actual list isn't touched so it stays disordered, and only a copy is sorted...), without psyco:
$ python -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 13.8 sec per loop
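(The already-sorted pitfall is easy to demonstrate without a stopwatch: count the inner-loop shifts instead of wall time. A sketch; the counting wrapper insort_count is mine:)

```python
# Insertion sort instrumented to count element shifts: zero shifts on an
# already-sorted list, n*(n-1)/2 shifts on a reverse-sorted one.
def insort_count(ln):
    shifts = 0
    for j in range(1, len(ln)):
        key = ln[j]
        i = j - 1
        while i >= 0 and ln[i] > key:
            ln[i + 1] = ln[i]
            i -= 1
            shifts += 1
        ln[i + 1] = key
    return shifts

n = 2000
print(insort_count(list(range(n))))         # already sorted: 0 shifts
print(insort_count(list(range(n, 0, -1))))  # reversed: n*(n-1)//2 = 1999000 shifts
```

On a sorted input the inner while loop never runs, so the "benchmark" degenerates into a single O(n) pass over the list.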
Oops -- so your machine's WAY faster than a Macbook Pro (remember, cores don't count: we're using only one here;-) -- wow... or else, you're mismeasuring. Anyway, WITH psyco:
$ python -mtimeit -s'import inso' 'inso.insort(list(inso.li))'
10 loops, best of 3: 456 msec per loop
So psyco's speedup is only 13.8 / 0.456, about 30 times -- about half the 60+ times you got with pure-C coding. IOW, you'd expect python + psyco to be twice as slow as pure C. That's a more realistic and typical assessment.
If you were writing reasonably high-level code, psyco's speedup of it would degrade from (say) 30 times down to much less -- but so would C's advantage over Python. For example,
$ python -mtimeit -s'import inso' 'sorted(inso.li)'
100 loops, best of 3: 8.72 msec per loop
without psyco (in this case, psyco actually -- marginally -- slows down the execution;-), so that's another factor of 52 over psyco, 1582 overall over non-psyco insort.
But, when for some reason or other you have to write extremely low-level algorithms in python, rather than using the wealth of support from the builtins and stdlib, psyco can help reduce the pain.
Another point: when you benchmark, please post ALL the code, so others can see exactly what you're doing (and possibly spot gotchas) -- your "scaffolding" is as tricky, and as likely to hide traps, as the code you think you're measuring!-)
Upvotes: 5
Reputation: 76715
So, here are some lessons you should take away from this:
Interpreted Python is on the slow side. Don't try to write your own FFT, MPEG encoder, etc. in Python.
Even slow interpreted Python is probably fast enough for small problems. An 8 second run time is not horrible, and it would take you much longer to write and debug the C than the Python, so if you are writing something to run once, Python wins.
For speed in Python, try to rely on built-in features and C modules. Let someone else's C code do the heavy lifting. I worked on an embedded device where we did our work in Python; despite the slow embedded processor, the performance was decent, because C library modules were doing most of the work.
For fun and education, please repeat your Python test, this time using the built-in .sort() method on a list; it probably won't be quite as fast as the C, but it will be close. (Although for really large data sets, it will beat the C, because insertion sort sucks. If you rewrote the C to use the C library qsort() function, that would be the speed champ.)
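(A sketch of that comparison -- the insort body is copied from the question; absolute times vary by machine, so only the relative outcome matters:)

```python
# Time the pure-Python insertion sort against the built-in list.sort()
# on fresh copies of the same random data.
import random
import timeit

def insort(ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j - 1
        while i >= 0 and ln[i] > key:
            ln[i + 1] = ln[i]
            i -= 1
        ln[i + 1] = key

li = [random.randrange(10000) for _ in range(2000)]

pure_python = timeit.timeit(lambda: insort(list(li)), number=3)
built_in = timeit.timeit(lambda: list(li).sort(), number=3)
print("pure Python insort: %.4f s, built-in list.sort(): %.4f s"
      % (pure_python, built_in))
```

The built-in sort (Timsort, implemented in C) should win by orders of magnitude on input this size.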
A common Python design "pattern" is: first, write your app in Python. If it is fast enough, stop; you are done. Second, try to rewrite to improve speed; see if there is a C module you can use, for example. If it is still not fast enough, consider writing your own C module; or, write a C program, using the Python prototype code as the basis for your design.
Upvotes: 4
Reputation: 54882
Python is a dynamic language, and the standard implementation uses an interpreter to evaluate code. This means that where the compiled C code can get away with a single machine instruction, for instance when assigning to vec->v[i+1], Python's interpreter has to look up the sequence variable in the local scope, look up its class, find the item-setting method on the class, and call that method. Similarly for the comparison and the addition. Not to mention that executing almost every bytecode causes an indirect branch misprediction in the CPU, which creates a pipeline bubble.
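(That dispatch overhead is visible with the standard dis module: the single C statement vec->v[i+1] = vec->v[i] corresponds to a handful of separately-dispatched bytecodes in Python. A sketch:)

```python
# Disassemble the inner-loop assignment to see how many interpreter
# steps one "simple" statement takes: load the list, load the index,
# add 1, index, store -- each dispatched through the interpreter loop.
import dis

def inner_step(ln, i):
    ln[i + 1] = ln[i]  # one assignment at the source level

dis.dis(inner_step)
```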
This is the sort of code that would benefit a lot from JIT compilation to native code and runtime type specialization, like unladen-swallow and PyPy are starting to do.
Otherwise the code is pretty much pythonic in the sense that if one needs to implement insertion sort, this is how one would do it in Python. It's also very unpythonic because you should use the very efficient built-in sort.
Upvotes: 13