Python+C is (slightly) faster than pure C

Question

I've been implementing the same code (the number of ways to deal a hand in Blackjack without busting) in a variety of languages and implementations. One oddity I've noticed is that the implementation of Python calling the partitions function in C is actually slightly faster than the entire program written in C. The same appears to be true for other languages (Ada vs Python calling Ada, Nim vs Python calling Nim). This seems counterintuitive to me - any idea how this is possible?

The code is all in my GitHub repo here:

https://github.com/octonion/puzzles/tree/master/blackjack

Here's the C code, compiled using 'gcc -O3 outcomes.c'.

#include 

int partitions(int cards[10], int subtotal)
{
    //writeln(cards,subtotal);
    int m = 0;
    int total;
    // Hit
    for (int i = 0; i < 10; i++)
    {
        if (cards[i] > 0)
        {
            total = subtotal + i + 1;
            if (total < 21)
            {
                // Stand
                m += 1;
                // Hit again
                cards[i] -= 1;
                m += partitions(cards, total);
                cards[i] += 1;
            }
            else if (total == 21)
            {
                // Stand; hit again is an automatic bust
                m += 1;
            }
        }
    }
    return m;
}

int main(void)
{
    int deck[] =
    { 4, 4, 4, 4, 4, 4, 4, 4, 4, 16 };
    int d = 0;

    for (int i = 0; i < 10; i++)
    {
        // Dealer showing
        deck[i] -= 1;
        int p = 0;
        for (int j = 0; j < 10; j++)
        {
            deck[j] -= 1;
            int n = partitions(deck, j + 1);
            deck[j] += 1;
            p += n;
        }

        printf("Dealer showing %i partitions = %i
", i, p);
        d += p;
        deck[i] += 1;
    }
    printf("Total partitions = %i
", d);
    return 0;
}

Here's the C function, compiled using 'gcc -O3 -fPIC -shared -o libpartitions.so partitions.c'.

int partitions(int cards[10], int subtotal)
{
    int m = 0;
    int total;
    // Hit
    for (int i = 0; i < 10; i++)
    {
        if (cards[i] > 0)
        {
            total = subtotal + i + 1;
            if (total < 21)
            {
                cards[i] -= 1;
                // Stand
                m += 1;
                // Hit again
                m += partitions(cards, total);
                cards[i] += 1;
            }
            else if (total == 21)
            {
                // Stand; hit again is an automatic bust
                m += 1;
            }
        }
    }
    return m;
}

And here's the Python wrapper for the C function:

#!/usr/bin/env python

from ctypes import *
import os

test_lib = cdll.LoadLibrary(os.path.abspath("libpartitions.so"))
test_lib.partitions.argtypes = [POINTER(c_int), c_int]
test_lib.partitions.restype = c_int

deck = ([4]*9)
deck.append(16)

d = 0

for i in xrange(10):
    # Dealer showing
    deck[i] -= 1
    p = 0
    for j in xrange(10):
        deck[j] -= 1
        nums_arr = (c_int*len(deck))(*deck)
        n = test_lib.partitions(nums_arr, c_int(j+1))
        deck[j] += 1
        p += n
    print('Dealer showing ', i,' partitions =',p)
    d += p
    deck[i] += 1

print('Total partitions =',d)

hgminh · Accepted Answer

I think the reason here is how GCC compiles function partitions in 2 cases. You can compare asm code in outcomes binary executable and libpartitions.so by using objdump to see the differences.

objdump -d -M intel

When building to shared library, GCC has no idea how partitions is called. While in C program, GCC know exactly when partitions is called (in this case, however, lead to worse performance). This difference in context makes GCC does optimization differently.

You can try different compilers to compare the result. I have checked with GCC 5.4 and Clang 6.0. With GCC 5.4, the Python script runs faster while with Clang, the C program runs faster.

Python+C is (slightly) faster than pure C

Answers (1)

Related Questions