coredump

Reputation: 38789

How to reduce execution time differences between C API and Python executable?

Running the same Python script with the python3 executable or through an embedded interpreter linked against libpython3 gives different execution times.

$ time PYTHONPATH=. ./simple
real    0m6,201s
user    1m3,680s
sys     0m0,212s

$ time PYTHONPATH=. python3 -c 'import test; test.run()'
real    0m5,193s
user    0m53,349s
sys     0m0,164s

(removing the content of __pycache__ between runs does not seem to have an impact)

Currently, calling python3 with the script is faster; in my actual use case it is about 1.5× faster than the same script run from within an embedded interpreter.

I would like to (1) understand where the difference comes from and (2) know whether it is possible to get the same performance with an embedded interpreter (using e.g. Cython is currently not an option).
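To separate interpreter startup cost from the workload itself, the call can also be timed from inside Python. A sketch (the `timed` helper is my own illustration, not part of the benchmark):

```python
import sys
import time

def timed(fn, *args):
    """Run fn and report its wall-clock duration on stderr.

    Timing inside the interpreter excludes startup and linking
    differences between launchers, isolating the workload itself."""
    t0 = time.perf_counter()
    result = fn(*args)
    print("%s: %.3fs" % (fn.__name__, time.perf_counter() - t0),
          file=sys.stderr)
    return result

# e.g. under both launchers: import test; timed(test.run)
```

If the inside-Python numbers agree between the two launchers, the difference lies in startup; if they disagree, the interpreters themselves run the workload at different speeds.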

Code

simple.cpp
#include <Python.h>

int main()
{
        Py_Initialize();
        const char* pythonScript = "import test; test.run()";
        int result = PyRun_SimpleString(pythonScript);
        Py_Finalize();
        return result;
}

Compilation:

 g++ -std=c++11 -fPIC $(python3-config --cflags) simple.cpp \
 $(python3-config --ldflags) -o simple

test.py
import sys
sys.stdout = open('output.bin', 'bw')
import mandel
def run():
    mandel.mandelbrot(4096)
mandel.py

Tweaked version of benchmarks-game's Mandelbrot (see License)

from contextlib import closing
from itertools import islice
from os import cpu_count
from sys import stdout

def pixels(y, n, abs):
    range7 = bytearray(range(7))
    pixel_bits = bytearray(128 >> pos for pos in range(8))
    c1 = 2. / float(n)
    c0 = -1.5 + 1j * y * c1 - 1j
    x = 0
    while True:
        pixel = 0
        c = x * c1 + c0
        for pixel_bit in pixel_bits:
            z = c
            for _ in range7:
                for _ in range7:
                    z = z * z + c
                if abs(z) >= 2.: break
            else:
                pixel += pixel_bit
            c += c1
        yield pixel
        x += 8

def compute_row(p):
    y, n = p

    result = bytearray(islice(pixels(y, n, abs), (n + 7) // 8))
    if n % 8:
        result[-1] &= 0xff << (8 - n % 8)  # clear padding bits in the last byte
    return y, result

def ordered_rows(rows, n):
    order = [None] * n
    i = 0
    j = n
    while i < len(order):
        if j > 0:
            row = next(rows)
            order[row[0]] = row
            j -= 1

        if order[i]:
            yield order[i]
            order[i] = None
            i += 1

def compute_rows(n, f):
    row_jobs = ((y, n) for y in range(n))

    if (cpu_count() or 1) < 2:  # cpu_count() may return None
        yield from map(f, row_jobs)
    else:
        from multiprocessing import Pool
        with Pool() as pool:
            unordered_rows = pool.imap_unordered(f, row_jobs)
            yield from ordered_rows(unordered_rows, n)

def mandelbrot(n):
    write = stdout.write

    with closing(compute_rows(n, compute_row)) as rows:
        write("P4\n{0} {0}\n".format(n).encode())
        for row in rows:
            write(row[1])
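For readers puzzling over ordered_rows above: it buffers rows that arrive out of order from the worker pool and yields them in row-index order. A standalone sketch of the same logic (the sample data is made up):

```python
def ordered_rows(rows, n):
    """Yield (index, data) pairs in index order, buffering
    out-of-order arrivals until their turn comes (same logic
    as the function in mandel.py)."""
    order = [None] * n
    i = 0
    j = n
    while i < len(order):
        if j > 0:
            row = next(rows)
            order[row[0]] = row
            j -= 1
        if order[i]:
            yield order[i]
            order[i] = None
            i += 1

shuffled = iter([(2, b'c'), (0, b'a'), (1, b'b')])
print(list(ordered_rows(shuffled, 3)))  # [(0, b'a'), (1, b'b'), (2, b'c')]
```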

Upvotes: 1

Views: 155

Answers (1)

coredump

Reputation: 38789

So apparently the time difference comes from linking with libpython statically vs. dynamically. In a Makefile sitting next to python.c (from the reference implementation), the following builds a statically linked version of the interpreter:

snake: python.c
    g++ \
    -I/usr/include/python3.6m \
    -pthread \
    -specs=/usr/share/dpkg/no-pie-link.specs \
    -specs=/usr/share/dpkg/no-pie-compile.specs \
    \
    -Wall \
    -Wformat \
    -Werror=format-security \
    -Wno-unused-result \
    -Wsign-compare \
    -DNDEBUG \
    -g \
    -fwrapv \
    -fstack-protector \
    -O3 \
    \
    -Xlinker -export-dynamic \
    -Wl,-Bsymbolic-functions \
    -Wl,-z,relro \
    -Wl,-O1 \
    python.c \
    /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.a \
    -lexpat \
    -lpthread \
    -ldl \
    -lutil \
    -lexpat \
    -L/usr/lib \
    -lz \
    -lm \
    -o $@

Replacing the line /usr/lib/.../libpython3.6m.a with -lpython3.6m builds the version that ends up being slower (you also need -L/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu).
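As a quick sanity check, any running interpreter can report how its own libpython was built (a sketch; this reflects how the Python installation was configured, not how a particular embedding binary was linked — for that, inspect the executable with ldd):

```python
import sysconfig

# Py_ENABLE_SHARED is 1 when Python was built with a shared
# libpython; 0 means libpython is linked in statically.
print(sysconfig.get_config_var('Py_ENABLE_SHARED'))
print(sysconfig.get_config_var('LIBDIR'))  # where libpython lives
```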


Epilog

The difference in speed exists, but it is not the full answer to my original problem; in practice the "slower" interpreter was executed under a specific LD_PRELOAD environment which changed how system time functions behave, in a way that interfered with cProfile.
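If a preloaded library tampers with the default clock, cProfile can be pointed at an explicit timer instead. A sketch (work is a stand-in for the real workload):

```python
import cProfile
import time

def work():
    # stand-in workload
    return sum(i * i for i in range(100_000))

# cProfile.Profile accepts a custom timer function; passing
# time.monotonic (or time.process_time) sidesteps a default
# clock patched via LD_PRELOAD.
profiler = cProfile.Profile(timer=time.monotonic)
result = profiler.runcall(work)
profiler.print_stats()
```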

Upvotes: 1
