Reputation: 38789
Running the same Python script with python3 or through an embedded interpreter using libpython3 gives different execution times.
$ time PYTHONPATH=. ./simple
real 0m6,201s
user 1m3,680s
sys 0m0,212s
$ time PYTHONPATH=. python3 -c 'import test; test.run()'
real 0m5,193s
user 0m53,349s
sys 0m0,164s
(Removing the contents of __pycache__ between runs does not seem to have an impact.)
Currently, calling python3 with the script is faster; in my actual use case the factor is about 1.5x compared to the same script run from within an embedded interpreter. I would like to (1) understand where the difference comes from and (2) know whether it is possible to get the same performance with an embedded interpreter (using e.g. Cython is currently not an option).
simple.cpp
#include <Python.h>

int main()
{
    Py_Initialize();
    const char* pythonScript = "import test; test.run()";
    int result = PyRun_SimpleString(pythonScript);
    Py_Finalize();
    return result;
}
Compilation:
g++ -std=c++11 -fPIC $(python3-config --cflags) simple.cpp \
$(python3-config --ldflags) -o simple
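For reference, the exact flags that python3-config injects can be inspected directly; they vary between distributions and Python versions, but the link flags normally contain -lpython3.6m, which the linker resolves to the shared library whenever one is available:
$ python3-config --cflags
$ python3-config --ldflags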
test.py
import sys
sys.stdout = open('output.bin', 'bw')  # redirect the PBM image data to a file
import mandel

def run():
    mandel.mandelbrot(4096)
mandel.py
# Tweaked version of the benchmarks-game's Mandelbrot (see License)
from contextlib import closing
from itertools import islice
from os import cpu_count
from sys import stdout
def pixels(y, n, abs):
    range7 = bytearray(range(7))
    pixel_bits = bytearray(128 >> pos for pos in range(8))
    c1 = 2. / float(n)
    c0 = -1.5 + 1j * y * c1 - 1j
    x = 0
    while True:
        pixel = 0
        c = x * c1 + c0
        for pixel_bit in pixel_bits:
            z = c
            for _ in range7:
                for _ in range7:
                    z = z * z + c
                if abs(z) >= 2.: break
            else:
                pixel += pixel_bit
            c += c1
        yield pixel
        x += 8

def compute_row(p):
    y, n = p
    result = bytearray(islice(pixels(y, n, abs), (n + 7) // 8))
    result[-1] &= 0xff << (8 - n % 8)
    return y, result

def ordered_rows(rows, n):
    order = [None] * n
    i = 0
    j = n
    while i < len(order):
        if j > 0:
            row = next(rows)
            order[row[0]] = row
            j -= 1
        if order[i]:
            yield order[i]
            order[i] = None
            i += 1

def compute_rows(n, f):
    row_jobs = ((y, n) for y in range(n))
    if cpu_count() < 2:
        yield from map(f, row_jobs)
    else:
        from multiprocessing import Pool
        with Pool() as pool:
            unordered_rows = pool.imap_unordered(f, row_jobs)
            yield from ordered_rows(unordered_rows, n)

def mandelbrot(n):
    write = stdout.write
    with closing(compute_rows(n, compute_row)) as rows:
        write("P4\n{0} {0}\n".format(n).encode())
        for row in rows:
            write(row[1])
Upvotes: 1
Views: 155
Reputation: 38789
So apparently the time difference comes from linking libpython statically vs. dynamically. In a Makefile sitting next to python.c (from the reference implementation), the following builds a statically linked version of the interpreter:
snake: python.c
    g++ \
        -I/usr/include/python3.6m \
        -pthread \
        -specs=/usr/share/dpkg/no-pie-link.specs \
        -specs=/usr/share/dpkg/no-pie-compile.specs \
        \
        -Wall \
        -Wformat \
        -Werror=format-security \
        -Wno-unused-result \
        -Wsign-compare \
        -DNDEBUG \
        -g \
        -fwrapv \
        -fstack-protector \
        -O3 \
        \
        -Xlinker -export-dynamic \
        -Wl,-Bsymbolic-functions \
        -Wl,-z,relro \
        -Wl,-O1 \
        python.c \
        /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.a \
        -lexpat \
        -lpthread \
        -ldl \
        -lutil \
        -lexpat \
        -L/usr/lib \
        -lz \
        -lm \
        -o $@
Replacing the /usr/lib/.../libpython3.6m.a line with -lpython3.6m builds the version that ends up being slower (you also need to add -L/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu so the linker can find the library).
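As a quick sanity check (assuming a Linux system where ldd is available), the statically linked binary should show no libpython dependency at all, while the dynamically linked one should list the shared library:
$ ldd ./snake | grep libpython    # static build: expected to print nothing
$ ldd ./simple | grep libpython   # dynamic build from the question: should list libpython3.6m.so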
Epilogue
The difference in speed exists, but it is not the full answer to my original problem: in practice the "slower" interpreter was executed under a specific LD_PRELOAD environment, which changed how the system time functions behave in a way that interfered with cProfile.
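A minimal way to rule that factor out (assuming a POSIX shell with GNU env available) is to repeat the timing comparison with LD_PRELOAD cleared for both commands:
$ time env -u LD_PRELOAD PYTHONPATH=. ./simple
$ time env -u LD_PRELOAD PYTHONPATH=. python3 -c 'import test; test.run()'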
Upvotes: 1