Aleksejs Fomins

Reputation: 900

Python 3 multiprocessing on 1 core gives overhead that grows with workload

I am testing the parallel capabilities of Python 3, which I intend to use in my code. I observe unexpectedly slow behaviour, so I have boiled my code down to the following proof of principle: calculate a simple logarithmic series, once serially and once in parallel using a pool with 1 worker. One would expect the timings of these two versions to be the same, except for a small overhead associated with initializing and closing the multiprocessing.Pool. However, what I observe is that the overhead grows linearly with problem size, so the parallel solution on 1 core is significantly worse than the serial solution even for large inputs. Please tell me if I am doing something wrong.

import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt


def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])


def serial_series(rangeMax):
    return [foo(x) for x in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(foo, tuple(range(rangeMax)))

    pool.terminate()
    pool.join()

    return rez


nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []

for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)


plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)

plt.legend(['serial', 'parallel 1 core'])
plt.show()

Edit: It was commented that the overhead may be due to creating multiple jobs. Here is a modification of the parallel function that explicitly creates only 1 job. I still observe linear growth of the overhead.

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])

    pool.terminate()
    pool.join()

    return rez

Edit 2: Once more, here is the exact code that produces the linear growth. Adding a print statement inside the serial_series function confirms that it is called only once per call to parallel_series_1core.

import time
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt

def foo(x):
    return sum([np.log(1 + i*x) for i in range(10)])

def serial_series(rangeMax):
    return [foo(i) for i in range(rangeMax)]

def parallel_series_1core(rangeMax):
    pool = multiprocessing.Pool(processes=1)
    rez = pool.map(serial_series, [rangeMax])

    pool.terminate()
    pool.join()

    return rez


nTask = [1 + i ** 2 * 1000 for i in range(1, 20)]
nTimeSerial = []
nTimeParallel = []

for taskSize in nTask:
    print('TaskSize', taskSize)
    start = time.time()
    rez1 = serial_series(taskSize)
    end = time.time()
    nTimeSerial.append(end - start)

    start = time.time()
    rez2 = parallel_series_1core(taskSize)
    end = time.time()
    nTimeParallel.append(end - start)


plt.plot(nTask, nTimeSerial)
plt.plot(nTask, nTimeParallel)
plt.plot(nTask, [i / j for i,j in zip(nTimeParallel, nTimeSerial)])

plt.legend(['serial', 'parallel 1 core', 'ratio'])
plt.show()

Upvotes: 1

Views: 580

Answers (1)

zwer

Reputation: 25789

When you use Pool.map() you're essentially telling it to split the passed iterable into jobs over all available sub-processes (just one in your case) - the larger the iterable, the more 'jobs' are created on the first call. That is what initially adds a large, albeit linear, overhead (second only to the process creation itself).
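As an aside (this sketch is mine, not from the question): Pool.map() also accepts an explicit chunksize argument that controls how the iterable is split. Shipping the whole range to the single worker as one chunk would look roughly like this, assuming the same foo as in the question:

import multiprocessing

import numpy as np

def foo(x):
    return sum(np.log(1 + i * x) for i in range(10))

def parallel_series_1core_one_chunk(range_max):
    # chunksize=range_max means the whole range is dispatched as a single job
    with multiprocessing.Pool(processes=1) as pool:
        return pool.map(foo, range(range_max), chunksize=range_max)

if __name__ == '__main__':
    print(parallel_series_1core_one_chunk(10)[:3])

Note that the result list still has to be pickled back to the parent, so this only removes the per-chunk dispatch cost, not the data-transfer cost discussed below.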

Since sub-processes do not share memory, any data exchanged between processes has to be pickled on one end and unpickled on the other - on POSIX systems (thanks to forking) this applies to all changing data, while on Windows it applies to all data, even static data. On top of that, there is the time needed to clear out the process stack for the next job, plus the overhead of system-level process/thread switching (which is out of your control - you'd have to mess with the system's scheduler to reduce it).

For simple/quick tasks a single process will always trump multiprocessing.

UPDATE - As I was saying above, the additional overhead comes from the pickling/unpickling routine that Python transparently performs for any data exchanged between processes. Since the list you return from serial_series() grows with the problem size, so does the pickling/unpickling penalty. Here's a simple demonstration of it based on your code:

import math
import pickle
import time

# high-resolution timer on all platforms (time.clock was removed in Python 3.8)
get_timer = time.perf_counter

def foo(x):  # logic/computation function
    return sum([math.log(1 + i*x) for i in range(10)])

def serial_series(max_range):  # main sub-process function
    return [foo(i) for i in range(max_range)]

def serial_series_slave(max_range):  # subprocess interface
    return pickle.dumps(serial_series(pickle.loads(max_range)))

def serial_series_master(max_range):  # main process interface
    return pickle.loads(serial_series_slave(pickle.dumps(max_range)))

tasks = [1 + i ** 2 * 1000 for i in range(1, 20)]
simulated_times = []
for task in tasks:
    print("Simulated task size: {}".format(task))
    start = get_timer()
    res = serial_series_master(task)
    simulated_times.append((task, get_timer() - start))

At the end, simulated_times will contain something like:

[(1001, 0.010015994115533963), (4001, 0.03402641167313844), (9001, 0.06755546622419131),
 (16001, 0.1252664260421834), (25001, 0.18815836740279515), (36001, 0.28339434475444325),
 (49001, 0.3757235840503601), (64001, 0.4813749807557435), (81001, 0.6115452710446636),
 (100001, 0.7573718332506543), (121001, 0.9228750064147522), (144001, 1.0909038813527427),
 (169001, 1.3017281342479343), (196001, 1.4830192955746764), (225001, 1.7117389965616931),
 (256001, 1.9392146632682739), (289001, 2.19192682050668), (324001, 2.4497541011649187),
 (361001, 2.7481495578097466)]

showing that the processing time keeps growing as the returned list gets bigger. This is essentially what happens with multiprocessing - if your sub-process function didn't return anything, it would end up considerably faster.
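If you want to verify this in your own benchmark, here is a minimal sketch (the function names are my own) where the worker throws the list away and sends back only its length, so the pickled return value has a constant, tiny size - the gap between 'serial' and 'parallel 1 core' should then stop growing with the problem size:

import multiprocessing

import numpy as np

def foo(x):
    return sum(np.log(1 + i * x) for i in range(10))

def serial_series_no_return(range_max):
    # Same computation as serial_series, but only the length goes back
    # to the parent, so the pickling cost does not grow with range_max.
    return len([foo(i) for i in range(range_max)])

def parallel_series_1core_no_return(range_max):
    with multiprocessing.Pool(processes=1) as pool:
        return pool.map(serial_series_no_return, [range_max])[0]

if __name__ == '__main__':
    print(parallel_series_1core_no_return(10000))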

If you have a large amount of data you need to share among processes, I'd suggest you use some in-memory database (like Redis) and have your sub-processes connect to it to store/retrieve data.
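A minimal sketch of that idea, assuming the redis-py package is installed and a Redis server is listening on localhost:6379 (the key name and helper functions are purely illustrative):

import math
import pickle
import multiprocessing

import redis  # assumption: redis-py installed, Redis server on localhost:6379

def store_series(range_max):
    # Hypothetical worker: compute the series and park the pickled result
    # in Redis under a fixed key instead of returning it through the Pool.
    result = [sum(math.log(1 + i * x) for i in range(10)) for x in range(range_max)]
    redis.Redis().set('series_result', pickle.dumps(result))
    return True  # only this tiny flag travels back to the parent

def fetch_series():
    return pickle.loads(redis.Redis().get('series_result'))

if __name__ == '__main__':
    with multiprocessing.Pool(processes=1) as pool:
        pool.map(store_series, [10000])
    print(len(fetch_series()))

Whether this actually pays off depends on the data size and the serializer - Redis still receives the data over a socket, so it mainly helps when several processes need to read the same data repeatedly.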

Upvotes: 1
