EMChamp

Reputation: 449

Overhead in creating a thread vs process on Linux

I am trying to answer the question of how much overhead there is in creating a thread versus a process in Python. I modified code from a similar question, which basically runs a function with a number of threads, then runs the same function with the same number of processes, and reports the time for each.

import time
NUM_RANGE = 100000000

from multiprocessing import Process
import threading

def timefunc(f):
    t = time.time()
    f()
    return time.time() - t

def multiprocess():
    class MultiProcess(Process):
        def __init__(self):
            Process.__init__(self)

        def run(self):
            # Simple CPU-bound loop to exercise the worker
            for i in xrange(NUM_RANGE):
                a = 20 * 20

    for _ in xrange(300):
        MultiProcess().start()

def multithreading():
    class MultiThread(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)

        def run(self):
            # Simple CPU-bound loop to exercise the worker
            for i in xrange(NUM_RANGE):
                a = 20 * 20

    for _ in xrange(300):
        MultiThread().start()

print "process run time: " + str(timefunc(multiprocess))
print "thread run time: " + str(timefunc(multithreading))

I got 7.9 s for multiprocessing and 7.9 s for multithreading.

The main question I'm trying to answer is whether it is appropriate to use multithreading or multiprocessing for thousands of network requests, on Linux specifically. According to this code the two seem the same in terms of startup time, but perhaps processes are much heavier in memory usage?

Upvotes: 5

Views: 4691

Answers (3)

Darkonaut

Reputation: 21644

Your code is not suitable for benchmarking start-up times of processes versus threads. Multithreaded Python code (in CPython) runs on a single core at a time: while one thread holds the global interpreter lock (GIL), no other thread in the process can execute Python bytecode. This means threads give you concurrency, but not true parallelism, as far as Python bytecode is concerned.

Your example mainly benchmarks a specific CPU-bound workload (running a calculation in a tight loop), something you wouldn't use threads for anyway. If you want to measure creation overhead, you must strip everything but the creation itself from your benchmark (as far as possible).
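As a quick illustration (a minimal sketch, separate from the benchmark below): a pure-Python, CPU-bound function takes about as long split across two threads as it does run twice sequentially, because the GIL lets only one thread execute bytecode at a time.

import time
from threading import Thread

def cpu_bound():
    # Pure-Python busy loop; the running thread holds the GIL throughout.
    for _ in range(10_000_000):
        pass

t = time.perf_counter()
cpu_bound()
cpu_bound()
print(f"sequential:  {time.perf_counter() - t:.2f} s")

t = time.perf_counter()
threads = [Thread(target=cpu_bound) for _ in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(f"two threads: {time.perf_counter() - t:.2f} s")  # roughly the same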


TL;DR

Starting up a thread (benchmarked on Ubuntu 18.04) is many times cheaper than starting up a process.

Compared to thread start-up, process start-up with the specified start_methods takes:

  • fork: ~33x longer
  • forkserver: ~6693x longer
  • spawn: ~7558x longer

Full results at the bottom.


Benchmark

I recently upgraded to Ubuntu 18.04 and tested start-up with a script that is hopefully closer to the truth. Note that this code is Python 3.

Some utilities for formatting and comparing the test results:

# thread_vs_proc_start_up.py
import sys
import time
import pandas as pd
from threading import Thread
import multiprocessing as mp
from multiprocessing import Process, Pipe


def format_secs(sec, decimals=2) -> str:
    """Format subseconds.

    Example:
    >>> format_secs(0.000_000_001)
    # Out: '1.00 ns'
    """
    if sec < 1e-6:
        return f"{sec * 1e9:.{decimals}f} ns"
    elif sec < 1e-3:
        return f"{sec * 1e6:.{decimals}f} µs"
    elif sec < 1:
        return f"{sec * 1e3:.{decimals}f} ms"
    elif sec >= 1:
        return f"{sec:.{decimals}f} s"

def compare(value, base):
    """Return x-times relation of value and base."""
    return f"{(value / base):.2f}x"


def display_results(executor, result_series):
    """Display results for Executor."""
    exe_str = str(executor).split(".")[-1].strip('\'>')
    print(f"\nresults for {exe_str}:\n")

    print(result_series.describe().to_string(), "\n")
    print(f"Minimum with {format_secs(result_series.min())}")
    print("-" * 60)

The benchmark functions are below. For every single test out of n_runs, a fresh pipe is created. A new Process or Thread (an "executor") starts, and the target function calc_start_up_time immediately sends back the time difference. That's all.

def calc_start_up_time(pipe_in, start):
    pipe_in.send(time.perf_counter() - start)
    pipe_in.close()


def run(executor, n_runs):

    results = []
    for _ in range(int(n_runs)):
        pipe_out, pipe_in = Pipe(duplex=False)
        exe = executor(target=calc_start_up_time,
                       args=(pipe_in, time.perf_counter()))
        exe.start()
        # Note: Measuring only the time for exe.start() returning like:
        # start = time.perf_counter()
        # exe.start()
        # end = time.perf_counter()
        # would not include the full time a new process needs to become
        # production ready.
        results.append(pipe_out.recv())
        pipe_out.close()
        exe.join()

    result_series = pd.Series(results)
    display_results(executor, result_series)
    return result_series.min()

It's built to be started from the terminal with the start_method and the number of runs passed as command-line arguments. The benchmark always runs n_runs of process start-up with the specified start_method (available on Ubuntu 18.04: fork, spawn, forkserver) and then compares it with n_runs of thread start-up. Results focus on minimums because they show how fast it can get.

if __name__ == '__main__':

    # Usage:
    # ------
    # Start from terminal with start_method and number of runs as arguments:
    #   $python thread_vs_proc_start_up.py fork 100
    #
    # Get all available start methods on your system with:
    # >>>import multiprocessing as mp
    # >>>mp.get_all_start_methods()

    start_method, n_runs = sys.argv[1:]
    mp.set_start_method(start_method)

    mins = []
    for executor in [Process, Thread]:
        mins.append(run(executor, n_runs))
    print(f"Minimum start-up time for processes takes "
          f"{compare(*mins)} "
          f"longer than for threads.")


Results

with n_runs=1000 on my rusty machine:

# Ubuntu 18.04 start_method: fork
# ================================
results for Process:

count    1000.000000
mean        0.002081
std         0.000288
min         0.001466
25%         0.001866
50%         0.001973
75%         0.002268
max         0.003365 

Minimum with 1.47 ms
------------------------------------------------------------

results for Thread:

count    1000.000000
mean        0.000054
std         0.000013
min         0.000044
25%         0.000047
50%         0.000051
75%         0.000058
max         0.000319 

Minimum with 43.89 µs
------------------------------------------------------------
Minimum start-up time for processes takes 33.41x longer than for threads.

# Ubuntu 18.04 start_method: spawn
# ================================

results for Process:

count    1000.000000
mean        0.333502
std         0.008068
min         0.321796
25%         0.328776
50%         0.331763
75%         0.336045
max         0.415568 

Minimum with 321.80 ms
------------------------------------------------------------

results for Thread:

count    1000.000000
mean        0.000056
std         0.000016
min         0.000043
25%         0.000046
50%         0.000048
75%         0.000065
max         0.000231 

Minimum with 42.58 µs
------------------------------------------------------------
Minimum start-up time for processes takes 7557.80x longer than for threads.

# Ubuntu 18.04 start_method: forkserver
# =====================================


results for Process:

count    1000.000000
mean        0.295011
std         0.007157
min         0.287871
25%         0.291440
50%         0.293263
75%         0.296185
max         0.361581 

Minimum with 287.87 ms
------------------------------------------------------------

results for Thread:

count    1000.000000
mean        0.000055
std         0.000014
min         0.000043
25%         0.000045
50%         0.000047
75%         0.000064
max         0.000251 

Minimum with 43.01 µs
------------------------------------------------------------
Minimum start-up time for processes takes 6693.44x longer than for threads.

Upvotes: 9

DrM

Reputation: 2515

To answer your questions, we need to understand some basics of threading and multiprocessing in Python. It turns out that the issue is not so much the start-up overhead, but rather how each distributes the running load over the system's resources.

First, threading in Python does not behave like threading in Linux. Linux creates a new lightweight process for each thread, and these can run on different CPU cores; a Python script and its threads, by contrast, execute on only one CPU core at any given instant (because of the GIL). If you want true multiprocessing in Python, you have to use the multiprocessing interface.

To demonstrate the above, run the Linux system monitor, select the Resources tab, and then, in a different terminal window, try running each of the two code snippets inserted below. The Resources tab shows you the load on each CPU core.

The second important issue is that you want to handle thousands of incoming connections at the same time. You are probably going to need the multiprocessing interface for that, but there may be limits on how many processes and connections you can accommodate, whether configured in Linux or imposed by bottlenecks in scheduling or resources, e.g., the hardware.
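If you want to inspect those configured limits from Python, here is a minimal sketch using the standard resource module (RLIMIT_NPROC is Linux/BSD-specific):

import resource

# Soft/hard caps as configured in the OS (compare with `ulimit -a`):
max_procs = resource.getrlimit(resource.RLIMIT_NPROC)   # processes per user
max_fds = resource.getrlimit(resource.RLIMIT_NOFILE)    # open file descriptors
print("max processes (soft, hard):", max_procs)
print("max open fds  (soft, hard):", max_fds)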

One way to handle this, if you choose not to have a very large number of processes active at one time, is to create a fixed number of processes, store them in a list, and then pass the incoming connections to them as they come in. When all of the processes are busy, you wait. You will need at least one counting semaphore (or an equivalent) for this; a sketch follows below.
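A minimal sketch of that fixed-pool scheme (Python 3). It hands connections to the workers through a bounded JoinableQueue, which does the waiting a counting semaphore would otherwise do; serve() and the integer connection IDs are placeholders:

from multiprocessing import JoinableQueue, Process

N_WORKERS = 4

def serve(q):
    while True:
        conn = q.get()                    # block until a connection is handed over
        try:
            print("serving", conn)        # ... handle the connection here ...
        finally:
            q.task_done()

if __name__ == '__main__':
    q = JoinableQueue(maxsize=N_WORKERS)  # put() blocks while all workers are busy
    workers = [Process(target=serve, args=(q,), daemon=True)
               for _ in range(N_WORKERS)]
    for w in workers:
        w.start()
    for conn_id in range(20):             # pretend twenty connections arrive
        q.put(conn_id)
    q.join()                              # wait until all handed-over work is done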

If you want to create processes as the connections come in, you can again use a counting semaphore to limit how many processes run at one time. You initialize the counting semaphore to the maximum number, decrement it for each process that you create, and increment it when a process exits. As above, when you reach the maximum number of allowed processes, you wait.
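A minimal sketch of that second scheme using a multiprocessing.BoundedSemaphore (worker() and the integer connection IDs are placeholders):

import time
from multiprocessing import BoundedSemaphore, Process

MAX_WORKERS = 4                           # upper bound on live processes

def worker(conn_id, slots):
    try:
        time.sleep(0.5)                   # stand-in for servicing the connection
        print("handled connection", conn_id)
    finally:
        slots.release()                   # increment: free the slot on exit

if __name__ == '__main__':
    slots = BoundedSemaphore(MAX_WORKERS)
    for conn_id in range(10):             # pretend ten connections arrive
        slots.acquire()                   # decrement: waits while MAX_WORKERS are busy
        Process(target=worker, args=(conn_id, slots)).start()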

Okay, here are the code examples for threading and multiprocessing. The first starts 5 threads; the second starts 5 processes. You can change the count with one edit to work your way up to 100, 1000, and so on. The integer-processing loops in each let you watch the load in the Linux system monitor.

#!/usr/bin/python

# Parallel code with shared variables, using threads
from threading import Lock, Thread
from time import sleep

# Variables to be shared across threads
counter = 0
run = True
lock = Lock()

# Function to be executed in parallel
def myfunc():

    # Declare shared variables
    global run
    global counter
    global lock

    # Processing to be done until told to exit
    while run:
        n = 0
        for i in range(10000):
            n = n+i*i
        print( n )
        sleep( 1 )

        # Increment the counter
        lock.acquire()
        counter = counter + 1
        lock.release()

    # Set the counter to show that we exited
    lock.acquire()
    counter = -1
    lock.release()
    print( 'thread exit' )

# ----------------------------

# Launch the parallel function in a set of threads
tlist = []
for n in range(5):
    thread = Thread(target=myfunc)
    thread.start()
    tlist.append(thread)

# Read and print the counter
while counter < 5:
    print( counter )
    n = 0
    for i in range(10000):
        n = n+i*i
    print( n )
    #sleep( 1 )

# Change the counter    
lock.acquire()
counter = 0
lock.release()

# Read and print the counter
while counter < 5:
    print( counter )
    n = 0
    for i in range(10000):
        n = n+i*i
    print( n )
    #sleep( 1 )

# Tell the threads to exit and wait for them to exit
run = False

for thread in tlist:
    thread.join()

# Confirm that the threads set the counter on exit
print( counter )

And here's the multiprocessing version:

#!/usr/bin/python

from time import sleep
from multiprocessing import Process, Value, Lock

def myfunc(counter, lock, run):

    while run.value:
        sleep(1)
        n=0
        for i in range(10000):
            n = n+i*i
        print( n )
        with lock:
            counter.value += 1
            print( "thread %d"%counter.value )

    with lock:
        counter.value = -1
        print( "thread exit %d"%counter.value )

# -----------------------

# Variables to be shared across processes
counter = Value('i', 0)
run = Value('b', True)
lock = Lock()

# Launch the parallel function in a set of processes
plist = []
for n in range(5):
    p = Process(target=myfunc, args=(counter, lock, run))
    p.start()
    plist.append(p)


while counter.value < 5:
    print( "main %d"%counter.value )
    n=0
    for i in range(10000):
        n = n+i*i
    print( n )
    sleep(1)

with lock:
    counter.value = 0

while counter.value < 5:
    print( "main %d"%counter.value )
    sleep(1)

# Tell the processes to exit and wait for them to exit
run.value = False

for p in plist:
    p.join()

print( "main exit %d"%counter.value)

Upvotes: 1

sehafoc

Reputation: 876

It depends... and perhaps "both" could be the answer you're looking for.

Multiprocessing in Python uses a standard fork() call on Linux to copy the main process. In the case of your minimal program that's probably not much data, but depending on how the end program is structured there could be much more data to copy, so to speak. In the minimal case, the per-process memory overhead is pretty small.

Threading will not have this memory-overhead problem, but it has another potential problem besides start-up time that you may have to worry about... the GIL. The GIL probably won't be a problem if your threads are largely blocked waiting on I/O, but if you're just running a loop like the one in your test, only one thread will actually execute Python bytecode at a time...
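To see the difference blocked I/O makes, here is a minimal sketch where sleep() stands in for a blocked network call; sleeping releases the GIL, so ten threads finish in about one second rather than ten:

import time
from threading import Thread

def io_bound():
    time.sleep(1)                 # stands in for waiting on the network

threads = [Thread(target=io_bound) for _ in range(10)]
t = time.time()
for th in threads:
    th.start()
for th in threads:
    th.join()
print("elapsed: %.2f s" % (time.time() - t))   # ~1 s, not ~10 s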

In other words: even though you got the same time in your test, there is a lot going on under the covers that a simple test like this won't catch.

For a running program, the right answer probably isn't the start-up time; it will depend more on:

  • What is each thread or process going to be doing?
  • What memory and state does it need access to; will locking be a concern?
  • In Python, will the GIL be problematic for the workload (is one executing thread at a time going to be enough)?
  • Is multiplying the process footprint by the number of processes an acceptable amount of memory?

A basic rule of thumb I follow: if the thread/process is mainly going to be blocked on I/O (waiting for network traffic or something), use a thread. If you have more computationally heavy requirements and memory isn't a concern, use a process.

An exception to that rule is how you want to handle the memory and state of the processes or threads. When you start talking about large numbers of threads and processes like this, you'll probably have memory access and lock contention to think about...

Realistically though, without more data it's hard to make a good suggestion. Concurrent programming is one of those things that many people do, but few really understand well (in my experience).

Some additional stuff to look into would be restructuring the program to require fewer threads. Typically, when making network servers and clients, I end up using threads, with a single listener thread and a single sender thread that block on a queue or a socket waiting for something to do. You may want fewer listeners and senders just feeding queues, thus limiting the overhead. I think the asyncio library in Python 3.5+ could simplify your life as well; see the sketch below.
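For instance, a minimal asyncio sketch (Python 3.7+ APIs; the host, port, and raw HTTP request are placeholders) that runs many connections concurrently in a single thread:

import asyncio

async def fetch(host, port):
    # One connection; each await yields to the event loop, no extra threads.
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"HEAD / HTTP/1.0\r\n\r\n")
    await writer.drain()
    data = await reader.read(100)          # first bytes of the reply
    writer.close()
    await writer.wait_closed()
    return data

async def main():
    tasks = [fetch("example.org", 80) for _ in range(100)]
    replies = await asyncio.gather(*tasks)
    print(len(replies), "replies")

asyncio.run(main())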

I know I didn't really answer your question, but I hope I provided some things to look up and check out.

Hope that helps!

Upvotes: 1
