DGoiko

Reputation: 378

Python multi-thread communication efficiency

I'm new to Python multitasking. I'm doing it in an old-fashioned way:

I'm inheriting from threading.Thread and using queue.Queue queues to send messages to/from the main thread.

This is my base threaded class:

class WorkerGenerico(threading.Thread):
    def __init__(self, task_id, input_q=None, output_q=None, keep_alive=300):
        super(WorkerGenerico, self).__init__()
        self._task_id = task_id
        if input_q is None:
            self._input_q = queue.Queue()
        else:
            if isinstance(input_q, queue.Queue):
                self._input_q = input_q
            else:
                raise TypeError("input_q must be of type queue.Queue")
        if output_q is None:
            self._output_q = queue.Queue()
        else:
            if isinstance(output_q, queue.Queue):
                self._output_q = output_q
            else:
                raise TypeError("output_q must be of type queue.Queue")
        if not isinstance(keep_alive, int):
            raise TypeError("keep_alive must be an int.")
        self._keep_alive = keep_alive
        self.stoprequest = threading.Event()

    # def run(self):
    #    Implement a loop in subclasses which checks whether self.has_orden_parada() is true in order to stop.

    def join(self, timeout=None):
        self.stoprequest.set()
        super(WorkerGenerico, self).join(timeout)

    def graceful_stop(self):
        self.stoprequest.set()

    def has_orden_parada(self):
        return self.stoprequest.is_set()

    def put(self, texto, block=True, timeout=None):
        return self._input_q.put(texto, block=block, timeout=timeout)

    def get(self, block=True, timeout=None):
        return self._output_q.get(block=block, timeout=timeout)

My question is how expensive it is to call WorkerGenerico.get() from the outside, as opposed to storing the queue in the main thread and using Queue.get() directly. Both methods look similar in performance with small, infrequent control messages; however, I guess that very frequent calls would make method B worth using.

I guess method A is more resource-consuming (it has to somehow call the method from an outer thread and pass the queue reference back; I guess the loss depends on the Python implementation); however, the final code is more readable and intuitive.

If I had to judge from experience with other languages, I'd say method B is much better. Am I right?

Method A:

def main():
    worker = WorkerGenerico(task_id=1)
    worker.start()
    print(worker.get())

Method B:

def main():
    input_q = Queue()
    output_q = Queue()
    worker = WorkerGenerico(task_id=1, input_q=input_q, output_q=output_q)
    worker.start()
    print(output_q.get())

BTW: For completeness, I'd like to share the way I'm doing it now. It's a mixture of both methods which offers a nice envelope for threads:

class EnvoltorioWorker:
    def __init__(self, task_id, input_q=None, output_q=None, keep_alive=300):
        if input_q is None:
            self._input_q = queue.Queue()
        else:
            if isinstance(input_q, queue.Queue):
                self._input_q = input_q
            else:
                raise TypeError("input_q must be of type queue.Queue")
        if output_q is None:
            self._output_q = queue.Queue()
        else:
            if isinstance(output_q, queue.Queue):
                self._output_q = output_q
            else:
                raise TypeError("output_q must be of type queue.Queue")
        # Pass the (possibly newly created) queues on, otherwise the worker
        # builds its own queues and the wrapper's queues stay disconnected.
        self.worker = WorkerGenerico(task_id, self._input_q, self._output_q, keep_alive)

    def put(self, elem, block=True, timeout=None):
        return self._input_q.put(elem, block=block, timeout=timeout)

    def get(self, block=True, timeout=None):
        return self._output_q.get(block=block, timeout=timeout)

I use EnvoltorioWorker.worker.* to call join or other outer control methods, and EnvoltorioWorker.get / EnvoltorioWorker.put to communicate with the inner worker properly, just like this:

def main():
    worker_container = EnvoltorioWorker(task_id=1)
    worker_container.worker.start()
    print(worker_container.get())

Normally I also add interfaces for start(), join() and nonwait_stop() on EnvoltorioWorker if no other access to worker is needed.
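A minimal, self-contained sketch of that delegating pattern (MiniWorker and Envoltorio are illustrative names, not the classes above) could look like this:

```python
import threading
import queue

class MiniWorker(threading.Thread):
    """Minimal stand-in for WorkerGenerico: doubles items until asked to stop."""
    def __init__(self, input_q, output_q):
        super().__init__()
        self.input_q = input_q
        self.output_q = output_q
        self.stoprequest = threading.Event()

    def run(self):
        while not self.stoprequest.is_set():
            try:
                item = self.input_q.get(timeout=0.1)
            except queue.Empty:
                continue  # re-check the stop flag
            self.output_q.put(item * 2)

class Envoltorio:
    """Wrapper exposing start/join/put/get so callers never touch .worker directly."""
    def __init__(self):
        self._input_q = queue.Queue()
        self._output_q = queue.Queue()
        self.worker = MiniWorker(self._input_q, self._output_q)

    def start(self):
        self.worker.start()

    def join(self, timeout=None):
        self.worker.stoprequest.set()   # graceful stop before joining
        self.worker.join(timeout)

    def put(self, item, block=True, timeout=None):
        self._input_q.put(item, block=block, timeout=timeout)

    def get(self, block=True, timeout=None):
        return self._output_q.get(block=block, timeout=timeout)

env = Envoltorio()
env.start()
env.put(21)
print(env.get())  # 42
env.join()
```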

It may look clumsy, and there are probably better ways to achieve this, so:

Which method (A or B) is better practice? Is inheriting from Thread the proper way to handle threads in Python? I'm using dispycos for distributed environments and similar envelopes to communicate with my threads.


Any thoughts?

Upvotes: 1

Views: 2186

Answers (1)

Darkonaut

Reputation: 21654

Your queue is not really stored in the thread. Assuming CPython here, all objects are stored on the heap and threads only have a private stack. Objects on the heap are shared across all threads in the same process.

"Memory management in Python involves a private heap containing all Python objects and data structures. The management of this private heap is ensured internally by the Python memory manager. The Python memory manager has different components which deal with various dynamic storage management aspects, like sharing, segmentation, preallocation or caching." (Python docs)

From this follows that it's not a question where your object (your queue) is located, because it's always on the heap. Variables (names) in Python are just references to these objects.
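To make this concrete, here is a minimal sketch: a worker thread and the main thread resolve the same name to the same heap object, so id() agrees across threads.

```python
import threading
import queue

q = queue.Queue()
ids = []

def worker():
    # The thread sees the very same Queue object the main thread created;
    # only the name lookup happens "in" the thread, not the object itself.
    ids.append(id(q))

t = threading.Thread(target=worker)
t.start()
t.join()

print(ids[0] == id(q))  # True: one heap object, two references
```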

The things that affect your runtime here are how many call frames you add to your stacks by nesting function / method calls and how many bytecode instructions you need. So what impact does this have on timings?


Benchmark

Consider the following dummy setup for a queue and a worker. The dummy-worker is not threaded here for simplicity because threading it doesn't affect timings in a scenario where we pretend to just drain a pre-filled queue.

class Queue:
    def get(self):
        return 1

class Worker:
    def __init__(self, queue):
        self.queue = queue
        self.quick_get = self.queue.get # a reference to a method as instance attribute

    def get(self):
        return self.queue.get()

    def quick_get_method(self):
        return self.quick_get()

As you can see, Worker has two versions of get-methods: get, written the way you define it, and quick_get_method, which is one bytecode instruction shorter, as we'll see later. The worker instance not only holds a reference to the queue instance, but also a direct reference to queue.get via self.quick_get, which is where we save one instruction.

Now the timings for benchmarking all possibilities to .get() from the fake-queue within an IPython session:

q = Queue()
w = Worker(q)

%timeit q.get()
285 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit w.get()
609 ns ± 2.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit w.quick_get()
286 ns ± 0.756 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit w.quick_get_method()
555 ns ± 0.855 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Note that there is no difference in timings between q.get() and w.quick_get(). Also note the improved timing of w.quick_get_method() compared to the traditional w.get(). Using a Worker-method to call get() on the queue still nearly doubles the timings compared to q.get() and w.quick_get(). Why is that?

It's possible to get a human-readable version of the Python bytecode instructions the interpreter executes by using the dis module.

import dis

dis.dis(q.get)
  3           0 LOAD_CONST               1 (1)
              2 RETURN_VALUE

dis.dis(w.get)
  8           0 LOAD_FAST                0 (self)
              2 LOAD_ATTR                0 (queue)
              4 LOAD_METHOD              1 (get)
              6 CALL_METHOD              0
              8 RETURN_VALUE

dis.dis(w.quick_get)
  3           0 LOAD_CONST               1 (1)
              2 RETURN_VALUE

dis.dis(w.quick_get_method)
 11           0 LOAD_FAST                0 (self)
              2 LOAD_METHOD              0 (quick_get)
              4 CALL_METHOD              0
              6 RETURN_VALUE

Keep in mind our dummy Queue.get here is just returning 1. You see that q.get is just the same as w.quick_get, which is also reflected in the timings we saw before. Note that w.quick_get_method directly loads quick_get, which is just another name / variable for the object queue.get is referencing.
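A small check (not part of the original answer) confirms that quick_get really is just another reference to the bound method q.get, saving the attribute lookup and the extra call frame:

```python
class Queue:
    def get(self):
        return 1

class Worker:
    def __init__(self, queue):
        self.queue = queue
        self.quick_get = self.queue.get  # bound method stored as instance attribute

q = Queue()
w = Worker(q)

print(w.quick_get.__func__ is Queue.get)  # True: same underlying function
print(w.quick_get.__self__ is q)          # True: bound to the same queue instance
print(w.quick_get())                      # 1, with no extra call frame added
```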

You can also print the stack depth with the help of the dis module:

def print_stack_depth(f):
    print(*[s for s in dis.code_info(f).split('\n') if
            s.startswith('Stack size:')]
    )

print_stack_depth(q.get)
Stack size:        1 
print_stack_depth(w.get)
Stack size:        2
print_stack_depth(w.quick_get)
Stack size:        1
print_stack_depth(w.quick_get_method)
Stack size:        2

Bytecode and timing differences between the different approaches imply (unsurprisingly) that adding another frame (by adding another method) accounts for the biggest performance hit.


Review

The analysis above is not an implicit plea against using extra Worker-methods to call methods on referenced objects (queue.get). For readability, logging and easier debugging, doing this is just the right thing to do. Optimizations like Worker.quick_get_method you will also find, for example, in the stdlib's multiprocessing.pool.Pool, which also uses queues internally.

To put the timings from the benchmark into perspective, a few hundred nanoseconds is not much (for Python). In Python 3 the default maximum time interval a thread can hold the GIL and hence, execute bytecode at a stretch, is 5 milliseconds. That's 5*1000*1000 nanoseconds.
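You can inspect (and tune) that interval at runtime via sys.getswitchinterval / sys.setswitchinterval:

```python
import sys

# CPython 3 documents a default GIL switch interval of 5 ms (= 5_000_000 ns).
print(sys.getswitchinterval())  # 0.005 by default

sys.setswitchinterval(0.01)     # can be tuned at runtime (seconds)
print(sys.getswitchinterval())

sys.setswitchinterval(0.005)    # restore the default
```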

A few hundred nanoseconds is also small compared to the overhead multi-threading introduces anyway. I found, for example, that adding a 20 μs sleep after a queue.put(integer) in one thread and just reading from the queue in another thread, led to an additional overhead of about 64.0 μs per iteration on average (the 20 μs sleep not included) over a range of 100k (Python 3.7.1, Ubuntu 18.04).
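A rough sketch of how such a producer/consumer overhead measurement might be set up (simplified, and not the exact benchmark from the numbers above):

```python
import threading
import queue
import time

N = 10_000
q = queue.Queue()

def producer():
    for i in range(N):
        q.put(i)
        # time.sleep(20e-6) here would emulate the 20 µs pause mentioned above

def consumer():
    for _ in range(N):
        q.get()

t0 = time.perf_counter()
p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
elapsed = time.perf_counter() - t0

# Average cost per item of handing an integer between two threads via a Queue.
print(f"{elapsed / N * 1e6:.1f} µs per item")
```

Absolute numbers will vary with Python version and machine, but the per-item cost dwarfs the few hundred nanoseconds of method-call overhead discussed above.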


Design

Regarding your question about design preference, I would definitely pick Method A over Method B, even more so if your queues are not used across multiple threads anyway. IMO the mixed creation in your last snippet unnecessarily complicates things in a case where you use just one WorkerGenerico instance internally (not a pool of worker threads). Unlike in Method A, the "threadiness" of your worker is also buried deep inside another class.

Upvotes: 2
