Jin

Reputation: 1223

How to parallelize my Python code

I have a large file as input to my Python code, and it produces a corresponding output file. However, it takes too much time and I want to speed it up.

Right now, I split the large file into 1000 smaller files. I want a small script that will launch 1000 threads, each of which runs my original Python code on one of the smaller files and writes its own output file.

Can anyone give me a sample/example code?

Upvotes: 0

Views: 206

Answers (3)

Cld

Reputation: 481

  • If you don't have 1000 processors, splitting into 1000 pieces gains you nothing; on the contrary, it adds a lot of overhead.
  • Multithreading is for handling blocking I/O more efficiently, not for parallelizing processing work.
  • If your bottleneck is I/O on a single device, hitting it from more threads will only increase its load and the overhead (head seeks, cache thrashing, ...).

What you are looking for is multiprocessing: https://docs.python.org/2/library/multiprocessing.html
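
A minimal sketch of that (assuming a hypothetical process_file function that runs the existing per-file logic on one chunk and writes that chunk's own output file; the chunk names are made up as well):

import multiprocessing

def process_file(file_name):
    # Hypothetical worker: run the original single-file logic on one chunk
    # and write that chunk's own output file.
    with open(file_name) as f, open(file_name + '.out', 'w') as out:
        for line in f:
            out.write(line)  # replace with the real processing

if __name__ == '__main__':
    file_names = ['chunk_%03d' % i for i in range(16)]  # assumed chunk names
    pool = multiprocessing.Pool()       # defaults to one worker per CPU core
    pool.map(process_file, file_names)  # distribute the chunks across workers
    pool.close()
    pool.join()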

Upvotes: 1

abarnert

Reputation: 365717

First, using 1000 threads will almost certainly slow things down, not speed it up. Even if your code is completely I/O bound, 1000 is pushing the limits of many platforms' schedulers, and you'll spend more time context switching than doing actual work.

Next, you need to know whether your code is CPU-bound (that is, doing actual processing on information in memory) or I/O-bound (that is, waiting on things like disk reads and writes).


If your code is CPU-bound, and you can keep the CPU busy pretty consistently, you want exactly 1 thread per core. That way, you get the maximum amount of parallelism with the minimum amount of context switching (and cache thrashing, assuming most of the work is done on either immutable or non-shared values).

Also (unless that work is being done in specially-designed C extensions like numpy), you want these threads to be in separate processes, because only 1 thread per process can run the Python interpreter at a time, thanks to the Global Interpreter Lock.

So, what you want is almost certainly a process pool. The easiest way to do that is to use the concurrent.futures.ProcessPoolExecutor, possibly with a max_workers argument (maybe start with 16, then try tweaking it up and down to see if it helps).
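
Just as a sketch (not your actual code): assume a hypothetical process_file function that runs your existing per-file logic on one of the smaller files and writes that file's own output; the chunk names are made up too.

import concurrent.futures

def process_file(file_name):
    # Hypothetical worker: apply the original single-file logic to one chunk
    # and write that chunk's own output file.
    output_name = file_name + '.out'
    with open(file_name) as f, open(output_name, 'w') as out:
        for line in f:
            out.write(line)  # replace with the real processing
    return output_name

if __name__ == '__main__':
    file_names = ['chunk_%03d' % i for i in range(1000)]  # assumed chunk names
    with concurrent.futures.ProcessPoolExecutor(max_workers=16) as executor:
        for output_name in executor.map(process_file, file_names):
            print(output_name)  # each chunk's finished output file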


If, on the other hand, your code is mostly I/O-bound, then a couple dozen threads is reasonable, especially if the delays are unpredictable, but not 1000. And threads in the same process will work fine, because one thread can run the Python interpreter while the others are all waiting for the OS to finish a disk operation.

So, in this case, you want a concurrent.futures.ThreadPoolExecutor.
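
A sketch under the same assumptions (the hypothetical process_file and chunk names from above); the only real difference is the executor class and a thread count in the couple-dozen range:

import concurrent.futures

def process_file(file_name):
    # Same hypothetical per-chunk worker as in the process-pool sketch.
    output_name = file_name + '.out'
    with open(file_name) as f, open(output_name, 'w') as out:
        for line in f:
            out.write(line)  # replace with the real processing
    return output_name

if __name__ == '__main__':
    file_names = ['chunk_%03d' % i for i in range(1000)]  # assumed chunk names
    with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
        # Consuming the iterator also surfaces any exception a worker raised.
        for output_name in executor.map(process_file, file_names):
            print(output_name)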


If you're not sure, and don't know how to find out, build it with a thread pool first, then use Activity Monitor (or whatever Windows now calls its process manager, or your favorite of the 300 options on Linux) to watch it run; if you end up with one core at 100% and the others below 25%, then you're too CPU-bound to be using threads. Fortunately, switching to a process pool is a trivial change: replace ThreadPoolExecutor with ProcessPoolExecutor, remove the max_workers argument so Python will pick the best default, and you're done.
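
In terms of the sketches above (same assumed process_file and file_names), the whole change is this:

# Thread-pool version:
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
    results = list(executor.map(process_file, file_names))

# Process-pool version: swap the class and drop max_workers so Python
# picks one worker per core by default.
with concurrent.futures.ProcessPoolExecutor() as executor:
    results = list(executor.map(process_file, file_names))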


In either case, the examples in the docs are good enough that there's no reason to ask for other sample code.

Upvotes: 5

Vor

Reputation: 35109

If you decide to go with multiprocessing, you will do it in a very similar way; the example below uses threads, but the structure is the same. You can try something like this:

import Queue
from threading import Thread

file_list = ['filea', 'fileb']

def do_stuff(q):
    while True:
        try:
            # Non-blocking get; raises Queue.Empty once the queue is drained.
            file_name = q.get(False)
        except Queue.Empty:
            # No work left for this thread.
            break
        # Do whatever processing you need here.
        print file_name
        q.task_done()

q = Queue.Queue(maxsize=0)
num_threads = 2

# Put all the file names on the queue before starting the workers.
for x in file_list:
    q.put(x)

# Start the worker threads; daemon threads won't keep the process alive.
for i in range(num_threads):
    worker = Thread(target=do_stuff, args=(q,))
    worker.setDaemon(True)
    worker.start()

# Block until every queued item has been marked done.
q.join()

Upvotes: 1
