Reputation: 191
I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at scanning large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to create a process/thread to search each directory concurrently; that would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single-process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords).
I have experimented with the multiprocessing API, condensing all the files into 3 files (one for each directory) and then mapping those files to memory via mmap(), but (a) I'm not sure this is the appropriate route to take, and (b) my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused by the differences between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
Upvotes: 2
Views: 846
Reputation: 14910
Multiprocessing is easier to understand and use than multithreading (IMO). For the reasoning behind that, I suggest reading this section of TAOUP. Basically, anything a thread does, a process can do too; the difference is that with threads, the programmer has to handle everything the OS would otherwise handle for him. Sharing resources (memory/files/CPU cycles)? With threads you have to learn about locking/mutexes/semaphores and so on. With processes, the OS does all of that for you.
I would suggest building 4+ processes: one to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stitch it all together, as in the sketch below.
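A minimal sketch of that layout (my own illustration, not the question's code: the directory, keyword list, and function names are placeholders, and bytes.find stands in for the Boyer-Moore-Horspool search):

    import os
    from multiprocessing import Process, Queue

    def read_files(root, work_q, n_workers):
        # Single reader: walk the tree, feed file contents to the workers.
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    work_q.put((path, f.read()))
        for _ in range(n_workers):        # one stop sentinel per worker
            work_q.put(None)

    def scan_worker(keywords, work_q, result_q):
        # Worker: scan each blob for every keyword; swap in your
        # Boyer-Moore-Horspool search for bytes.find here.
        for path, data in iter(work_q.get, None):
            hits = [kw for kw in keywords if data.find(kw) != -1]
            if hits:
                result_q.put((path, hits))
        result_q.put(None)                # tell the collector we're done

    if __name__ == '__main__':
        keywords = [b'foo', b'bar']       # placeholder for the 648 keywords
        work_q = Queue(maxsize=100)       # bounded: reader can't run away
        result_q = Queue()
        workers = [Process(target=scan_worker,
                           args=(keywords, work_q, result_q))
                   for _ in range(3)]
        for w in workers:
            w.start()
        reader = Process(target=read_files,
                         args=('/some/dir', work_q, len(workers)))
        reader.start()
        done = 0
        while done < len(workers):        # the "fifth" role: collect here
            item = result_q.get()
            if item is None:
                done += 1
            else:
                print(item)
        reader.join()
        for w in workers:
            w.join()

The bounded queue is the important design choice: it stops the reader from loading files faster than the workers can scan them, and unbounded memory growth is one plausible reason a naive version would die partway through a long run.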
This fits naturally into generators. See the genfind example, along with the gengrep example that builds on it.
Also on the same site, check out the coroutines section.
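In case those links rot: the idea looks roughly like this. This is a reconstruction of the pattern from David Beazley's generator materials (which I believe is the site being referenced), not his code verbatim; the glob pattern, keyword, and directory are placeholders.

    import os
    import fnmatch

    def gen_find(pattern, top):
        # Yield paths under 'top' whose filenames match 'pattern'.
        for dirpath, _, filenames in os.walk(top):
            for name in fnmatch.filter(filenames, pattern):
                yield os.path.join(dirpath, name)

    def gen_open(paths):
        # Yield opened file objects, one per path.
        for path in paths:
            with open(path) as f:
                yield f

    def gen_cat(files):
        # Concatenate lines from a sequence of files into one stream.
        for f in files:
            for line in f:
                yield line

    def gen_grep(keyword, lines):
        # Yield only the lines containing 'keyword'.
        for line in lines:
            if keyword in line:
                yield line

    # Nothing is read until you iterate, so memory use stays flat
    # no matter how many files the pipeline covers.
    logs = gen_find('*.log', '/some/dir')
    for hit in gen_grep('ERROR', gen_cat(gen_open(logs))):
        print(hit, end='')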
Upvotes: 1
Reputation: 69242
What to do depends on what's slowing down the process.
If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/processes will probably just slow you down further, as the read head will now be jumping all over the place as different threads get control, and you'll spend more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks, and processing is slowing you down, then you'll need to go with multiprocessing.
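To make the second case concrete, here's a minimal sketch: all reads happen sequentially in the main process, and only the keyword scanning is farmed out to a pool of worker processes. Everything here (the directory, KEYWORDS, the scan function) is illustrative, and bytes.find stands in for the actual Boyer-Moore-Horspool search.

    import os
    from itertools import islice
    from multiprocessing import Pool, cpu_count

    KEYWORDS = [b'foo', b'bar']       # placeholder keyword list

    def scan(job):
        # CPU-bound part: runs in a worker process.
        path, data = job
        return path, [kw for kw in KEYWORDS if data.find(kw) != -1]

    def jobs(root):
        # I/O-bound part: every read happens here, sequentially,
        # so the disk head never jumps between competing readers.
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    yield path, f.read()

    if __name__ == '__main__':
        n = cpu_count()
        with Pool(n) as pool:
            it = jobs('/some/dir')
            while True:
                batch = list(islice(it, n))   # hold at most n files in RAM
                if not batch:
                    break
                for path, hits in pool.map(scan, batch):
                    if hits:
                        print(path, hits)

The fixed-size batching trades a little throughput (reads and scans don't overlap within a batch) for bounded memory; to overlap them, you'd move the reads into a dedicated feeder process with a bounded queue, as in the earlier answer's layout.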
Upvotes: 5