Reputation: 4071
I have a list L and a large dictionary D with keys K, where k is a specific key in K. D[k] contains information I need in order to build a list of results. Right now I iterate over every key, and if the information there is of value to me I append k to L. This works fine iteratively, but I am hoping to speed it up with multithreading. The dictionary will never be updated. What would be a good way to go about implementing this?
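A minimal sketch of what I'm doing now; the dictionary contents and the `is_interesting` predicate are made-up stand-ins for my real data and check:

```python
# Placeholder data and predicate; the real D and check are different.
D = {"a": 10, "b": 3, "c": 42}

def is_interesting(value):
    # Stand-in for the real test applied to D[k].
    return value > 5

L = []
for k in D:                    # scan every key once
    if is_interesting(D[k]):
        L.append(k)            # keep keys whose value passes the test
```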
Upvotes: 0
Views: 1483
Reputation: 884
The general idea is a MapReduce or Producer-Consumer pattern:
Map Phase/Producer: Divide the search space among the processes you want to spawn (i.e., for four processes, each process gets a quarter of the keys, along with a reference to the dictionary).
Reduce/Consumer: When a process finds a hit, it puts the matching key on a process-safe queue.
When all the processes are done working, your queue will hold the results.
Almost certainly the lowest-effort way to do this is the multiprocessing.Pool.map function (docs).
Upvotes: 2
Reputation: 26766
Depending on the Python interpreter you're using, and exactly what methods you're calling while searching the dictionary, multithreading will probably not speed things up. CPython's Global Interpreter Lock (GIL) means that only one thread can execute Python code at a given time.
Now, if you're using libraries written in C and optimised for performance, they may release the GIL while doing mathematical heavy lifting (NumPy is a good example). The same applies to threads waiting on I/O. Beyond that, you're likely to end up going slower with multiple threads, as there's overhead involved in switching thread contexts.
In Python, you usually get a better result using multi-processing. Each process will have its own GIL and so code can run in parallel. Assuming your dictionary really is read only, it's easy enough to give a copy of the dictionary to each process that is spawned.
The downside to multi-processing is that there's more overhead involved in communicating between processes, so the more isolated the execution, the better the results you'll see. It's also worth noting that Windows tends to have a higher cost associated with spawning new processes, but this shouldn't be an issue for anything CPU-bound, as the number of processes you'll have will likely be very small.
Upvotes: 5