Reputation: 2425
I'm using Python to construct a Markov chain generator. The chain model is constructed from the training data, and offers the ability to look up a sequence of words to find out what the next word is most likely to be.
The chain model is a dictionary, with tuple keys ("states", which represent a sequence of words), and dict values (which represent the choice of words that can come after that sequence). The choice dicts have string keys (which represent each word) and int values (which represent the frequency of that word). For example:
>>> make_model("I went to the shop then I went home then I went to bed")
{ (BEGIN, BEGIN): {"I": 1},
(BEGIN, "I"): {"went": 1},
("I", "went"): {"to": 2, "home": 1},
("went", "to"): {"the": 1, "bed": 1},
("to", "the"): {"shop": 1},
... }
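For concreteness, a minimal single-process version looks something like this (BEGIN is just a sentinel value, and the state length is fixed at 2 by default; my actual code may differ in the details):
BEGIN = "__BEGIN__"  # sentinel marking the start of the text

def make_model(text, order=2):
    words = [BEGIN] * order + text.split()
    model = {}
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])       # e.g. ("I", "went")
        choices = model.setdefault(state, {})
        nxt = words[i + order]
        choices[nxt] = choices.get(nxt, 0) + 1  # count word frequencies
    return model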
However, I'm trying to make model generation as fast as possible. In order to do this I'm trying to use the multiprocessing package. My attempt can be boiled down into 3 steps:
1. Split the training data into n segments, where n is the number of available processors.
2. Build a partial model for each segment in parallel, using multiprocessing.Pool().map.
3. Merge the partial models into a single model dict.
1 and 2 are very fast. However, I'm struggling with the 3rd step. The only way I can think of to do this uses 3 nested for loops (partial models -> states -> choices, sketched below) to build a single dict containing the states from all partial models, with all word choice frequencies correctly summed -- on a single processor. But this approach, overall, is slower than doing the whole thing on a single processor (and then not needing step 3 at all).
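For reference, that merge looks something like this (partial_models standing for the list of dicts returned by Pool().map):
final = {}
for partial in partial_models:                # loop 1: partial models
    for state, choices in partial.items():    # loop 2: states
        merged = final.setdefault(state, {})
        for word, count in choices.items():   # loop 3: word choices
            merged[word] = merged.get(word, 0) + count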
I've tried making the final model dict a multiprocessing.Manager().dict(), but that's far slower (I suspect because it's being passed around and locked/unlocked so much). I've also tried making the internal dicts instances of multiprocessing.Manager().dict() or multiprocessing.Value(), but multiprocessing does not allow me to create these objects while programme flow is divided between multiple processes - I'd have to create them beforehand.
How can I implement multiprocessing in the formation of a single dict?
Upvotes: 2
Views: 139
Reputation: 179452
You could maybe make it slightly faster by using collections.Counters as your values:
from collections import Counter
d1 = {("a", "b"): Counter({"c": 3, "d": 4})}
d2 = {("a", "b"): Counter({"c": 5, "e": 6})}
d1['a','b'] += d2['a','b']
# d1 is now {('a', 'b'): Counter({'c': 8, 'e': 6, 'd': 4})}
Counters merge naturally, so they might be a little faster. In terms of code length, this is definitely a lot nicer:
from collections import defaultdict, Counter

final = defaultdict(Counter)
for d in results:
    for key in d:
        final[key] += d[key]
but it might not actually be substantially faster if there's a lot of data to merge.
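For example, here's a sketch of how the pieces could fit together (make_partial_model and segments stand in for your step-1/step-2 code, and BEGIN for your sentinel):
from collections import Counter, defaultdict
from multiprocessing import Pool

BEGIN = "__BEGIN__"  # stand-in for the BEGIN sentinel in the question

def make_partial_model(segment):
    # Build one partial model with Counter values, so merging can use +=.
    model = defaultdict(Counter)
    words = [BEGIN, BEGIN] + segment.split()
    for i in range(len(words) - 2):
        model[words[i], words[i + 1]][words[i + 2]] += 1
    return dict(model)  # a plain dict pickles cleanly back to the parent

if __name__ == "__main__":
    segments = ["I went to the shop", "then I went home"]  # your step-1 split
    with Pool() as pool:
        results = pool.map(make_partial_model, segments)
    # ...then merge `results` with the defaultdict(Counter) loop above.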
Upvotes: 1