Reputation: 2425
I'm using Python to construct a Markov chain generator. The chain model is constructed from the training data, and offers the ability to look up a sequence of words to find out what the next word is most likely to be.
The chain model is a dictionary, with tuple keys ("states", which represent a sequence of words), and dict values (which represent the choice of words that can come after that sequence). The choice dicts have string keys (which represent each word) and int values (which represent the frequency of that word). For example:
>>> make_model("I went to the shop then I went home then I went to bed")
{ (BEGIN, BEGIN): {"I": 1},
(BEGIN, "I"): {"went": 1},
("I", "went"): {"to": 2, "home": 1},
("went", "to"): {"the": 1, "bed": 1},
("to", "the"): {"shop": 1},
... }
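For concreteness, a minimal single-process version looks something like this (BEGIN is just a sentinel value, and the state length is fixed at 2 by default; my actual code may differ in the details):
BEGIN = "__BEGIN__"  # sentinel marking the start of the text

def make_model(text, order=2):
    words = [BEGIN] * order + text.split()
    model = {}
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])       # e.g. ("I", "went")
        choices = model.setdefault(state, {})
        nxt = words[i + order]
        choices[nxt] = choices.get(nxt, 0) + 1  # count word frequencies
    return model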
However, I'm trying to make model generation as fast as possible. In order to do this I'm trying to use the multiprocessing package. My attempt can be boiled down into 3 steps:
1. Split the training data into n segments, where n is the number of available processors.
2. Build a partial model for each segment in parallel, using multiprocessing.Pool().map.
3. Merge the partial models into a single model dict.
1 and 2 are very fast. However, I'm struggling with the 3rd step. The only way I can think of to do this uses 3 nested for loops (partial models -> states -> choices, sketched below) to build a single dict containing the states from all partial models, with all word choice frequencies correctly summed -- on a single processor. But this approach, overall, is slower than doing the whole thing on a single processor (and then not needing step 3 at all).
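For reference, that merge looks something like this (partial_models standing for the list of dicts returned by Pool().map):
final = {}
for partial in partial_models:                # loop 1: partial models
    for state, choices in partial.items():    # loop 2: states
        merged = final.setdefault(state, {})
        for word, count in choices.items():   # loop 3: word choices
            merged[word] = merged.get(word, 0) + count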
I've tried making the final model dict a multiprocessing.Manager().dict(), but that's far slower (I suspect because it's being passed around and locked/unlocked so much). I've also tried making the internal dicts instances of multiprocessing.Manager().dict() or multiprocessing.Value(), but multiprocessing does not allow me to create these objects while programme flow is divided between multiple processes - I'd have to create them beforehand.
How can I implement multiprocessing in the formation of a single dict?
Upvotes: 2
Views: 139
Reputation: 179452
You could maybe make it slightly faster by using collections.Counters as your values:
from collections import Counter
d1 = {("a", "b"): Counter({"c": 3, "d": 4})}
d2 = {("a", "b"): Counter({"c": 5, "e": 6})}
d1['a','b'] += d2['a','b']
# d1 is now {('a', 'b'): Counter({'c': 8, 'e': 6, 'd': 4})}
Counters merge naturally, so they might be a little faster. In terms of code length, this is definitely a lot nicer:
from collections import defaultdict, Counter

final = defaultdict(Counter)
for d in results:
    for key in d:
        final[key] += d[key]
but it might not actually be substantially faster if there's a lot of data to merge.
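For example, here's a sketch of how the pieces could fit together (make_partial_model and segments stand in for your step-1/step-2 code, and BEGIN for your sentinel):
from collections import Counter, defaultdict
from multiprocessing import Pool

BEGIN = "__BEGIN__"  # stand-in for the BEGIN sentinel in the question

def make_partial_model(segment):
    # Build one partial model with Counter values, so merging can use +=.
    model = defaultdict(Counter)
    words = [BEGIN, BEGIN] + segment.split()
    for i in range(len(words) - 2):
        model[words[i], words[i + 1]][words[i + 2]] += 1
    return dict(model)  # a plain dict pickles cleanly back to the parent

if __name__ == "__main__":
    segments = ["I went to the shop", "then I went home"]  # your step-1 split
    with Pool() as pool:
        results = pool.map(make_partial_model, segments)
    # ...then merge `results` with the defaultdict(Counter) loop above.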
Upvotes: 1