Reputation: 191
The idea is to update a particular pre-trained word2vec model with different sets of new corpora. I have the following:
import gensim
from multiprocessing import Pool

# c1, c2, ... are each a list of 100 files
filelist = [c1, c2, c3, c4, c5, c6, c7, c8, c9, c10]

def update_model(files):
    # loading a pre-trained model
    trained_model = gensim.models.Word2Vec.load("model_both_100")
    # DocumentFeeder is an iterable over the documents in the given files
    docs = DocumentFeeder(files)
    trained_model.build_vocab(docs, update=True)
    trained_model.train(docs, total_examples=trained_model.corpus_count,
                        epochs=trained_model.epochs)

with Pool(processes=10) as P:
    P.map(update_model, filelist)
It takes ~13 minutes to run, but the non-parallel version (looping over filelist) takes ~11 minutes. Why is this happening? I'm running on a 12-core CPU.
Upvotes: 0
Views: 72
Reputation: 54173
Gensim's Word2Vec training already uses multiple threads, depending on the workers parameter at model creation. (The default is workers=3, but your model may have been initialized to use even more.)
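For example, you can check what a loaded model was configured with (a minimal sketch; "model_both_100" is the filename from your question):

import gensim

# workers is fixed when the model is created and is stored on the model;
# it controls how many training threads gensim uses internally
trained_model = gensim.models.Word2Vec.load("model_both_100")
print(trained_model.workers)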
So you are launching 10 (heavyweight) processes, each separately loading a full-size model. That could easily trigger heavy memory usage & thus virtual-memory swapping.
Then each of those models does its own (single-threaded) vocabulary expansion, then its training (one manager thread plus 3 or more worker threads). If they're all training simultaneously, that means 40 threads active, within 10 OS processes, on your 12-core processor. There's no reason to necessarily expect a speedup in such a situation, and the contention of more threads than cores, all competing for access to totally different loaded-model memory ranges, could easily explain a slowdown.
Are you really trying to create 10 separate incrementally-updated models? (Do they get re-saved to 10 different filenames after the update-training?)
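If that is the goal, a plain sequential loop that relies on gensim's own worker threads, and re-saves each updated model under its own name, may be the simpler route (a rough sketch based on your code; DocumentFeeder is your own iterable, and the output filenames are hypothetical):

import gensim

for i, files in enumerate(filelist):
    # each update starts from the same pre-trained model
    trained_model = gensim.models.Word2Vec.load("model_both_100")
    docs = DocumentFeeder(files)
    trained_model.build_vocab(docs, update=True)
    trained_model.train(docs, total_examples=trained_model.corpus_count,
                        epochs=trained_model.epochs)
    # hypothetical output name: save each incrementally-updated model separately
    trained_model.save("model_both_100_updated_%d" % i)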
Upvotes: 1