Reputation: 1
I have a massive corpus: about 11 billion sentences of roughly 10 words each, split across more than 12,000 files ending in .txt.gz. I want to train skip-gram vectors on it with Gensim's Word2Vec, and I'm using Gensim's multi-file streaming class PathLineSentences to read the data.
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')
w2vModel = Word2Vec(sentences,
                    vector_size=128,
                    window=5,
                    min_count=2,
                    workers=24,
                    epochs=200,
                    sg=1,
                    hs=1,
                    batch_words=100000,
                    compute_loss=True)
But I've hit a problem: the vocabulary-scan phase before training is very slow (because it can only run in a single thread?). The following is a screenshot of the top command output: [screenshot omitted] It has been running like that for almost 12 hours. Can the vocabulary scan at this stage be multithreaded, or is there any other way to speed it up? Thank you all. Here is my callback:
from datetime import datetime
from gensim.models.callbacks import CallbackAny2Vec

# Reports the per-epoch loss delta and saves a model checkpoint after every epoch.
class callback(CallbackAny2Vec):
    def __init__(self, path_prefix):
        self.epoch = 0
        self.path_prefix = path_prefix
        self.loss_to_be_subed = 0
        self.start_time = datetime.now()
        self.end_time = datetime.now()

    def on_epoch_begin(self, model):
        self.start_time = datetime.now()
        print('Epoch {} start'.format(self.epoch))

    def on_epoch_end(self, model):
        self.end_time = datetime.now()
        loss = model.get_latest_training_loss() / 100000
        loss_now = loss - self.loss_to_be_subed
        self.loss_to_be_subed = loss
        print('Loss after epoch {}: {}'.format(self.epoch, loss_now))
        print('Epoch {} end'.format(self.epoch))
        output_path = '{}_epoch_{}.model'.format(self.path_prefix, self.epoch)
        model.save(output_path)
        print('duration of epoch {} is {}\n'.format(self.epoch, self.end_time - self.start_time))
        self.epoch += 1
Upvotes: 0
Views: 278
Reputation: 54153
The initial vocabulary-scan is unfortunately single-threaded: it must read all data once, tallying up all words, to determine which rare words will be ignored, and rank all other words in frequency order.
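To at least see how far along that scan is, Gensim reports its progress through Python's standard logging module, so enabling INFO-level logging before constructing the model will print periodic progress lines. A minimal sketch:

import logging

# Show gensim's INFO-level progress messages, including the periodic updates
# it prints during the vocabulary scan and the later training passes.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)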
You can have more control over the process if you refrain from passing the sentences corpus to the initial constructor. Instead, leave it off, then call the later steps .build_vocab() & .train() yourself. (The .build_vocab() step will be the long single-threaded step.) Then, you have the option of saving the model after .build_vocab() has completed. (Potentially, then, you could re-load it, tinker with some settings, and run other training sessions without requiring a full repeated scan.)
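A minimal sketch of that two-step flow, using placeholder paths and illustrative parameter values rather than the asker's exact settings:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')          # directory of .txt.gz files (placeholder path)

model = Word2Vec(vector_size=128, window=5, sg=1, workers=8)   # note: no corpus passed yet

model.build_vocab(sentences)                   # the long, single-threaded vocabulary scan
model.save('after_vocab_scan.model')           # checkpoint so the scan needn't be repeated

# later, possibly after reloading with Word2Vec.load('after_vocab_scan.model'):
model.train(sentences,
            total_examples=model.corpus_count, # counts gathered during build_vocab()
            epochs=model.epochs)
model.save('trained.model')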
Also, if you're just starting out, I'd recommend doing initial trials with a smaller dataset – perhaps just some subsampled 1/10th or 1/20th of your whole corpus – so that you can get your process working, & optimized somewhat, before attempting the full training.
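One way to create such a trial subset is to copy a random fraction of the .txt.gz files into a separate directory and point PathLineSentences at that instead; the directory names below are placeholders:

import glob, os, random, shutil

src_dir = 'path'            # directory holding the ~12,000 .txt.gz files (placeholder)
trial_dir = 'path_trial'    # destination for the smaller trial corpus (placeholder)
os.makedirs(trial_dir, exist_ok=True)

all_files = glob.glob(os.path.join(src_dir, '*.txt.gz'))
for f in random.sample(all_files, max(1, len(all_files) // 20)):   # roughly 1/20th of the files
    shutil.copy(f, trial_dir)

# then: sentences = PathLineSentences(trial_dir)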
Separately, regarding your implied setup:

min_count=2 is usually a bad idea with Word2Vec & related algorithms. The model can only achieve useful vectors for words with a variety of multiple usages, so the class's default of min_count=5 is a good minimum value, and when using a larger corpus (like yours) it makes more sense to increase this floor than to lower it. (While increasing min_count won't speed the vocabulary survey, it will speed training & typically improves the quality of the remaining words' vectors, because without the 'noise' of rare words, other words' training goes better.)

Word2Vec training throughput usually maxes out somewhere in the range of 6-12 workers (largely due to Python GIL bottlenecks). Higher values slow things down. (Unfortunately, the best value can only be found via trial & error – starting training & observing the logged rate over a few minutes – and the optimal number of workers will change with other settings like window or negative.)

epochs=200 is overkill (and likely to take a very long time). With a large dataset, you're more likely to be able to use less than the default epochs=5 than you'll need to use more.

With hs=1 but without also setting negative=0, you've enabled hierarchical-softmax training while leaving the default negative-sampling active. That's likely to at least double your training time, & make your model much larger, for no benefit. With a large dataset, it's odd to consider hs=1 mode at all – it becomes less performant with larger models. (You should probably just avoid touching the hs value unless you're sure you need to.)

It's also unclear why you've changed batch_words & compute_loss from their defaults. (The loss-tallying will slow things down, but also doesn't work very well yet – so it's rare to need.)
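Putting those points together, a more conventional starting configuration might look something like the sketch below – the min_count and workers values shown are illustrative guesses, not prescriptions, and should be tuned against your own corpus and the logged throughput:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')      # placeholder path

w2vModel = Word2Vec(sentences,
                    vector_size=128,
                    window=5,
                    min_count=10,   # illustrative: on a huge corpus, raise the floor rather than lower it
                    workers=8,      # illustrative: throughput usually peaks somewhere in the 6-12 range
                    sg=1)
# epochs, hs, negative, batch_words & compute_loss are left at their defaults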
In general, your setup changes a lot of things best left untouched, unless/until you're sure you can measure the net effects of the changes.
Upvotes: 1