Reputation: 1
I have a massive corpus: about 11 billion sentences of roughly 10 words each, split across more than 12,000 files ending in .txt.gz. I want to train skip-gram vectors on it with Gensim's Word2Vec, and I'm using Gensim's multi-file streaming class PathLineSentences to read the data.
from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')
w2vModel = Word2Vec(sentences,
                    vector_size=128,
                    window=5,
                    min_count=2,
                    workers=24,
                    epochs=200,
                    sg=1,
                    hs=1,
                    batch_words=100000,
                    compute_loss=True)
But I've hit a problem: the vocabulary-scan phase before training is very slow (because it can only run in a single thread?). The following is a screenshot of the top command output: [screenshot omitted] It has been running like that for almost 12 hours. Can the vocabulary scan at this stage be multithreaded, or is there any other way to speed it up? Thank you all. Here is my callback:
from datetime import datetime
from gensim.models.callbacks import CallbackAny2Vec

# Reports the per-epoch loss delta and saves a model checkpoint after every epoch.
class callback(CallbackAny2Vec):
    def __init__(self, path_prefix):
        self.epoch = 0
        self.path_prefix = path_prefix
        self.loss_to_be_subed = 0
        self.start_time = datetime.now()
        self.end_time = datetime.now()

    def on_epoch_begin(self, model):
        self.start_time = datetime.now()
        print('Epoch {} start'.format(self.epoch))

    def on_epoch_end(self, model):
        self.end_time = datetime.now()
        loss = model.get_latest_training_loss() / 100000
        loss_now = loss - self.loss_to_be_subed
        self.loss_to_be_subed = loss
        print('Loss after epoch {}: {}'.format(self.epoch, loss_now))
        print('Epoch {} end'.format(self.epoch))
        output_path = '{}_epoch_{}.model'.format(self.path_prefix, self.epoch)
        model.save(output_path)
        print('duration of epoch {} is {}\n'.format(self.epoch, self.end_time - self.start_time))
        self.epoch += 1
Upvotes: 0
Views: 278
Reputation: 54153
The initial vocabulary-scan is unfortunately single-threaded: it must read all data once, tallying up all words, to determine which rare words will be ignored, and rank all other words in frequency order.
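To at least see how far along that scan is, Gensim reports its progress through Python's standard logging module, so enabling INFO-level logging before constructing the model will print periodic progress lines. A minimal sketch:

import logging

# Show gensim's INFO-level progress messages, including the periodic updates
# it prints during the vocabulary scan and the later training passes.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)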
You can have more control over the process if you refrain from passing the sentences corpus to the initial constructor. Instead, leave it off, then call the later steps .build_vocab() & .train() yourself. (The .build_vocab() step will be the long single-threaded step.) Then, you have the option of saving the model after .build_vocab() has completed. (Potentially, then, you could re-load it, tinker with some settings, and run other training sessions without requiring a full repeated scan.)
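A minimal sketch of that two-step flow, using placeholder paths and illustrative parameter values rather than the asker's exact settings:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')          # directory of .txt.gz files (placeholder path)

model = Word2Vec(vector_size=128, window=5, sg=1, workers=8)   # note: no corpus passed yet

model.build_vocab(sentences)                   # the long, single-threaded vocabulary scan
model.save('after_vocab_scan.model')           # checkpoint so the scan needn't be repeated

# later, possibly after reloading with Word2Vec.load('after_vocab_scan.model'):
model.train(sentences,
            total_examples=model.corpus_count, # counts gathered during build_vocab()
            epochs=model.epochs)
model.save('trained.model')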
Also, if you're just starting out, I'd recommend doing initial trials with a smaller dataset – perhaps just some subsampled 1/10th or 1/20th of your whole corpus – so that you can get your process working, & optimized somewhat, before attempting the full training.
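One way to create such a trial subset is to copy a random fraction of the .txt.gz files into a separate directory and point PathLineSentences at that instead; the directory names below are placeholders:

import glob, os, random, shutil

src_dir = 'path'            # directory holding the ~12,000 .txt.gz files (placeholder)
trial_dir = 'path_trial'    # destination for the smaller trial corpus (placeholder)
os.makedirs(trial_dir, exist_ok=True)

all_files = glob.glob(os.path.join(src_dir, '*.txt.gz'))
for f in random.sample(all_files, max(1, len(all_files) // 20)):   # roughly 1/20th of the files
    shutil.copy(f, trial_dir)

# then: sentences = PathLineSentences(trial_dir)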
Separately, regarding your implied setup:

min_count=2 is usually a bad idea with Word2Vec & related algorithms. The model can only achieve useful vectors for words with a variety of multiple usages, so the class's default of min_count=5 is a good minimum value, and when using a larger corpus (like yours) it makes more sense to increase this floor than to lower it. (While increasing min_count won't speed the vocabulary survey, it will speed training & typically improves the quality of the remaining words' vectors, because without the 'noise' of rare words, other words' training goes better.)

Word2Vec training throughput usually maxes out somewhere in the range of 6-12 workers (largely due to Python GIL bottlenecks). Higher values slow things down. (Unfortunately, the best value can only be found via trial & error – starting training & observing the logged rate over a few minutes – and the optimal number of workers will change with other settings like window or negative.)

epochs=200 is overkill (and likely to take a very long time). With a large dataset, you're more likely to be able to use less than the default epochs=5 than you'll need to use more.

With hs=1 but without also setting negative=0, you've enabled hierarchical-softmax training while leaving the default negative-sampling active. That's likely to at least double your training time, & make your model much larger, for no benefit. With a large dataset, it's odd to consider hs=1 mode at all – it becomes less performant with larger models. (You should probably just avoid touching the hs value unless you're sure you need to.)

It's also unclear why you've changed batch_words & compute_loss from their defaults. (The loss-tallying will slow things down, but also doesn't work very well yet – so it's rare to need.)
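Putting those points together, a more conventional starting configuration might look something like the sketch below – the min_count and workers values shown are illustrative guesses, not prescriptions, and should be tuned against your own corpus and the logged throughput:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

sentences = PathLineSentences('path')      # placeholder path

w2vModel = Word2Vec(sentences,
                    vector_size=128,
                    window=5,
                    min_count=10,   # illustrative: on a huge corpus, raise the floor rather than lower it
                    workers=8,      # illustrative: throughput usually peaks somewhere in the 6-12 range
                    sg=1)
# epochs, hs, negative, batch_words & compute_loss are left at their defaults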
In general, your setup changes a lot of things best left untouched, unless/until you're sure you can measure the net effects of the changes.
Upvotes: 1