Can we build a word2vec model in a distributed way?

Currently I have 1.2 TB of text data to build gensim's word2vec model on. Training is taking almost 15 to 20 days to complete.

I want to build a model for 5 TB of text data, which might take a few months. I need to minimise this execution time. Is there any way we can use multiple big machines to create the model?

Please suggest any approach that could help me reduce the execution time.

FYI, I have all my data in S3 and I use the smart_open module to stream it.
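For context, my corpus is roughly a restartable iterable like the sketch below (the bucket path is a placeholder; `smart_open.open` is the API in recent smart_open releases, older ones expose `smart_open.smart_open`):

```python
import smart_open

class S3Corpus:
    """Stream tokenized lines from S3 without loading the file into memory."""

    def __init__(self, uri):
        self.uri = uri  # e.g. "s3://my-bucket/corpus.txt" (placeholder)

    def __iter__(self):
        # gensim re-iterates the corpus once per epoch, so __iter__
        # must be restartable rather than a one-shot generator.
        for line in smart_open.open(self.uri, "r"):
            yield line.split()
```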

Upvotes: 3

Views: 1007

Answers (2)

Guillaume Massé

Reputation: 8064

You can use Apache Spark; its MLlib library includes a distributed Word2Vec implementation: https://javadoc.io/doc/org.apache.spark/spark-mllib_2.12/latest/org/apache/spark/mllib/feature/Word2Vec.html

Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

(Quoted from the Word2Vec documentation in apache/spark at commit e053c55.)
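Since the question uses Python, a minimal PySpark sketch of the same API might look like this (the S3 path and the partition/iteration counts are placeholders, not recommendations):

```python
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="distributed-word2vec")

# Each RDD element must be a sequence of tokens (one sentence/document).
# "s3a://my-bucket/corpus/" is a placeholder path.
corpus = (
    sc.textFile("s3a://my-bucket/corpus/*.txt")
      .map(lambda line: line.lower().split())
)

# Each partition is trained separately and the partial models are merged
# after each iteration, as described above; more partitions give more
# parallelism at the cost of slightly noisier merges.
model = (
    Word2Vec()
    .setVectorSize(300)
    .setNumPartitions(64)
    .setNumIterations(1)
    .fit(corpus)
)

for word, similarity in model.findSynonyms("day", 5):
    print(word, similarity)
```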

Upvotes: 1

Harman

Reputation: 1208

Training a model on a huge corpus will surely take a very long time because of the large number of weights involved. Suppose your word vectors have 300 components and your vocabulary size is 10,000. The weight matrix would then hold 300 × 10,000 = 3 million weights, and word2vec actually maintains two such matrices (one for the hidden layer, one for the output layer)!

To build a model on huge datasets, I would recommend preprocessing the dataset first. The following preprocessing steps can be applied (see the gensim sketch after this list):

  • Removing stop words.
  • Treating word pairs or phrases as single words, e.g. new york becomes new_york.
  • Subsampling frequent words to decrease the number of training examples.
  • Modifying the optimization objective with a technique called “negative sampling”, which makes each training sample update only a small percentage of the model’s weights.
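A minimal gensim sketch of the phrase detection, subsampling, and negative-sampling settings (parameter names follow gensim 4.x, where `vector_size` replaced the older `size`; the tiny in-memory corpus is only a stand-in for your S3 stream):

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Stand-in corpus; in practice, pass your restartable S3 iterable.
sentences = [["new", "york", "is", "big"], ["i", "love", "new", "york"]]

# Detect frequent word pairs and merge them into single tokens
# ("new york" -> "new_york").
bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
phrased = [bigram[s] for s in sentences]

model = Word2Vec(
    phrased,
    vector_size=300,  # 300-component vectors (size=300 on gensim 3.x)
    sample=1e-5,      # aggressively subsample very frequent words
    negative=5,       # negative sampling: 5 noise words per example
    workers=8,        # parallel training threads on one machine
)
```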

The above steps were also applied in the official word2vec implementation released by Google. Gensim provides clean, high-level APIs for most of them. Also, have a look at this blog for further optimization techniques.

One more option: instead of training your own model, use the pre-trained word2vec model released by Google. It's 1.5 GB and includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset.
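Loading those pre-trained vectors in gensim is straightforward (the file name below is the one Google distributes; adjust the path to wherever you downloaded it):

```python
from gensim.models import KeyedVectors

# binary=True because the file is in the binary word2vec format.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(kv.most_similar("king", topn=5))
```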

Upvotes: 0
