Reputation: 30
I have a general question on a specific topic.
I am using the vectors generated by Word2Vec as features for my Distributed Random Forest (DRF) model, which classifies some records. I have millions of records and receive new ones daily. I want the new records to be encoded with the same vector model as the previous records, meaning that the word "AT" maps to the same vector now and in the future. I know that Word2Vec uses a random seed to generate the vectors for the words in the corpus, but I want to control this: I need to set the seed so that if I train a model on a section of the data today and then again on the same data in the future, I get the same model with exactly the same vectors for each word.
The problem with generating new models and then re-encoding is that encoding these records takes a great deal of time, and on top of that my DRF classification model is no longer any good because the vectors for the words have changed, so I have to train a new DRF. Normally this would not be an issue, since I could train each model once and use it forever. However, I know that it is good practice to update packages regularly. This is a problem for h2o, because once you update there is no backward compatibility with models generated on a previous version.
Are there any sources that I could read on how to set the seed on the Word2Vec model for h2o in Python? I am using Python version 3 and h2o version 3.18.
Upvotes: 0
Views: 181
Reputation: 566
Word2vec in h2o-3 uses a Hogwild implementation: the model parameters are updated concurrently from multiple threads, so reproducibility cannot be guaranteed in this implementation.
How big is your text corpus? At the cost of slower model training, you could get reproducible results by limiting the algorithm to a single thread (h2o start-up parameter -nthreads).
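As a rough sketch of the single-thread suggestion from the Python client: h2o.init accepts an nthreads argument, which should correspond to the start-up parameter when the client launches the cluster itself. The helper names and the file name in the usage comment are placeholders of mine, and I have not verified that this yields identical vectors on your version, so treat it as a starting point.

```python
def start_single_threaded_h2o():
    """Start (or connect to) an h2o cluster, asking for a single worker thread."""
    import h2o  # imported lazily so this sketch loads even without h2o installed

    # nthreads only takes effect when this call launches a new cluster;
    # if it connects to a cluster that is already running, that cluster's
    # own thread setting applies instead.
    h2o.init(nthreads=1)
    return h2o


def train_word2vec(words_frame, vec_size=100, epochs=5):
    """Train word2vec on a one-column string H2OFrame of tokens."""
    from h2o.estimators.word2vec import H2OWord2vecEstimator

    model = H2OWord2vecEstimator(vec_size=vec_size, epochs=epochs)
    model.train(training_frame=words_frame)
    return model


# Hypothetical usage (the file name is a placeholder):
# h2o = start_single_threaded_h2o()
# words = h2o.import_file("tokens.csv", col_types=["string"])
# model = train_word2vec(words)
```

To check whether this is enough for your case, train twice on the same frame in freshly started single-threaded clusters and compare the vectors for a few words.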
Upvotes: 1