Reputation: 1
I want to train word2vec (using gensim) on a large corpus. The only information I have is the co-occurrence counts of word pairs. My data has the format
word__tab__context_word__tab__Number
(e.g.: danger of 10, meaning 'danger' and 'of' co-occurred 10 times within a window size of 5 in the corpus) for each line.
Does gensim's word2vec accept such input? I have searched through the gensim tutorials and haven't seen any examples like this.
Thanks a lot for help. Li
Upvotes: 0
Views: 355
Reputation: 54173
Gensim doesn't take that as input; it expects actual text examples.
But, you could approximate skip-gram training by generating a synthetic corpus from your information.
For danger of 10, just generate 10 texts, each of ['danger', 'of']. (Gensim Word2Vec expects token-lists.) These synthetic texts will result in Word2Vec training encountering 10 skip-gram training-examples of 'danger' predicting 'of', and 10 skip-gram training-examples of 'of' predicting 'danger'. (So, if your co-occurrence lists also include of danger 10, you may want to discard those to avoid double-synthesis.)
It won't exactly match true skip-gram training with a window of 5, because training on real texts randomly shrinks the effective window, giving closer words more weight, and your data doesn't include information about closeness. But the results should be similar if you have no other options.
Upvotes: 0