Ying Li

Reputation: 1

Can I use word-context-count pairs as input to gensim's Word2Vec?

I want to train word2vec (using gensim) on a large corpus. The only information I have is the co-occurrence count of each pair of words. Each line of my data has the format
word__tab__context_word__tab__Number
(e.g. danger of 10, meaning 'danger' and 'of' co-occurred 10 times within a window size of 5 in the corpus). Does gensim's word2vec accept such input? I have searched through the gensim tutorials and haven't seen any examples like this.

Thanks a lot for help. Li

Upvotes: 0

Views: 355

Answers (1)

gojomo

Reputation: 54173

Gensim doesn't take that as input; it expects actual text examples.

But, you could approximate skip-gram training by generating a synthetic corpus from your information.

For danger of 10, just generate 10 texts, each of ['danger', 'of']. (Gensim Word2Vec expects token-lists.) These synthetic texts will result in Word2Vec training encountering 10 skip-gram training-examples of 'danger' predicting 'of', and 10 skip-gram training-examples of 'of' predicting 'danger'. (So, if your co-occurrence lists also include of danger 10, you may want to discard those to avoid double-synthesis.)

It won't exactly be true skip-gram with a window of 5, because training on real texts randomly shrinks the window, to give closer words more weight – and your data doesn't include information on closeness. But it should be similar in results if you have no other options.
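A minimal sketch of the synthetic-corpus idea described above, assuming tab-separated lines of word, context word, and count. The names PairCorpus, skip_reversed, and train are hypothetical helpers invented for this illustration, not part of gensim, and the parameter values passed to Word2Vec are placeholders assuming gensim 4.x. Dropping the reversed pair is only appropriate if your counts are symmetric.

```python
class PairCorpus:
    """Restartable iterable that yields one two-token text per co-occurrence.

    `lines` must be re-iterable (e.g. a list), because Word2Vec makes
    several passes over its corpus. Hypothetical helper, not a gensim class.
    """

    def __init__(self, lines, skip_reversed=True):
        self.lines = lines
        self.skip_reversed = skip_reversed

    def __iter__(self):
        seen = set()  # unordered pairs already emitted (held in memory)
        for line in self.lines:
            line = line.rstrip("\n")
            if not line:
                continue
            word, context, count = line.split("\t")
            pair = frozenset((word, context))
            # 'of danger 10' repeats 'danger of 10', so skip it to avoid
            # synthesizing the same co-occurrences twice.
            if self.skip_reversed and pair in seen:
                continue
            seen.add(pair)
            for _ in range(int(count)):
                yield [word, context]


def train(corpus):
    """Train skip-gram vectors on the synthetic texts (requires gensim 4.x)."""
    from gensim.models import Word2Vec  # imported here so the helper above works without gensim

    # sg=1 selects skip-gram; window=1 is enough because every text has two tokens.
    # vector_size, min_count and epochs are assumed placeholder values.
    return Word2Vec(sentences=corpus, sg=1, window=1, min_count=1,
                    vector_size=100, epochs=5)


demo_lines = ["danger\tof\t3", "of\tdanger\t3", "risk\tof\t2"]
corpus = PairCorpus(demo_lines)
texts = list(corpus)
```

With gensim installed, train(corpus) would return a trained model. For a file too large to hold in a list, __iter__ could be adapted to reopen the file on every pass instead; note that the set of seen pairs still grows with the number of distinct pairs.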

Upvotes: 0
