Reputation: 3
I am trying to do my own word embedding using the word2vec package (https://pypi.org/project/word2vec/). However, I can't find which input file format the word2vec function expects.
I tried a .txt file and a pickle file, but neither works.
For example, corpus.txt was made with Windows Notepad and contains "I am a foo bar corpus test":
import word2vec
word2vec.word2vec("corpus.txt", "corpus.bin", size=100, verbose=True)
I would have expected:
Vocab size: 7
Words in train file: 7
as in the example here: https://nbviewer.jupyter.org/github/danielfrg/word2vec/blob/master/examples/word2vec.ipynb
but got only
Vocab size: 1
Words in train file: 0
Does anyone know which type/format of file this function accepts?
Thank you in advance!
Upvotes: 0
Views: 528
Reputation: 54153
There's a good chance your specific results are because most word2vec implementations discard all words that appear fewer than some minimum-count value, usually 5. (Word2Vec doesn't create good vectors for such rare words, and their presence usually interferes with better vectors for other more-common words, so discarding them is usually a good idea on real-sized corpuses.)
So a toy-sized input file of just 7 words, each appearing once, leaves nothing but (maybe) one synthetic word.
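With a toy corpus like yours, the first thing to try is lowering that minimum count. Here is a minimal sketch, assuming the wrapper exposes the underlying C tool's -min-count option as a min_count keyword argument and provides a word2vec.load() helper (check help(word2vec.word2vec) to confirm):

import word2vec

# Assumption: the wrapper accepts a min_count keyword mirroring the C tool's
# -min-count flag; min_count=1 keeps words that appear only once.
word2vec.word2vec("corpus.txt", "corpus.bin", size=100,
                  min_count=1, verbose=True)

# Assumption: the package provides a loader for the trained binary file.
model = word2vec.load("corpus.bin")
print(model.vocab)  # should now list the toy corpus words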
Because that PyPI package appears to be a thin wrapper around the word2vec.c code originally released by Google, you could probably refer to that code to learn more details about formats/usage.
But you could also use the Word2Vec implementation in the Gensim library - a far more common choice when using Python, with much more documentation & flexibility.
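As an example of that route, here is a minimal Gensim sketch (assuming Gensim 4.x, where the dimensionality parameter is named vector_size). Note that Gensim takes an iterable of tokenized sentences rather than a file path:

from gensim.models import Word2Vec

# Gensim expects an iterable of tokenized sentences, not a filename.
sentences = [["I", "am", "a", "foo", "bar", "corpus", "test"]]

# min_count=1 keeps even once-only words (fine for toy data, usually a bad
# idea on real corpora).
model = Word2Vec(sentences, vector_size=100, min_count=1)

print(len(model.wv))           # vocabulary size
print(model.wv["corpus"][:5])  # first few dimensions of one word's vector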
Upvotes: 0