Reputation: 3
I am trying to do my own word embedding using the word2vec package (https://pypi.org/project/word2vec/). However, I can't find which input file format the word2vec function expects.
I tried a .txt file and a pickle file, but neither works.
For example, corpus.txt was made with Windows Notepad and contains "I am a foo bar corpus test":
import word2vec
word2vec.word2vec("corpus.txt", "corpus.bin", size=100, verbose=True)
I would have expected:
Vocab size: 7
Words in train file: 7
as in the example here: https://nbviewer.jupyter.org/github/danielfrg/word2vec/blob/master/examples/word2vec.ipynb
but got only
Vocab size: 1
Words in train file: 0
Does anyone know which type/format of file this function accepts?
Thank you in advance!
Upvotes: 0
Views: 528
Reputation: 54153
There's a good chance your specific results are because most word2vec implementations discard all words that appear fewer than some minimum-count value, usually 5. (Word2Vec doesn't create good vectors for such rare words, and their presence usually interferes with better vectors for other more-common words, so discarding them is usually a good idea on real-sized corpuses.)
So a toy-sized input file of just 7 words, each appearing once, leaves nothing but (maybe) one synthetic word.
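With a toy corpus like yours, the first thing to try is lowering that minimum count. Here is a minimal sketch, assuming the wrapper exposes the underlying C tool's -min-count option as a min_count keyword argument and provides a word2vec.load() helper (check help(word2vec.word2vec) to confirm):

import word2vec

# Assumption: the wrapper accepts a min_count keyword mirroring the C tool's
# -min-count flag; min_count=1 keeps words that appear only once.
word2vec.word2vec("corpus.txt", "corpus.bin", size=100,
                  min_count=1, verbose=True)

# Assumption: the package provides a loader for the trained binary file.
model = word2vec.load("corpus.bin")
print(model.vocab)  # should now list the toy corpus words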
Because that PyPI package appears to be a thin wrapper around the word2vec.c code originally released by Google, you could probably refer to that code to learn more details about formats/usage.
But you could also use the Word2Vec implementation in the Gensim library - a far more common choice when using Python, with much more documentation & flexibility.
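As an example of that route, here is a minimal Gensim sketch (assuming Gensim 4.x, where the dimensionality parameter is named vector_size). Note that Gensim takes an iterable of tokenized sentences rather than a file path:

from gensim.models import Word2Vec

# Gensim expects an iterable of tokenized sentences, not a filename.
sentences = [["I", "am", "a", "foo", "bar", "corpus", "test"]]

# min_count=1 keeps even once-only words (fine for toy data, usually a bad
# idea on real corpora).
model = Word2Vec(sentences, vector_size=100, min_count=1)

print(len(model.wv))           # vocabulary size
print(model.wv["corpus"][:5])  # first few dimensions of one word's vector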
Upvotes: 0