Jonathan Scott

Reputation: 71

Gensim Word2Vec 'you must first build vocabulary before training the model'

I am trying to fit a Word2Vec model. According to the documentation for Gensim's Word2Vec, we do not need to call model.build_vocab before using it, yet it keeps asking me to do so. I have tried calling this function and it has not worked. I have also fitted a Word2Vec model before without needing to call model.build_vocab.

Am I doing something wrong? Here is my code:

import pandas as pd
from gensim.models import Word2Vec

dataset = pd.read_table('genemap_copy.txt', delimiter='\t', lineterminator='\n')

def row_to_sentences(dataframe):
    columns = dataframe.columns.values
    corpus = []
    for index, row in dataframe.iterrows():
        if index == 1000:  # only use the first 1000 rows
            break
        sentence = ''
        for column in columns:
            sentence += ' ' + str(row[column])
        corpus.append([sentence])  # each "sentence" is one concatenated string per row
    return corpus

corpus = row_to_sentences(dataset)
clean_corpus = [[sentence[0].lower()] for sentence in corpus]


# model = Word2Vec()
# model.build_vocab(clean_corpus)
model = Word2Vec(clean_corpus, size=100, window=5, min_count=5, workers=4)
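
For reference, the explicit two-step flow that the error message refers to looks roughly like this. This is only a sketch, assuming a pre-4.0 gensim (the versions that still take `size`); the exact `train()` arguments vary by version:

    from gensim.models import Word2Vec

    # Sketch only: explicit two-step flow instead of passing the corpus to the constructor.
    model = Word2Vec(size=100, window=5, min_count=5, workers=4)  # no corpus passed, nothing trained yet
    model.build_vocab(clean_corpus)                               # scan the corpus once to build the vocabulary
    model.train(clean_corpus,
                total_examples=model.corpus_count,                # sentences counted during build_vocab
                epochs=model.iter)                                # default number of training passes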

Help is greatly appreciated! Also, I am using macOS Sierra, and there is not much support online for using Gensim on a Mac D:

Upvotes: 2

Views: 7136

Answers (3)

James Allen-Robertson

Reputation: 571

Is it that you are appending a new list containing a single string each time, i.e. corpus.append([sentence])? You need to feed Word2Vec a series of sentences, where each sentence is a list of tokens, though the sentences don't necessarily have to be grouped by document. I'm also not clear on what is in your df, but have you tokenised the sentences already?

Here is a generator class I've used before for Word2Vec:

import gensim
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess

class MySentences(object):
    def __init__(self, docs):
        self.corpus = docs

    def __iter__(self):
        # Restartable iterator: Word2Vec passes over the corpus several times
        # (once for build_vocab, then once per training epoch), so a plain
        # generator function would be exhausted after the first pass.
        for doc in self.corpus:
            doc_sentences = sent_tokenize(doc)
            for sent in doc_sentences:
                yield simple_preprocess(sent)  # yields a tokenized sentence, e.g. ['like', 'this', 'one']

sentences = MySentences(df['text'].tolist())
model = gensim.models.Word2Vec(sentences, min_count=5, workers=8, size=300, sg=1)

Upvotes: 1

Jonathan Scott

Reputation: 71

I think my problem was the parameter min_count=5: it was dropping most of my words, because they did not appear at least 5 times.
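
If that is the case, lowering min_count keeps rarer terms in the vocabulary. A minimal sketch, assuming the same clean_corpus as in the question:

    from gensim.models import Word2Vec

    # min_count=1 keeps every word, even ones that appear only once
    # (fine for small corpora; larger corpora usually want a higher threshold to filter noise).
    model = Word2Vec(clean_corpus, size=100, window=5, min_count=1, workers=4)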

Upvotes: 5

EtienneG

Reputation: 320

Try with LineSentence:

from gensim.models.word2vec import LineSentence

and then train on your corpus with

model = Word2Vec(LineSentence(clean_corpus), size=100, window=5, min_count=5, workers=4)
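
Note that LineSentence is designed to read from a file (or file-like object) containing one whitespace-tokenized sentence per line. A sketch of that usage, assuming you first write clean_corpus out to a hypothetical sentences.txt:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Hypothetical path; each item in clean_corpus is a one-element list of strings.
    with open('sentences.txt', 'w') as f:
        for sentence in clean_corpus:
            f.write(sentence[0] + '\n')

    model = Word2Vec(LineSentence('sentences.txt'), size=100, window=5, min_count=5, workers=4)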

Upvotes: 1
