ljushu

Reputation: 115

word2vec/gensim — RuntimeError: you must first build vocabulary before training the model

I am having trouble training my own word2vec model on the .txt files.

The code:

import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors


# loading the .txt files

sentences = []
sentence = []
for doc in glob.glob('./data/*.txt'):
    with open(doc, 'r') as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":
                        sentence.append(form.lower())


# trying to train the model

from gensim.models import Word2Vec
model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)

Error message:

RuntimeError: you must first build vocabulary before training the model

How do I build the vocabulary?

The code works with the sample .conll files, but I want to train the model on my own data.

Upvotes: 0

Views: 1312

Answers (2)

ljushu

Reputation: 115

Thanks to @gojomo's suggestion and to this answer, I resolved the empty-sentences issue. I needed the following block of code:

# an iterator that reads the corpus file one line at a time, instead of
# reading everything into memory at once; each line is treated as one sentence

class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath) as f:
            for line in f:
                yield line.split()

before training the model:

# training the model

sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')

# min_count prunes the internal dictionary: words that appear only once in the
# corpus are probably uninteresting typos and garbage, and there isn't enough
# data to do any meaningful training on them, so it's best to ignore them.
model = gensim.models.Word2Vec(sentences, min_count=2)
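
As a quick sanity check after training (a minimal sketch; 'liberté' is only a placeholder query word, substitute any token that actually occurs in your own corpus):

# number of distinct words that made it into the vocabulary
print(len(model.wv))

# nearest neighbours of a word from the corpus
# ('liberté' is just an example token, use one from your own data)
print(model.wv.most_similar('liberté', topn=5))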

Upvotes: 1

gojomo

Reputation: 54173

Your sentences list is likely empty. The only line of code that adds anything to it requires line to be an empty string and sentence to be non-empty. Maybe that's never happening.

Check the value of sentences before creating the model. Make sure it has the expected length, in number of texts, and look at the first few (say sentences[0:2]) to make sure they look OK. Each item in sentences should itself be a list-of-strings.
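
For example, a quick check along these lines (assuming sentences is the list built by the reading loop above) shows whether anything was collected at all:

print(len(sentences))   # total number of collected sentences; 0 means the parsing found nothing
print(sentences[0:2])   # each item should itself be a list of word strings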

If it's not, debug the code that reads the files and assembles the sentences sequence until it looks as expected.

If you're still having problems, in either an edit to this question, or a followup question, be sure to:

  • show the entire error message you're receiving, including all lines of 'traceback' showing filenames, lines-of-code, & line-numbers
  • describe more about your corpus files, such as an example of some of its contents

Upvotes: 1
