Reputation: 115
I am having trouble training my own word2vec model on the .txt files.
The code:
import gensim
import json
import pandas as pd
import glob
import gensim.downloader as api
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors

# loading the .txt files
sentences = []
sentence = []
for doc in glob.glob('./data/*.txt'):
    with(open(doc, 'r')) as f:
        for line in f:
            line = line.rstrip()
            if line == "":
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                cols = line.split("\t")
                if len(cols) > 4:
                    form = cols[1]
                    lemma = cols[2]
                    pos = cols[3]
                    if pos != "PONCT":
                        sentence.append(form.lower())

# trying to train the model
from gensim.models import Word2Vec
model_hugo = Word2Vec(sentences, vector_size=200, window=5, epochs=10, sg=1, workers=4)
Error message:
RuntimeError: you must first build vocabulary before training the model
How do I build the vocabulary?
The code works with the sample .conll files, but I want to train the model on my own data.
Upvotes: 0
Views: 1312
Reputation: 115
Thanks to @gojomo's suggestion and to this answer, I resolved the empty sentences issue. I needed the following block of code:
# An iterable that reads the file one line at a time instead of loading everything into memory at once.
# Each line is treated as one sentence and split on whitespace.
class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # the with-block closes the file once the pass over it is finished
        with open(self.filepath) as f:
            for line in f:
                yield line.split()
before training the model:
# training the model
sentences = SentenceIterator('/content/drive/MyDrive/rousseau/rousseau_corpus.txt')
# min_count is for pruning the internal dictionary. Words that appear only once
# in the corpus are probably uninteresting typos and garbage. In addition,
# there's not enough data to make any meaningful training on those words,
# so it's best to ignore them.
model = gensim.models.Word2Vec(sentences, min_count=2)
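Since the question reads several .txt files with glob, the same idea can be extended to a whole folder. This is only a minimal sketch, assuming plain-text files with one sentence per line; the class name MultiFileSentenceIterator, the UTF-8 encoding, and the lowercasing step are illustrative assumptions, not part of the original code. It is written as a class with __iter__ rather than a generator because gensim goes over the corpus more than once (once to build the vocabulary, then once per epoch), and a plain generator would be exhausted after the first pass.

```python
import glob


class MultiFileSentenceIterator:
    """Yield one tokenized sentence per non-blank line, across many files.

    Hypothetical helper for illustration; assumes one sentence per line.
    """

    def __init__(self, pattern, encoding="utf-8"):
        self.pattern = pattern      # glob pattern, e.g. './data/*.txt'
        self.encoding = encoding    # assumption: the files are UTF-8

    def __iter__(self):
        # sorted() keeps the file order stable between passes
        for path in sorted(glob.glob(self.pattern)):
            with open(path, encoding=self.encoding) as f:
                for line in f:
                    tokens = line.lower().split()
                    if tokens:      # skip blank lines
                        yield tokens
```

It could then be passed to the model in the same way, for example gensim.models.Word2Vec(MultiFileSentenceIterator('./data/*.txt'), min_count=2).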
Upvotes: 1
Reputation: 54173
Your sentences list is likely empty. The only line of code that adds anything to it requires line to be an empty string and sentence to be non-empty. Maybe that's never happening.
Check the value of sentences before creating the model. Make sure it has the expected length, in number of texts, and look at the first few (say sentences[0:2]) to make sure they look OK. Each item in sentences should itself be a list-of-strings.
If it's not, debug your code that reads the files, and assembles the sentences sequence, until it looks as expected.
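The checks described above could be sketched as a small helper to run before creating the model. The name check_sentences is hypothetical (it is not a gensim function), and the sketch assumes the corpus fits in memory, as it does in the question's code.

```python
def check_sentences(sentences, preview=2):
    """Sanity-check a corpus before handing it to Word2Vec.

    Hypothetical helper, not part of gensim. Raises ValueError if the
    corpus is empty or if any item is not a list of strings.
    Returns the number of texts.
    """
    # Materialize so the length can be measured (assumes it fits in memory)
    sentences = list(sentences)
    if not sentences:
        raise ValueError("sentences is empty: the file-reading loop added nothing")
    for i, sent in enumerate(sentences):
        # A bare string is a common mistake: Word2Vec would treat each
        # character as a separate word
        if not isinstance(sent, (list, tuple)) or not all(
            isinstance(tok, str) for tok in sent
        ):
            raise ValueError(f"item {i} is not a list of strings: {sent!r}")
    print(f"{len(sentences)} texts; first {preview}: {sentences[:preview]}")
    return len(sentences)
```

Calling check_sentences(sentences) right before the Word2Vec(...) line would fail early with a clearer message than the "you must first build vocabulary" error if the reading loop produced nothing.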
If you're still having problems, in either an edit to this question, or a followup question, be sure to:
Upvotes: 1