Reputation: 71
I am trying to fit a Word2Vec model. According to the documentation for Gensim's Word2Vec we do not need to call model.build_vocabulary
before using it.
But yet it is asking for me to do it. I have tried calling this function and it has not worked. I also fitted a Word2Vec model before without needing to call model.build_vocabulary
.
Am I doing something wrong? Here is my code:
from gensim.models import Word2Vec
dataset = pd.read_table('genemap_copy.txt',delimiter='\t', lineterminator='\n')
def row_to_sentences(dataframe):
columns = dataframe.columns.values
corpus = []
for index,row in dataframe.iterrows():
if index == 1000:
break
sentence = ''
for column in columns:
sentence += ' '+str(row[column])
corpus.append([sentence])
return corpus
corpus = row_to_sentences(dataset)
clean_corpus = [[sentence[0].lower()] for sentence in corpus ]
# model = Word2Vec()
# model.build_vocab(clean_corpus)
model = Word2Vec(clean_corpus, size=100, window=5, min_count=5, workers=4)
Help is greatly appreciated! Also I am using macOS Sierra. There is not much support online for using Gensim with Mac D: .
Upvotes: 2
Views: 7136
Reputation: 571
Is it that you are appending a new list containing a single sentence each time? corpus.append([sentence])
. You need to feed Word2Vec a series of sentences, but not necessarily sentences gathered by document. I'm also not clear on what is in your df but have you tokenised the sentences already?
My generator class I've used before for Word2Vec...
from nltk.tokenize import sent_tokenize
from gensim.utils import simple_preprocess
class MySentences(object):
def __init__(self, docs):
self.corpus = docs
def __iter__(self):
for doc in self.corpus:
doc_sentences = sent_tokenize(doc)
for sent in doc_sentences:
yield simple_preprocess(sent) # yields a tokenized
sentence ['like','this','one','.']
sentences = MySentences(df['text'].tolist())
model = gensim.models.Word2Vec(sentences, min_count=5, workers=8, size=300, sg=1)
Upvotes: 1
Reputation: 71
I think my problem was having the parameter min_count=5
so it was not considering most of my words if they did not appear more than 5 times.
Upvotes: 5
Reputation: 320
Try with LineSentence
:
from gensim.models.word2vec import LineSentence
and then train your corpus with
model = Word2Vec(LineSentence(clean_corpus), size=100, window=5, min_count=5, workers=4)
Upvotes: 1