Shoaibkhanz

Reputation: 2082

Doc2vec in gensim using CSV

I am using the gibberish review data below to train a doc2vec model in gensim, and I run into 2 errors.

1st: TaggedDocument takes 2 arguments. I am unable to pass the Sr field as the 2nd argument, so I resort to a simple string ('tag') in order to proceed further.

2nd: When I reach the for loop near the end of the code, I get the following error.

ValueError: You must specify either total_examples or total_words, for proper job parameters updation and progress calculations. The usual value is total_examples=model.corpus_count.

| Sr   | review                                                     |
|------|------------------------------------------------------------|
| 123  | This is frustrating                                        |
| 456  | I am eating in a bowl and this is frustrating              |
| 678  | Summer has come and the weather is hot and I feel very hot |
| 1234 | When will winter come back I love the cool weather         |

import pandas as pd
import numpy as np
import gensim

file = pd.read_csv('/Users/test_text.csv')

file1 = [line.split() for line in file.review]

sent = [gensim.models.doc2vec.TaggedDocument(lines,'tag') for lines in file1]
model = gensim.models.Doc2Vec(alpha=0.025, min_alpha=0.025,min_count=1)  
model.build_vocab(sent)
for epoch in range(10):
        model.train(sent)
        model.alpha -= 0.002
        model.min_alpha = model.alpha 
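
A minimal sketch of how both issues could be addressed, assuming gensim 3.x (where TaggedDocument expects its tags as a list, and train() requires an explicit total_examples and epochs); the Sr values serve as one-element tag lists:

import pandas as pd
import gensim

file = pd.read_csv('/Users/test_text.csv')

# tags must be a list; use each row's Sr value as a one-element tag list
sent = [gensim.models.doc2vec.TaggedDocument(words=str(row.review).split(),
                                             tags=[str(row.Sr)])
        for row in file.itertuples()]

model = gensim.models.Doc2Vec(alpha=0.025, min_alpha=0.025, min_count=1)
model.build_vocab(sent)
for epoch in range(10):
    # total_examples tells train() how many documents one pass covers;
    # epochs=1 leaves the manual alpha-decay loop in charge of the passes
    model.train(sent, total_examples=model.corpus_count, epochs=1)
    model.alpha -= 0.002
    model.min_alpha = model.alpha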

Upvotes: 1

Views: 2750

Answers (1)

amirouche

Reputation: 7873

I am not sure how to do that with pandas. That said, using the csv module, you can do the following:

import csv
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

# DictReader yields each row as a dict keyed by the CSV header (Sr, review)
texts = csv.DictReader(open('test_text.csv'))
# the second TaggedDocument argument is a list of tags, here the Sr value
documents = [TaggedDocument(text['review'].split(), [text['Sr']]) for text in texts]
model = Doc2Vec(documents, vector_size=100, window=8, min_count=2, workers=7)

# Then you can infer a new vector and compute the most similar documents:
vector = model.infer_vector(['frustrating', 'bowl', 'noodle'])
print(model.docvecs.most_similar([vector]))

It will output something like:

[('123', 0.07377214729785919),
 ('1234', 0.019198982045054436),
 ('456', 0.011939050629734993),
 ('678', -0.14281529188156128)]

In your case the dataset fits in memory, so you don't need to use the API you started with.
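
If the data ever stops fitting in memory, the API you started with still works; here is a sketch, again assuming gensim 3.x, that streams the rows from disk instead of materialising the list:

import csv
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

class CsvCorpus:
    # restartable iterable: gensim iterates it once for build_vocab() and
    # once per training pass, so it must reopen the file each time
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path) as f:
            for row in csv.DictReader(f):
                yield TaggedDocument(row['review'].split(), [row['Sr']])

corpus = CsvCorpus('test_text.csv')
model = Doc2Vec(vector_size=100, window=8, min_count=2, workers=7)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)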

Upvotes: 2
