jxn

Reputation: 8055

python gensim retrieve original sentences from doc2vec taggedlinedocument

I am using Gensim's Doc2Vec to read in my text file, which contains one sentence per line. TaggedLineDocument reads the file into (words, tags) pairs, where the words are a tokenized list of terms and the tag is the sentence number.

Here is my code:

from gensim import utils
from gensim.models.doc2vec import LabeledSentence,TaggedLineDocument
from gensim.models import Doc2Vec

# Write the tokenized form of each input line to new_file.txt
new_file = open('new_file.txt','w')
with open('myfile.txt','r') as inp:
    for line in inp:
        new_file.write(str(utils.simple_preprocess(line)) + "\n")
new_file.close()

Example output of new file:

[u'hi', u'how', u'are', u'you']
[u'its', u'such', u'great', u'day']
[u'its', u'such', u'great', u'day']
[u'its', u'such', u'great', u'day']

Then I feed the file into Gensim's TaggedLineDocument class:

s = TaggedLineDocument('myfile.txt')
for k,v in s:
    print k, v

Example Output:

[u'hi', u'how', u'are', u'you'] [0]
[u'hi', u'how', u'are', u'you'] [1]
[u'hi', u'how', u'are', u'you'] [2]
[u'its', u'such', u'a', u'great', u'day'] [3]
[u'its', u'such', u'a', u'great', u'day'] [4]

My question: given a tag id (for example, 0), how do I get back the original sentence?

Upvotes: 2

Views: 630

Answers (1)

gojomo

Reputation: 54243

Gensim's Word2Vec/Doc2Vec models don't store the corpus data – they only examine it, in multiple passes, to train up the model. If you need to retrieve the original texts, you should use your own data structure.
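One way to keep "your own data structure" is a plain list of the file's lines. TaggedLineDocument tags each line with its zero-based line number (as the example output in the question shows), so the tag can double as a list index. The sketch below assumes the same file is passed to both TaggedLineDocument and the loader; the helper names `load_sentences` and `sentence_for_tag` are made up for illustration and are not part of Gensim.

```python
import io


def load_sentences(path):
    """Read every line of the file into a list, so list index == line number.

    Blank lines are kept on purpose: skipping them would shift the indices
    away from the line-number tags.
    """
    with io.open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def sentence_for_tag(sentences, tag):
    """Look up the original text for an integer line-number tag."""
    return sentences[tag]
```

With `sentences = load_sentences('myfile.txt')`, calling `sentence_for_tag(sentences, 0)` returns the first line of the file. If the file is too large to hold in memory, a dictionary of byte offsets or a small database keyed by line number would serve the same purpose.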

Upvotes: 1
