Mahmood Kohansal

Reputation: 1041

Gensim Doc2Vec - Pass corpus sentences to Doc2Vec function

I used the MySentences class to extract sentences from all files in a directory and used those sentences to train a word2vec model. My dataset is unlabeled.

import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # yield one tokenized line at a time from every file in the directory
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('sentences')
model = gensim.models.Word2Vec(sentences)

Now I want to use that class to build a doc2vec model. I read the Doc2Vec reference page. The Doc2Vec() constructor takes sentences as a parameter, but it doesn't accept the sentences variable above and raises this error:

AttributeError: 'list' object has no attribute 'words'

What is the problem? What is the correct type of that parameter?

Update :

I think the unlabeled data is the problem. It seems doc2vec needs labeled data.

Upvotes: 1

Views: 1126

Answers (2)

Borislav Stoilov

Reputation: 3697

Unlike word2vec, doc2vec needs every training entry to be tagged with a unique id. This is needed because, when it later predicts similarities, its results are doc ids (the unique ids of the training entries), just as words are the predictions for word2vec.

Here is a piece of my code that does exactly what you want to achieve:

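from gensim.models.doc2vec import TaggedDocument

class DynamicCorpus(object):
    def __iter__(self):
        # csv_file is the path to the input file, defined elsewhere
        with open(csv_file) as fp:
            for line in fp:
                # each line has the form id:text; split only on the first ':'
                doc_id, text = line.rstrip('\n').split(':', 1)
                yield TaggedDocument(text.split(), [doc_id])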

My csv file has the format id:text.

Later you can just feed the corpus to the model:

from gensim.models import doc2vec

corpus = DynamicCorpus()

d2v = doc2vec.Doc2Vec(min_count=15,
                      window=10,
                      vector_size=300,
                      workers=15,
                      alpha=0.025,
                      min_alpha=0.00025,
                      dm=1)
d2v.build_vocab(corpus)

# training_iterations (number of outer passes) is defined elsewhere
for epoch in range(training_iterations):
    # d2v.epochs replaces the older d2v.iter attribute
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)
    d2v.alpha -= 0.0002
    d2v.min_alpha = d2v.alpha
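
For the directory-of-files layout in the question, the same idea can be sketched as follows. This is only a minimal sketch, not the asker's or the library's code: the class name TaggedSentences and the "<file>_<line>" tag scheme are hypothetical choices, and the namedtuple fallback is there only so the snippet runs without gensim installed (it has the same words and tags fields as gensim's TaggedDocument).

```python
import os
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback so the sketch runs without gensim; same field names as gensim's class.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


class TaggedSentences(object):
    """Re-iterable corpus: one TaggedDocument per line of every file in dirname."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # sorted() makes the iteration order (and so the tags) reproducible
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding="utf-8") as fp:
                for line_no, line in enumerate(fp):
                    # The "<file>_<line>" tag is an arbitrary, hypothetical scheme;
                    # any id that is unique per document would do.
                    yield TaggedDocument(words=line.split(),
                                         tags=["%s_%d" % (fname, line_no)])
```

An instance such as TaggedSentences('sentences') could then stand in for corpus in the build_vocab and train calls above, since each item now has the words attribute that the AttributeError complained about.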

Upvotes: 0

Mahmood Kohansal

Reputation: 1041

There is no need for extra classes to solve this problem. Newer versions of the library add a TaggedLineDocument class, which reads a file with one document per line and tags each line with its line number, so no manual labels are needed.

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
sentences = TaggedLineDocument(INPUT_FILE)

and then train the model:

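model = Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)

for epoch in range(10):
    # recent gensim versions require total_examples and epochs to be passed explicitly
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    print(epoch)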
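
To illustrate the shape of the items that TaggedLineDocument hands to Doc2Vec, here is a rough pure-Python equivalent. It is a sketch only, not gensim's implementation: LineTagger is a hypothetical name, and the namedtuple is a stand-in that mirrors the words and tags fields of gensim's TaggedDocument.

```python
from collections import namedtuple

# Stand-in with the same field names as gensim's TaggedDocument (words, tags).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


class LineTagger(object):
    """Treat each line of a file as one document, tagged with its line number."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every call keeps the corpus re-iterable.
        with open(self.path, encoding="utf-8") as fp:
            for line_no, line in enumerate(fp):
                # The zero-based line number serves as the document's only tag.
                yield TaggedDocument(words=line.split(), tags=[line_no])
```

Because the tags are just line numbers, the input file itself can stay unlabeled; the line number is what identifies each document afterwards.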

Upvotes: 2
