Francisco Vargas

Reputation: 652

Why does gensim Doc2Vec give me different vectors for the same sentence?

I am training gensim's Doc2Vec (from gensim.models.doc2vec import Doc2Vec) on two identical sentences (documents), but the vectors it produces for them are completely different. Does the neural network use a different random initialisation for each sentence?

# imports
from gensim.models.doc2vec import LabeledSentence
from gensim.models.doc2vec import Doc2Vec
from gensim import utils

# Document iteration class (turns many documents into sentences,
# each document being one sentence)
class LabeledDocs(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin.read().strip(r"\n")
                yield LabeledSentence(utils.to_unicode(fin.read()).split(),
                                      [prefix])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                #print fin, fin.read()
                self.sentences.append(
                    LabeledSentence(utils.to_unicode(fin.read()).split(),
                                    [prefix]))
        return self.sentences

# play and play3 are names of identical documents (diff gives nothing)
inp = LabeledDocs({"play":"play", "play3":"play3"})
model = Doc2Vec(size=20, window=8, min_count=2, workers=1, alpha=0.025,
                min_alpha=0.025, batch_words=1)
model.build_vocab(inp.to_array())
for epoch in range(10):
    model.train(inp)

# After training, model.docvecs["play"] is very different from
# model.docvecs["play3"]

Why is this? Both play and play3 contain:

foot ball is a sport
played with a ball where
teams of 11 each try to
score on different goals
and play with the ball

Upvotes: 2

Views: 1125

Answers (1)

Álvaro

Reputation: 2069

Yes, each sentence vector is initialized differently.

This happens in the reset_weights method, which initializes the sentence vectors randomly with this code:

for i in xrange(length):
    # construct deterministic seed from index AND model seed
    seed = "%d %s" % (model.seed, self.index_to_doctag(i))
    self.doctag_syn0[i] = model.seeded_vector(seed)

Here you can see that each sentence vector is initialized from a seed that combines the model's random seed with the tag of the sentence. Because the tags play and play3 differ, the two vectors start from different random values even though the text is identical.
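To illustrate the effect, here is a minimal standalone sketch that mimics the idea using only the standard library. It is not gensim's actual implementation: the seeded_vector function below, the CRC32 hashing, and the model_seed value of 1 are all assumptions made for the example.

```python
import random
import zlib


def seeded_vector(seed_string, size):
    """Hypothetical stand-in for a seeded initializer (not gensim's code)."""
    # Hash the seed string to a 32-bit integer so the result is reproducible.
    seed = zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF
    rng = random.Random(seed)
    # Small random values centred on zero, scaled by the vector size.
    return [(rng.random() - 0.5) / size for _ in range(size)]


model_seed = 1  # assumed value, for illustration only

# Same construction as in the snippet above: "<model seed> <doc tag>"
vec_play = seeded_vector("%d %s" % (model_seed, "play"), 20)
vec_play3 = seeded_vector("%d %s" % (model_seed, "play3"), 20)
vec_play_again = seeded_vector("%d %s" % (model_seed, "play"), 20)

print(vec_play == vec_play_again)  # same seed and tag: same starting vector
print(vec_play == vec_play3)       # different tag: different starting vector
```

The point is only that the starting vector depends on the tag, not on the document's words, so identical text under two tags begins training from two different places.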

However, if you train the model properly, I would expect both vectors to end up very close to each other.
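One way to check how close the two vectors end up is cosine similarity. The sketch below assumes that measure and uses made-up placeholder vectors (vec_a and vec_b are not real model output); in practice you would pass in the two trained document vectors instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors for illustration only, not taken from a trained model.
vec_a = [0.10, -0.20, 0.30]
vec_b = [0.11, -0.19, 0.29]

# Values near 1.0 mean the vectors point in almost the same direction.
print(cosine_similarity(vec_a, vec_b))
```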

Upvotes: 3
