Reputation: 652
I am training gensim's Doc2Vec (from gensim.models.doc2vec import Doc2Vec) on two identical sentences (documents), and when I check the vectors for each sentence they are completely different. Does the neural network use a different random initialisation per sentence?
# imports
from gensim.models.doc2vec import LabeledSentence
from gensim.models.doc2vec import Doc2Vec
from gensim import utils
# Document iteration class (turns many documents into sentences,
# each document being one sentence)
class LabeledDocs(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that prefixes (the dict values) are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                # print fin.read().strip(r"\n")
                yield LabeledSentence(utils.to_unicode(fin.read()).split(),
                                      [prefix])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                #print fin, fin.read()
                self.sentences.append(
                    LabeledSentence(utils.to_unicode(fin.read()).split(),
                                    [prefix]))
        return self.sentences
# play and play3 are names of identical documents (diff gives nothing)
inp = LabeledDocs({"play":"play", "play3":"play3"})
model = Doc2Vec(size=20, window=8, min_count=2, workers=1, alpha=0.025,
                min_alpha=0.025, batch_words=1)
model.build_vocab(inp.to_array())
for epoch in range(10):
    model.train(inp)
# after this, model.docvecs["play"] is very different from
# model.docvecs["play3"]
Why is this? Both play and play3 contain:
foot ball is a sport
played with a ball where
teams of 11 each try to
score on different goals
and play with the ball
Upvotes: 2
Views: 1125
Reputation: 2069
Yes, each sentence vector is initialized differently. This happens in the reset_weights method; the code that randomly initializes the sentence vectors is this:
for i in xrange(length):
    # construct deterministic seed from index AND model seed
    seed = "%d %s" % (model.seed, self.index_to_doctag(i))
    self.doctag_syn0[i] = model.seeded_vector(seed)
Here you can see that each sentence vector is initialized using the random seed of the model and the tag of the sentence. Therefore it makes sense that, in your example, play and play3 result in different vectors.
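To illustrate the effect of seeding by tag, here is a minimal sketch that does not use gensim at all. The seeded_vector function below is a hypothetical stand-in written for this example (it hashes the seed string with zlib.crc32 and draws small values from Python's random module); the real library's hashing and scaling may differ, and the model seed of 1 is an assumption.

```python
import random
import zlib


def seeded_vector(seed_string, vector_size):
    """Hypothetical stand-in for a seeded initializer (not gensim's code)."""
    # Hash the seed string to a 32-bit integer so the result is reproducible
    # across runs (the built-in hash() of a str is not stable across processes).
    rng = random.Random(zlib.crc32(seed_string.encode("utf-8")) & 0xffffffff)
    # Draw small random values centred on zero.
    return [(rng.random() - 0.5) / vector_size for _ in range(vector_size)]


model_seed = 1  # assumed value, for illustration only

# Same seed format as in the snippet above: "<model seed> <doc tag>"
vec_play = seeded_vector("%d %s" % (model_seed, "play"), 20)
vec_play3 = seeded_vector("%d %s" % (model_seed, "play3"), 20)
vec_play_again = seeded_vector("%d %s" % (model_seed, "play"), 20)

# Same tag and seed give the same starting vector.
print(vec_play == vec_play_again)
# Different tags give different starting vectors, even for identical text.
print(vec_play == vec_play3)
```

The point is only that the starting vector depends on the tag, not on the document's words, so two identical documents with different tags begin training from different places.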
However, if you train the model properly, I would expect both vectors to end up very close to each other.
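One way to check how close two trained vectors are is cosine similarity. The sketch below uses a hypothetical helper, cosine_similarity, and made-up placeholder vectors (not results from any model); in your case you would pass in model.docvecs["play"] and model.docvecs["play3"] instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors, invented for illustration only.
vec_a = [0.10, -0.20, 0.30]
vec_b = [0.11, -0.19, 0.29]   # nearly the same direction as vec_a
vec_c = [-0.30, 0.10, 0.05]   # a rather different direction

print(cosine_similarity(vec_a, vec_b))  # close to 1 for similar vectors
print(cosine_similarity(vec_a, vec_c))  # much lower for dissimilar vectors
```

A value near 1 means the vectors point in almost the same direction; values near 0 or below mean they do not.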
Upvotes: 3