Lolly

Reputation: 36442

Doc2Vec: finding the most similar sentence

I am trying to find similar sentences using doc2vec. What I am not able to get is the actual sentence from the trained sentences that matches.

Below is the code from this article:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model = Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)


# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

But the above code only gives me vectors of numbers. How can I get the actual sentence matched from the training data? For example, in this case I am expecting the result "I love building chatbots".

Upvotes: 1

Views: 6639

Answers (3)

Harshal Parekh

Reputation: 6027

The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]

This shows the similarity score of each document in data against the requested document, sorted in descending order.

Based on this, index '2' in data is the closest to the requested data, i.e. test_data.

print(data[int(similar_doc[0][0])])
# prints: I love building chatbots

Note: this code gives different results every time; you may need a better model or more training data.

Upvotes: 3

Shankar Ganesh Jayaraman

Reputation: 1491

To get the actual result you have to pass the inferred vector of the text to the most_similar method. Hard-coding most_similar('1') will always give static results.

similar_doc = model.docvecs.most_similar([v1])

A modified version of your code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

def output_sentences(most_similar):
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(most_similar)//2), ('LEAST', len(most_similar) - 1)]:
      print(u'%s %s: %s\n' % (label, most_similar[index][1], data[int(most_similar[index][0])]))

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model = Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find most similar doc using tags
similar_doc = model.docvecs.most_similar([v1])
print(similar_doc)

# to print similar sentences
output_sentences(similar_doc) 


# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

Semantic “Similar Sentences” with your own dataset (NLP)

If you are looking for accurate predictions with your own, smaller dataset, you can try:

pip install similar-sentences

Upvotes: 2

gojomo

Reputation: 54243

Doc2Vec isn't going to give good results on toy-sized datasets, so you shouldn't expect anything meaningful until using much more data.

But also, a Doc2Vec model doesn't retain within itself the full texts you supply during training. It just remembers the learned vectors for each text's tag – which is usually a unique identifier. So when you get back results from most_similar(), you'll be getting back tag values, which you then need to look-up yourself, in your own code/data, to retrieve full documents.
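To illustrate that look-up step, here is a minimal plain-Python sketch (no model needed; `tag_to_text` is a hypothetical mapping I'm introducing, built from the question's `data` list, and the `similar_doc` pairs mimic the shape of a `most_similar()` result):

```python
# The training sentences from the question. Doc2Vec only remembers
# the tags, so we keep our own mapping from tag back to the full text.
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

# Tags were assigned as str(i) during training, so rebuild that mapping.
tag_to_text = {str(i): text for i, text in enumerate(data)}

# Shape of a most_similar() result: (tag, similarity) pairs,
# sorted by similarity in descending order.
similar_doc = [('2', 0.9918), ('0', 0.9893), ('3', 0.9854)]

# Resolve each returned tag back to its original sentence.
for tag, score in similar_doc:
    print("%.4f  %s" % (score, tag_to_text[tag]))
```

The first line printed resolves tag '2' back to "I love building chatbots", which is exactly the look-up the model does not do for you.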

Separately:

Calling train() multiple times in a loop like you're doing is a bad and error-prone idea, as is managing alpha/min_alpha explicitly. You should not follow any tutorial/guide which recommends that approach.

Don't change the defaults for the alpha parameters, and call train() once, with your desired epochs count – and it will do the right number of passes, and right learning-rate management.

Upvotes: 2
