Reputation: 351
I have trained a doc2vec model on 4 million records. I want to find the sentence from my data most similar to a new sentence I put in, but I am getting very bad results.
A sample of the data:
Xolo Era (Black, 8 GB)(1 GB RAM).
Sugar C6 (White, 16 GB)(2 GB RAM).
Celkon Star 4G+ (Black & Dark Blue, 4 GB)(512 MB RAM).
Panasonic Eluga I2 (Metallic Grey, 16 GB)(2 GB RAM).
Itel IT 5311(Champagne Gold).
Itel A44 Pro (Champagne, 16 GB)(2 GB RAM).
Nokia 2 (Pewter/ Black, 8 GB)(1 GB RAM).
InFocus Snap 4 (Midnight Black, 64 GB)(4 GB RAM).
Panasonic P91 (Black, 16 GB)(1 GB RAM).
Before passing in this data I preprocessed it: 1) stop-word removal, 2) special-character and numeric-value removal, 3) lowercasing. I performed the same steps in the testing process.
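Roughly, my preprocessing looks like this (a simplified sketch; STOP_WORDS here is just an illustrative subset, not my full list):

```python
import re

STOP_WORDS = {'for', 'and', 'with', 'the'}  # illustrative subset only

def preprocess(text):
    """1) lowercase, 2) strip special characters and numbers, 3) drop stop words."""
    text = text.lower()
    tokens = re.findall(r'[a-z]+', text)  # keeps letters only, so digits vanish
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess('Nokia 2 (Pewter/ Black, 8 GB)(1 GB RAM)'))
# → ['nokia', 'pewter', 'black', 'gb', 'gb', 'ram']
```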
The code which I used for training:
from gensim.models import doc2vec

# TaggedLineDocument generates a tag (the line number) for each line of the file
sentences = doc2vec.TaggedLineDocument('training_data.csv')

max_epochs = 100
vec_size = 100
alpha = 0.025

model = doc2vec.Doc2Vec(vector_size=vec_size,
                        alpha=alpha,
                        min_alpha=0.00025,
                        dm=1,
                        min_count=1)
model.build_vocab(sentences)
model.train(sentences, epochs=max_epochs, total_examples=model.corpus_count)
model.save('My_model.doc2vec')
I am new to gensim and doc2vec, so I followed an example for training my model; please correct me if I have used the wrong parameters.
On the testing side:
import gensim

model = gensim.models.doc2vec.Doc2Vec.load('My_model.doc2vec')
test = 'nokia pewter black gb gb ram'.split()
new_vector = model.infer_vector(test)
similar = model.docvecs.most_similar([new_vector])
print(similar)  # returns (tag, similarity score) pairs
For testing, I passed in the same sentences that are present in the training data, but the model does not return related documents as most similar. For example, for "nokia pewter black gb gb ram" I got "lootmela tempered glass guard for micromax canvas juice" as the most similar sentence, with a similarity score of 0.80.
So my questions to you:
1) Do I need to reconsider the parameters for model training?
2) Is the training process correct?
3) How can I build a more accurate model for similarity?
4) Apart from doc2vec, what would you suggest for similarity? (Keep in mind that I have very large data, so training and testing time should not be too long.)
Please forgive me if the question formatting is not good.
Upvotes: 1
Views: 3035
Reputation: 54153
Doc2Vec will have a harder time with shorter texts – and it appears your texts may only be 5-10 tokens.
Your texts also don't appear to be natural-language sentences, but rather product names. Whether Doc2Vec/Word2Vec-like analyses will do anything useful with such text fragments, which don't have the same sort of co-occurrence diversity as natural spoken/written language, will depend on the characteristics of the data. I'm not sure it would, but it might – only trying/tweaking it will tell.
But it's not clear what your desired results should be. What kinds of product names should be returned as most similar? Same brand? Same color? (If either of those, you could use a much simpler model than Doc2Vec training.) Same specs, including memory? (If so, you wouldn't want to throw away numeric info – instead you might want to canonicalize it into single tokens that are meaningful at the word-by-word level at which Doc2Vec works, such as turning "64 GB" into "64gb" or "2 GB RAM" into "2gbram".)
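A rough sketch of such canonicalization, using simple regex rules (the patterns here are illustrative and would need tuning against the real data):

```python
import re

def canonicalize(name):
    """Collapse multi-word spec phrases into single tokens so they survive
    word-level tokenization (regex rules here are illustrative, not complete)."""
    name = name.lower()
    name = re.sub(r'(\d+)\s*gb\s*ram', r'\1gbram', name)
    name = re.sub(r'(\d+)\s*mb\s*ram', r'\1mbram', name)
    name = re.sub(r'(\d+)\s*gb', r'\1gb', name)
    return name

print(canonicalize('Nokia 2 (Pewter/ Black, 8 GB)(1 GB RAM)'))
# → nokia 2 (pewter/ black, 8gb)(1gbram)
```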
As this isn't regular natural-language text, you likely have a very small, constrained vocabulary – perhaps a few thousand tokens, rather than the tens-to-hundreds-of-thousands in normal language. And each token may only appear in a small number of examples (a single producer's product line), and absolutely never appear alongside closely-related terms from similar competitive products (because product names don't mix proprietary names from competitors). These factors will also present a challenge for this sort of algorithm – which needs many varied, overlapping uses of words, and many words with fine shades of meaning, to gradually nudge vectors into useful arrangements. A small vocabulary may require a much smaller model (lower vector_size) to avoid overfitting. If you had a dataset which hinted at which products people consider comparable – either mentioned in the same reviews, or searched-for by the same people, or bought by the same people – you might want to create extra synthetic text examples which include those multiple products in the same text, so that the algorithm has a chance of learning such relationships.
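A minimal sketch of building such synthetic texts, assuming you already have pairs of products known to be comparable (the pairing source here is hypothetical):

```python
def synthetic_texts(pairs):
    """Merge the tokens of two products known to be comparable into one
    synthetic 'text', so their brand/spec words actually co-occur in training."""
    return [(a + ' ' + b).split() for a, b in pairs]

# hypothetical co-purchase data
co_bought = [('nokia 2 pewter 8gb', 'samsung galaxy j2 black 8gb')]
print(synthetic_texts(co_bought))
```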
Much Doc2Vec/Word2Vec work doesn't bother with removing stop-words, and may retain punctuation as standalone words.
You should show examples of what is actually in your "training_data.csv" file, to see what the algorithm is actually working with. Note that TaggedLineDocument wouldn't handle a real comma-separated-values file correctly – it expects just one text per line, already whitespace-delimited. (Any commas would be left in place, perhaps attached to field tokens.)
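One way around that is to parse each line yourself and yield (tag, tokens) pairs, which you can then wrap in gensim's TaggedDocument before training. A plain-Python sketch (the tokenization rule is an assumption):

```python
import csv
import re

def tagged_pairs(lines):
    """Yield (tag, tokens) pairs from raw lines of a product-name file.
    Commas and punctuation are stripped so a CSV-ish file still tokenizes
    cleanly; wrap each pair in gensim's TaggedDocument before training."""
    for i, row in enumerate(csv.reader(lines)):
        text = ' '.join(row).lower()
        yield i, re.findall(r'[a-z0-9]+', text)

# e.g.: with open('training_data.csv') as f:
#           docs = [TaggedDocument(words, [tag]) for tag, words in tagged_pairs(f)]
```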
Lowering min_count to 1 can often worsen results, because such rare tokens (with only one or a few occurrences) don't get good vectors, yet if there are a lot of them in aggregate (as there are in normal texts, though there might not be here) they can serve as training noise that degrades other vectors.
You don't need to change min_alpha, and in general you should only tinker with defaults if you're sure what they mean and have a rigorous, repeatable scoring process for testing whether changes are improving results or not. (In the case of achieving a good similarity measure, such a score might be a set of pairs of items that should be more similar to each other than either is to some third item. For each algorithm/parameter combination you try, how many such pairs are properly discovered as "more similar" than either item to the third?)
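Such a scoring process might be sketched like this, with a toy word-overlap similarity standing in for any model's similarity function (both the triplets and the similarity function here are illustrative placeholders):

```python
def triplet_accuracy(sim, triplets):
    """Fraction of (a, b, c) triplets where the model ranks the known-similar
    item b closer to a than the distractor c: sim(a, b) > sim(a, c)."""
    hits = sum(1 for a, b, c in triplets if sim(a, b) > sim(a, c))
    return hits / len(triplets)

def jaccard(a, b):
    """Toy word-overlap similarity, as a stand-in for a trained model."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

triplets = [
    ('nokia 2 black 8gb', 'nokia 2 pewter 8gb', 'tempered glass guard'),
    ('panasonic p91 black 16gb', 'panasonic eluga i2 16gb', 'sugar c6 white'),
]
print(triplet_accuracy(jaccard, triplets))  # → 1.0
```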
Inference, especially on short texts, may benefit from different parameters (such as more inference passes) – and the latest gensim release (3.5.0, July 2018) includes an important fix and adjustment of defaults for infer_vector(). So be sure to use that version, and test the improvement of supplying it a larger epochs value.
Overall, I'd suggest:
- being clear about what a good similarity result should be, with examples of most- and least-similar items
- using such examples to create a rigorous, automated evaluation of model quality
- preprocessing in a domain-sensitive way that preserves meaningful distinctions; trying to get/create texts that don't silo brand words into tiny single-product examples that hide potential cross-brand relationships
- not changing defaults unless you're sure it's helping
- enabling logging at the INFO level so you can see the progress of the algorithm and reporting of things like the effective vocabulary size
You still might not get great results, depending on what your real 'similarity' goal is – product names aren't the same sort of natural language that Doc2Vec works best on.
Another baseline to consider is just treating each product name as a 'bag of words', which gives rise to a one-hot vector marking which words (from the full vocabulary) it contains. The cosine similarity of these one-hot vectors (perhaps with extra term weighting) would be a simple measure, and would at least capture things like putting all 'black' items somewhat nearer each other, or all 'nokia' items, etc.
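A minimal sketch of that baseline (binary bag-of-words, no term weighting – the cosine of two binary vectors reduces to overlap over the geometric mean of set sizes):

```python
import math

def one_hot_cosine(a, b):
    """Cosine similarity of two product names treated as binary bags-of-words.
    Equivalent to |A ∩ B| / sqrt(|A| * |B|) on the underlying word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))

print(one_hot_cosine('nokia 2 pewter black 8gb', 'nokia 2 black 16gb'))
print(one_hot_cosine('nokia 2 pewter black 8gb', 'tempered glass guard'))  # → 0.0
```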
Upvotes: 13