Reputation: 351
I have trained a doc2vec model on 4 million records. I want to find the sentence from my data most similar to a new sentence I put in, but I am getting very bad results.
A sample of the data:
Xolo Era (Black, 8 GB)(1 GB RAM).
Sugar C6 (White, 16 GB)(2 GB RAM).
Celkon Star 4G+ (Black & Dark Blue, 4 GB)(512 MB RAM).
Panasonic Eluga I2 (Metallic Grey, 16 GB)(2 GB RAM).
Itel IT 5311(Champagne Gold).
Itel A44 Pro (Champagne, 16 GB)(2 GB RAM).
Nokia 2 (Pewter/ Black, 8 GB)(1 GB RAM).
InFocus Snap 4 (Midnight Black, 64 GB)(4 GB RAM).
Panasonic P91 (Black, 16 GB)(1 GB RAM).
Before passing in this data I preprocessed it: 1) stop-word removal, 2) special-character and numeric-value removal, 3) lowercasing. I performed the same steps in the testing process.
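Roughly, my preprocessing looks like this (a simplified sketch; STOP_WORDS here is just an illustrative subset, not my full list):

```python
import re

STOP_WORDS = {'for', 'and', 'with', 'the'}  # illustrative subset only

def preprocess(text):
    """1) lowercase, 2) strip special characters and numbers, 3) drop stop words."""
    text = text.lower()
    tokens = re.findall(r'[a-z]+', text)  # keeps letters only, so digits vanish
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess('Nokia 2 (Pewter/ Black, 8 GB)(1 GB RAM)'))
# → ['nokia', 'pewter', 'black', 'gb', 'gb', 'ram']
```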
The code which I used for training:
from gensim.models import doc2vec

# TaggedLineDocument generates a tag (the line number) for each line of the file
sentences = doc2vec.TaggedLineDocument('training_data.csv')

max_epochs = 100
vec_size = 100
alpha = 0.025

model = doc2vec.Doc2Vec(vector_size=vec_size,
                        alpha=alpha,
                        min_alpha=0.00025,
                        dm=1,
                        min_count=1)
model.build_vocab(sentences)
model.train(sentences, epochs=max_epochs, total_examples=model.corpus_count)
model.save('My_model.doc2vec')
I am new to gensim and doc2vec, so I followed an example for training my model; please correct me if I have used the wrong parameters.
On the testing side:
import gensim

model = gensim.models.doc2vec.Doc2Vec.load('My_model.doc2vec')
test = 'nokia pewter black gb gb ram'.split()
new_vector = model.infer_vector(test)
similar = model.docvecs.most_similar([new_vector])
print(similar)  # returns (tag, similarity score) pairs
For testing, I passed in the same sentences that are present in the training data, but the model does not return related documents as most similar. For example, for "nokia pewter black gb gb ram" I got "lootmela tempered glass guard for micromax canvas juice" as the most similar sentence, with a similarity score of 0.80.
So my questions to you:
1) Do I need to reconsider the parameters for model training?
2) Is the training process correct?
3) How can I build a more accurate model for similarity?
4) Apart from doc2vec, what would you suggest for similarity? (Keep in mind that I have very large data, so training and testing time should not be too long.)
Please forgive me if the question formatting is not good.
Upvotes: 1
Views: 3035
Reputation: 54153
Doc2Vec will have a harder time with shorter texts – and it appears your texts may only be 5-10 tokens.
Your texts also don't appear to be natural-language sentences, but rather product names. Whether Doc2Vec/Word2Vec-like analyses will do anything useful with such text fragments, which don't have the same sort of co-occurrence diversity as natural spoken/written language, will depend on the characteristics of the data. I'm not sure it would, but it might – only trying/tweaking it will tell.
But it's not clear what your desired results should be. What kinds of product names should be returned as most similar? Same brand? Same color? (If either of those, you could use a much simpler model than Doc2Vec training.) Same specs, including memory? (If so, you wouldn't want to throw away numeric info – instead you might want to canonicalize it into single tokens that are meaningful at the word-by-word level at which Doc2Vec works, such as turning "64 GB" into "64gb" or "2 GB RAM" into "2gbram".)
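A rough sketch of such canonicalization, using simple regex rules (the patterns here are illustrative and would need tuning against the real data):

```python
import re

def canonicalize(name):
    """Collapse multi-word spec phrases into single tokens so they survive
    word-level tokenization (regex rules here are illustrative, not complete)."""
    name = name.lower()
    name = re.sub(r'(\d+)\s*gb\s*ram', r'\1gbram', name)
    name = re.sub(r'(\d+)\s*mb\s*ram', r'\1mbram', name)
    name = re.sub(r'(\d+)\s*gb', r'\1gb', name)
    return name

print(canonicalize('Nokia 2 (Pewter/ Black, 8 GB)(1 GB RAM)'))
# → nokia 2 (pewter/ black, 8gb)(1gbram)
```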
As this isn't regular natural-language text, you likely have a very small, constrained vocabulary – perhaps a few thousand tokens, rather than the tens-to-hundreds-of-thousands in normal language. And each token may only appear in a small number of examples (a single producer's product line), and absolutely never appear alongside closely-related terms from similar competitive products (because product names don't mix proprietary names from competitors). These factors will also present a challenge for this sort of algorithm – which needs many varied, overlapping uses of words, and many words with fine shades of meaning, to gradually nudge vectors into useful arrangements. A small vocabulary may require a much smaller model (lower vector_size) to avoid overfitting. If you had a dataset which hinted at which products people consider comparable – either mentioned in the same reviews, or searched-for by the same people, or bought by the same people – you might want to create extra synthetic text examples which include those multiple products in the same text, so that the algorithm has a chance of learning such relationships.
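A minimal sketch of building such synthetic texts, assuming you already have pairs of products known to be comparable (the pairing source here is hypothetical):

```python
def synthetic_texts(pairs):
    """Merge the tokens of two products known to be comparable into one
    synthetic 'text', so their brand/spec words actually co-occur in training."""
    return [(a + ' ' + b).split() for a, b in pairs]

# hypothetical co-purchase data
co_bought = [('nokia 2 pewter 8gb', 'samsung galaxy j2 black 8gb')]
print(synthetic_texts(co_bought))
```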
Much Doc2Vec/Word2Vec work doesn't bother with removing stop-words, and may retain punctuation as standalone words.
You should show examples of what is actually in your "training_data.csv" file, to see what the algorithm is actually working with. Note that TaggedLineDocument wouldn't handle a real comma-separated-values file correctly – it expects just one text per line, already whitespace-delimited. (Any commas would be left in place, perhaps attached to field tokens.)
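One way around that is to parse each line yourself and yield (tag, tokens) pairs, which you can then wrap in gensim's TaggedDocument before training. A plain-Python sketch (the tokenization rule is an assumption):

```python
import csv
import re

def tagged_pairs(lines):
    """Yield (tag, tokens) pairs from raw lines of a product-name file.
    Commas and punctuation are stripped so a CSV-ish file still tokenizes
    cleanly; wrap each pair in gensim's TaggedDocument before training."""
    for i, row in enumerate(csv.reader(lines)):
        text = ' '.join(row).lower()
        yield i, re.findall(r'[a-z0-9]+', text)

# e.g.: with open('training_data.csv') as f:
#           docs = [TaggedDocument(words, [tag]) for tag, words in tagged_pairs(f)]
```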
Lowering min_count to 1 can often worsen results, because such rare tokens (with only one or a few occurrences) don't get good vectors, yet if there are a lot of them in aggregate (as there are in normal texts, though there might not be here) they can serve as training noise that degrades other vectors.
You don't need to change min_alpha, and in general you should only tinker with defaults if you're sure what they mean and have a rigorous, repeatable scoring process for testing whether changes are improving results or not. (In the case of achieving a good similarity measure, such a score might be a set of pairs of items that should be more similar to each other than either is to some third item. For each algorithm/parameter combination you try, how many such pairs are properly discovered as "more similar" than either item to the third?)
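Such a scoring process might be sketched like this, with a toy word-overlap similarity standing in for any model's similarity function (both the triplets and the similarity function here are illustrative placeholders):

```python
def triplet_accuracy(sim, triplets):
    """Fraction of (a, b, c) triplets where the model ranks the known-similar
    item b closer to a than the distractor c: sim(a, b) > sim(a, c)."""
    hits = sum(1 for a, b, c in triplets if sim(a, b) > sim(a, c))
    return hits / len(triplets)

def jaccard(a, b):
    """Toy word-overlap similarity, as a stand-in for a trained model."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

triplets = [
    ('nokia 2 black 8gb', 'nokia 2 pewter 8gb', 'tempered glass guard'),
    ('panasonic p91 black 16gb', 'panasonic eluga i2 16gb', 'sugar c6 white'),
]
print(triplet_accuracy(jaccard, triplets))  # → 1.0
```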
Inference, especially on short texts, may benefit from different parameters (such as more inference passes) – and the latest gensim release (3.5.0, July 2018) includes an important fix and adjustment of defaults for infer_vector(). So be sure to use that version, and test the improvement of supplying it a larger epochs value.
Overall, I'd suggest:
- being clear about what a good similarity result should be, with examples of most- and least-similar items
- using such examples to create a rigorous, automated evaluation of model quality
- preprocessing in a domain-sensitive way that preserves meaningful distinctions; trying to get/create texts that don't silo brand words into tiny single-product examples that hide potential cross-brand relationships
- not changing defaults unless you're sure it's helping
- enabling logging at the INFO level so you can see the progress of the algorithm and reporting of things like the effective vocabulary size
You still might not get great results, depending on what your real 'similarity' goal is – product names aren't the same sort of natural language that Doc2Vec works best on.
Another baseline to consider is just treating each product name as a 'bag of words', which gives rise to a one-hot vector marking which words (from the full vocabulary) it contains. The cosine similarity of these one-hot vectors (perhaps with extra term weighting) would be a simple measure, and would at least capture things like putting all 'black' items somewhat nearer each other, or all 'nokia' items, etc.
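A minimal sketch of that baseline (binary bag-of-words, no term weighting – the cosine of two binary vectors reduces to overlap over the geometric mean of set sizes):

```python
import math

def one_hot_cosine(a, b):
    """Cosine similarity of two product names treated as binary bags-of-words.
    Equivalent to |A ∩ B| / sqrt(|A| * |B|) on the underlying word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))

print(one_hot_cosine('nokia 2 pewter black 8gb', 'nokia 2 black 16gb'))
print(one_hot_cosine('nokia 2 pewter black 8gb', 'tempered glass guard'))  # → 0.0
```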
Upvotes: 13