Reputation: 21
Suppose I have two documents:
Document 1: Where can I buy this product1 in Paris? Document 2: Where can I buy this product2 in Paris?
Assume product1 and product2 are not in word2vec's vocabulary and I need to train my own word2vec model. Since the context is the same, will word2vec consider product1 and product2 as synonyms?
Will they have similar word embeddings? If so, how can I make them unrelated to each other? Should I use a doc2vec model in this case?
Upvotes: 0
Views: 881
Reputation: 2270
The concept behind word embeddings is that a word's context determines its meaning. If two words always occurred in exactly the same contexts, their embeddings would be identical (in practice this never happens). This works well for pretty much any word, except for names.
Names don't have a 'linguistic' meaning; their meaning is a pointer to something in the real world outside of language. Their context then depends on the use of that something in language: the name of a car brand is usually used in different contexts from coffee brands. "I'll drive my new X" works well with VW, but not so well with Lavazza. Hence they occur in different contexts and thus have a different meaning.
If the products are of the same kind (e.g. VW vs Mercedes), then their contexts will be the same. But they might also be subtly different: you wouldn't use language to boast about your new Skoda in the same way you would about your new Bentley. So the embeddings for "Skoda" and "Bentley" will be similar, but not identical. If there are essentially no differences in usage, though, the contexts, and thus the embeddings, will be the same. Incidentally, that is why people often confuse the names of their kids when they are young: you use the names in pretty much exactly the same contexts, so they're sometimes tricky to keep apart.
The solution to this dilemma is to find more data where product1 and product2 are used in different contexts. In your examples they are simply presented as something you want to buy in Paris. You need to find examples where they are used or repaired, or where they break; anything that differentiates them. No other context-based representation will be able to solve this for you without such data.
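The effect can be illustrated with a toy count-based sketch in plain Python (this is not word2vec itself, just a minimal distributional model; the window size and the extra example sentences are made up for illustration): identical contexts produce identical vectors, and adding differentiating sentences pulls the vectors apart.

```python
from collections import Counter
import math

def context_vectors(sentences, window=2):
    """Toy count-based distributional vectors: for each word,
    count the words appearing within `window` positions of it."""
    vecs = {}
    for sent in sentences:
        toks = sent.lower().split()
        for i, w in enumerate(toks):
            ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# Identical contexts -> identical vectors
corpus = [
    "where can i buy this product1 in paris",
    "where can i buy this product2 in paris",
]
v = context_vectors(corpus)
print(cosine(v["product1"], v["product2"]))  # 1.0

# Hypothetical extra data that differentiates the products
corpus += [
    "product1 broke down on the highway",
    "product2 tastes great with milk",
]
v = context_vectors(corpus)
print(cosine(v["product1"], v["product2"]))  # drops below 1.0
```

A trained word2vec model behaves analogously, with dense learned vectors instead of counts: with only the first two sentences the stochastic training would give product1 and product2 very similar (though not bit-identical) embeddings, and the extra differentiating sentences would push them apart.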
Upvotes: 1