Reputation: 593
I am trying to create an application that computes the similarity between two strings. The strings are not long: three sentences at most. I did some research and came across a few possible approaches.
The first uses a bag of words: count the words and compare the two resulting vectors with cosine similarity (see the sketch after this list).
The second uses TF-IDF and compares the resulting vectors.
The third uses word2vec and compares the vectors.
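For reference, here is a minimal sketch of the first two approaches using scikit-learn (assuming scikit-learn is acceptable; the sentences and default vectorizer settings are just illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "The cat sat on the mat."
s2 = "A cat was sitting on the mat."

# Approach 1: bag of words + cosine similarity
bow = CountVectorizer().fit_transform([s1, s2])
print("BoW cosine:", cosine_similarity(bow[0], bow[1])[0, 0])

# Approach 2: TF-IDF + cosine similarity
tfidf = TfidfVectorizer().fit_transform([s1, s2])
print("TF-IDF cosine:", cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```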
Now for the questions.
Performance-wise, does word2vec work better than TF-IDF for short sentences?
What is the best way to train a word2vec model? Should I use a large amount of text (a Wikipedia dump, for example), or train it using just the sentences being compared?
How do I get sentence similarity from word2vec? Should I average the word vectors in each sentence, or is there a better solution?
Upvotes: 1
Views: 2351
Reputation: 1041
With good training data, word2vec should give better performance (I got good results from it).
You need a large amount of data to get a good model. The best option is to use pre-trained vectors if you are working in English; there are good models at this link you can use. The Google News pre-trained model works very well as far as I know.
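A minimal sketch of loading the Google News vectors with gensim (assuming gensim is installed and the `.bin` file has already been downloaded to the working directory):

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News word vectors (about 3.4 GB on disk)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Word-to-word similarity from the pre-trained vectors
print(model.similarity("car", "automobile"))
```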
It is common to average the word vectors over a piece of text such as a sentence. A better option can be a weighted average, e.g. a tf-idf-weighted average. There is also active research on semantic textual similarity, which you can follow from its Wikipedia page.
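A minimal sketch of the plain averaging approach (assuming `model` is the KeyedVectors object loaded above; tokenization here is a naive lowercase split, and out-of-vocabulary words are simply skipped):

```python
import numpy as np

def sentence_vector(sentence, model):
    # Keep only words the model knows; average their vectors
    words = [w for w in sentence.lower().split() if w in model]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model[w] for w in words], axis=0)

def sentence_similarity(s1, s2, model):
    # Cosine similarity between the two averaged sentence vectors
    v1, v2 = sentence_vector(s1, model), sentence_vector(s2, model)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

print(sentence_similarity("The cat sat on the mat.",
                          "A cat was sitting on the mat.", model))
```

For the tf-idf-weighted variant, you would multiply each word vector by its tf-idf weight before averaging instead of weighting all words equally.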
Upvotes: 2