LonsomeHell

Reputation: 593

String similarity: TF-IDF, bag of words, or word2vec?

I am trying to create an application that computes the similarity between two strings. The strings are short, three sentences at most. I did some research and came across a few possible solution paths.

The first uses bag of words: count the words in each string and compare the two resulting vectors with cosine similarity.
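A minimal sketch of this first approach, using only the standard library (the whitespace tokenizer and lowercasing are simplifying assumptions; a real application would use proper tokenization):

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity between two strings using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two strings share
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(bow_cosine("the cat sat", "the cat ran"))
```

Identical strings score 1.0, strings with no words in common score 0.0.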

The second uses TF-IDF weights and compares the resulting vectors.
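The TF-IDF variant only changes how the vectors are built: each raw count is multiplied by an inverse-document-frequency weight, so words that appear in every string contribute less. A small illustrative implementation (the add-one smoothing mirrors scikit-learn's default, but this is a toy sketch, not that library's API):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build smoothed TF-IDF vectors for a small list of tokenized docs."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    vocab = sorted(df)
    # Smoothed idf: rare words get higher weight than common ones
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([tf[w] * idf[w] for w in vocab])
    return vocab, vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["the cat sat".split(), "the cat ran fast".split()]
_, (u, v) = tfidf_vectors(docs)
print(cosine(u, v))
```

Compared to plain bag of words, the shared word "the" now counts for less than the distinguishing words "sat" and "ran".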

The third uses word2vec embeddings and compares the resulting vectors.

Now for the questions.

Performance-wise, does word2vec work better than TF-IDF for short sentences?

What is the best way to train a word2vec model? Should I use a large corpus (a Wikipedia dump, for example), or train it on just the sentences being compared?

How do I get sentence similarity from word2vec? Should I average the word vectors in each sentence, or is there a better solution?

Upvotes: 1

Views: 2351

Answers (1)

Mahmood Kohansal

Reputation: 1041

  • With good training data, word2vec should perform better. (I got good results with it.)

  • You need a large amount of data to train a good model. The best approach is to use pre-trained vectors if you are working with English. There are good models at this link you can use; as far as I know, the Google News pre-trained model works very well.

  • It is common to average the word vectors over a piece of text such as a sentence. A better option can be a weighted average, e.g. with TF-IDF weights. There is also active research on semantic textual similarity, which you can follow from its Wiki Page.
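The averaging idea above can be sketched as follows. The tiny 3-dimensional embedding table is a stand-in for real word2vec vectors (real ones are typically 100 to 300 dimensions and come from a trained model such as the Google News vectors); the optional `weights` argument is where TF-IDF weights would plug in:

```python
import math

# Toy 3-d embeddings standing in for real word2vec vectors (illustrative only)
EMB = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "sat": [0.1, 0.9, 0.2],
    "ran": [0.2, 0.8, 0.3],
}

def sentence_vector(tokens, weights=None):
    """Weighted average of the word vectors (uniform weights by default)."""
    weights = weights or {}
    dim = len(next(iter(EMB.values())))
    total, wsum = [0.0] * dim, 0.0
    for w in tokens:
        if w in EMB:  # skip out-of-vocabulary words
            wt = weights.get(w, 1.0)
            total = [t + wt * x for t, x in zip(total, EMB[w])]
            wsum += wt
    return [t / wsum for t in total] if wsum else total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

s1 = sentence_vector("cat sat".split())
s2 = sentence_vector("dog ran".split())
print(cosine(s1, s2))
```

Because "cat"/"dog" and "sat"/"ran" have similar embeddings, the two sentences score high even though they share no words, which is exactly what bag of words and TF-IDF cannot capture.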

Upvotes: 2
