Ben

Reputation: 109

Keras for finding sentence similarities from pre-trained word2vec

I have a pre-trained word2vec model from gensim. Using gensim to find similarities between words works as expected, but I am having trouble finding similarities between two different sentences. Plain cosine similarity is not a good option for sentences and does not give good results. Soft cosine similarity in gensim gives slightly better results, but it still does not look good.
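For reference, the cosine baseline I tried looks roughly like this: average the word vectors of each sentence and compare the averages. A minimal sketch (the toy vectors here are made up for illustration; in practice they come from the pre-trained word2vec model):

```python
import numpy as np

# Toy word vectors standing in for a pre-trained word2vec model
# (words and values here are invented for illustration only).
vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "sat": np.array([0.1, 0.9, 0.2]),
    "ran": np.array([0.2, 0.8, 0.3]),
}

def sentence_vector(sentence):
    """Average the vectors of the in-vocabulary words."""
    words = [w for w in sentence.split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(sentence_vector("cat sat"), sentence_vector("dog ran"))
```

This is exactly the approach that tends to wash out word order and sentence structure, which is why it scores poorly on sentence similarity.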

I found WMD similarity in gensim. It is a bit better than soft cosine and cosine.

I am wondering whether there are more options, such as using deep learning with Keras and TensorFlow, to find sentence similarities from pre-trained word2vec. I know classification can be done with word embeddings, and that seems like a somewhat better option, but then I would need to find training data and label it from scratch.

So I am wondering if there is any other option that can use pre-trained word2vec in Keras to get sentence similarities. Is there a way? I am open to any suggestions and advice.

Upvotes: 0

Views: 811

Answers (1)

J. Ferrarons

Reputation: 630

Before reinventing the wheel, I'd suggest trying the doc2vec method from gensim; it works quite well and is easy to use.

To implement it in Keras, reusing the embeddings you have computed with gensim:

  1. Store the word embeddings in a file, one word per line with its corresponding embedding. Alternatively, you can do as @Paul suggested: skip step 2 and reuse the layer in step 3.
  2. Load the word embeddings into a Keras Embedding layer. You can check out this Keras tutorial for more details (see how the embedding_layer variable is initialized).
  3. Then a sequence-to-sequence model can be used to compute an embedding of the text: an encoder embeds the string, and a decoder converts the embedding back to a string. Here is a Keras tutorial that translates from English to French. You can use a similar process to "translate" your text into itself and take the internal embedding as your similarity representation.
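The core of step 2 is building a weight matrix from the gensim vectors. A minimal sketch (the words, vectors, and the `word_index` mapping are made up for illustration; in practice they come from your gensim model and your tokenizer):

```python
import numpy as np

# Hypothetical pre-trained vectors keyed by word (e.g. taken from
# gensim's KeyedVectors); values here are invented for illustration.
word2vec = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
}
embedding_dim = 3

# Keras-style word index: word -> integer id, reserving 0 for padding.
word_index = {"cat": 1, "dog": 2}

# Build the weight matrix row by row; out-of-vocabulary words stay zero.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vec = word2vec.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

# In Keras this matrix would initialize a frozen Embedding layer:
# embedding_layer = keras.layers.Embedding(
#     input_dim=embedding_matrix.shape[0],
#     output_dim=embedding_dim,
#     weights=[embedding_matrix],
#     trainable=False,
# )
```

Setting `trainable=False` keeps the pre-trained vectors fixed while the rest of the model trains on top of them.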

You can also have a look at how the paragraph-to-vector model works; you can implement it in Keras as well, loading the word-embedding weights that you have computed.

Upvotes: 1
