user12907213
user12907213

Reputation:

How to determine if two sentences talk about similar topics?

I would like to ask you a question. Is there any algorithm/tool which can allow me to do some association between words? For example: I have the following group of sentences:

(1)
    "My phone is on the table"
    "I cannot find the charger". # no reference on phone
(2) 
    "My phone is on the table"
    "I cannot find the phone's charger". 

What I would like to do is to find a connection, probably a semantic connection, which can allow me to say that the first two sentences are talking about a topic (phone) as two terms (phone and charger) are common within it (in general). Same for the second sentence. I should have something that can connect phone to charger, in the first sentence. I was thinking of using Word2vec, but I am not sure if this is something that I can do with it. Do you have any suggestions about algorithms that I can use to determine similarity of topics (i.e. sentence which are formulated in a different way, but having same topic)?

Upvotes: 7

Views: 3957

Answers (2)

Ural Bayhan
Ural Bayhan

Reputation: 63

This type of task is called sentence similarity, or more generally semantic textual simialirty. You might use a few different approaches for this type of task. On paperswithcode you can find benchmarks and the current state of the art.

First you can look at the ratio of shared words. Jaccard index is probably the simplest metric you can use for this. If you model both sentences as sets of words the jaccard index is the size of the intersection divided by the size of the union of these two sets.

Another way is to turn these sentences into vectors by counting the words and using cosine similarity to measure how closely they are related.

But not every word is equally important. To use this in your computations you can use a weighting scheme such as term frequency - inverse document frequency (TF-IDF) or BM25, which in their essence assign a greater weight to more important words. They measure the importance of the words by looking at how frequently they appear in all documents in your corpus.

You can improve these methods by only using entities mentioned in the text. In your example that would be I, phone, table and the charger. You can use spaCy or stanza for finding the entities if you are using python.

If you are using word embeddings such as word2vec, glove, or fasttext, you can take the average of the word vectors and use it as a vector for the whole sentence. Then you can again use cosine similarity.

Or on the more complicated side using word embeddings, you can use word mover's distance to measure the distance between two collections of word vectors.

There are also neural models for sentence similarity. Using transformer models is currently the state of the art for this kind of problem, as we can see on STSBenchmark a BERT-based transformer model is currently in the first place. This type of models usually need a lot of computation power to work, but you don't have to train each model from scratch, you can just download a model and use it right away.

There are many more methods for this probably. Here is a recent survey on semantic similarity methods.

Upvotes: 2

vukojevicf
vukojevicf

Reputation: 827

In Python I'm pretty sure you have a Sequence Matcher that you can usee

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

If you want your own algorithm I would suggest a Levenstains Distance (it calculates how many operations you need to turn one string(sentance) into another. Might be usefull.). I coded it myself in like this for two strings

    edits = [[x for x in range(len(str1) + 1)] for y in range(len(str2)+ 1)]
    for i in range(len(str2) + 1):
        edits[i][0] = i
    for i in range(1, len(str2) + 1):
        for j in range(1,  len(str1) + 1):
            if str2[i-1] == str1[j-1]:
                edits[i][j] = edits[i-1][j-1]
            else:
                edits[i][j] = 1 + min(edits[i-1][j-1], edits[i-1][j],
                                     edits[i][j-1])
    return edits[-1][-1]

[EDIT] For you, you want to compare if the sentances are about the similar topic. I would suggest any of the following algorithms (all are pretty easy)

  1. Jaccary Similarity
  2. K-means and Hierarchical Clustering Dendrogram
  3. Cosine Similarity

Upvotes: 4

Related Questions