Jonathan Grey

Reputation: 51

How to go from word similarity to overall sentence similarity

I have implemented a sentence similarity method using WS4J.

I have read articles about sentence similarity that is based on the similarity of the words in the two sentences. But I couldn't find a method that computes and returns a single value for the overall sentence similarity based on the word similarities.

A similar question was asked on this website: sentence-similarity-using-ws4j

As you can see, my WS4J code currently returns a match message whenever a word in sentence A finds a synset match in the other sentence with a similarity value above 0.9. But I guess this is not a good approach.

I found the article by Yuhua Li et al. [2] very useful, but I cannot figure out the method they used for overall sentence similarity.

public static String sentenceSim(String se1, String se2, RelatednessCalculator rc) {
        String similarityMessage = "";
        String similarityMessage2 = "";

        if (se1 == null || se2 == null) {
            return "null";
        }

        if (nlp == null) {
            nlp = OpenNLPSingleton.INSTANCE;
        }
        // long t00 = System.currentTimeMillis();
        String[] words1 = nlp.tokenize(se1); // base
        String[] words2 = nlp.tokenize(se2); // sentence
        String[] postag1 = nlp.postag(words1);
        String[] postag2 = nlp.postag(words2);


        String u = "";
        int matchCount = 0;     

        int counter = 0;
        String mLC = rc.toString().toLowerCase();
        for (int j = 0; j < words2.length; j++) { // sentence
            String pt2 = postag2[j];
            String w2 = MorphaStemmer.stemToken(words2[j].toLowerCase(), pt2);
            POS p2 = mapPOS(pt2);
            // System.out.print(words2[j]+"(POS "+pt2+")");
            for (int i = 0; i < words1.length; i++) { // base
                String pt1 = postag1[i];
                String origWord1 = words1[i];
                String origWord2 = words2[j];
                String w1 = MorphaStemmer.stemToken(words1[i].toLowerCase(), pt1);
                POS p1 = mapPOS(pt1);
                String popup = mLC + "( " + w1 + "#" + (p1 != null ? p1 : "INVALID_POS") + " , " + w2 + "#"
                        + (p2 != null ? p2 : "INVALID_POS") + ")";
                String dText;
                // boolean acceptable = rc.getPOSPairs().isAcceptable(p1, p2);

                // ALL WORDS FROM BASE HAS TO MATCH - IF ONE DOESNT,
                // THEN ITS NOT MATCH
                double d = -1;
                if (p1 != null && p2 != null) {//
                    double r = wordSim(w1, w2, rc);
                    if (r > 0.9) {
                        matchCount++;
                        similarityMessage += "\t\t Similarity Found (Base : sentence) ('Base Word: " + origWord1 + "=" + w1 + " "
                                + p1 + "', Sentence Word: '" + origWord2 + "=" + w2 + " " + p2 + "') =  " + r + "\n";
                        System.out.println(similarityMessage);
                    }
                }
            }
            // System.out.println();
        }

        // Report success only when the number of matches equals the number of base words
        if (matchCount == words1.length) {
            similarityMessage2 = "\t\tFound all matches for base in sentence: ";
            System.out.println("\t\tBase " + se1);
            System.out.println(similarityMessage2);
            System.out.println(similarityMessage);
        }
        // Return the accumulated match messages (previously cleared just before the return,
        // so the method always returned an empty string)
        return similarityMessage;
    }

My code is written in Java, so I am looking for Java implementations.

[2]: Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138-1150.

Upvotes: 0

Views: 1063

Answers (1)

Nadeeshaan

Reputation: 376

There are several approaches to calculating sentence similarity, and the right one depends on your use case and requirements. One well-known method considers only the syntactic units that contribute most to the meaning of a sentence (e.g., verbs, nouns, adverbs, and adjectives). Another is the vector space model, which represents each sentence as a vector and compares the two vectors; it can be quite accurate, and there are many resources on it.
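
To make both ideas concrete, here is a minimal sketch in Java. It is not the method from Li et al. [2]; it only shows two simple ways to get a single number. The class name SentenceSimilaritySketch and its method names are made up for illustration. The first method averages, for every word, the best word-level score found in the other sentence; it assumes the word arrays have already been reduced to content words and stemmed, and that the word measure returns values in [0, 1]. The second method is the cosine of raw term-frequency vectors.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper class; names are illustrative, not part of WS4J.
final class SentenceSimilaritySketch {

    /** Average, over the words of a, of the best word similarity found in b. */
    static double directedBestMatch(String[] a, String[] b,
                                    ToDoubleBiFunction<String, String> wordSim) {
        if (a.length == 0 || b.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (String wa : a) {
            double best = 0.0;
            for (String wb : b) {
                best = Math.max(best, wordSim.applyAsDouble(wa, wb));
            }
            sum += best;
        }
        return sum / a.length;
    }

    /** Symmetric score: mean of the two directed scores. */
    static double bestMatchAverage(String[] a, String[] b,
                                   ToDoubleBiFunction<String, String> wordSim) {
        return 0.5 * (directedBestMatch(a, b, wordSim) + directedBestMatch(b, a, wordSim));
    }

    /** Vector space model: cosine of the term-frequency vectors of a and b. */
    static double cosine(String[] a, String[] b) {
        Map<String, Integer> fa = counts(a);
        Map<String, Integer> fb = counts(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            dot += (double) e.getValue() * fb.getOrDefault(e.getKey(), 0);
        }
        double na = norm(fa);
        double nb = norm(fb);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Counts how often each word occurs.
    private static Map<String, Integer> counts(String[] words) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : words) {
            m.merge(w, 1, Integer::sum);
        }
        return m;
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> m) {
        double s = 0.0;
        for (int c : m.values()) {
            s += (double) c * c;
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        String[] s1 = {"dog", "chase", "cat"};
        String[] s2 = {"cat", "chase", "mouse"};
        // Stand-in word measure (exact match only), used so the sketch runs without WordNet.
        ToDoubleBiFunction<String, String> exact = (x, y) -> x.equals(y) ? 1.0 : 0.0;
        System.out.println(bestMatchAverage(s1, s2, exact));
        System.out.println(cosine(s1, s2));
    }
}
```

With the code from the question, the stand-in measure could presumably be replaced by something like (x, y) -> wordSim(x, y, rc), provided the chosen RelatednessCalculator is one whose scores stay within [0, 1]; unbounded measures would need to be normalized first.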

Upvotes: 1
