We are Borg

Reputation: 5313

Java: Probabilistic text matching, detecting what percentage of the text matches

I am working on a Java application that has search functionality. The search currently uses wildcards, so if someone searches for "Hello Kitty", they will also get results for "kit", "hell", "hello", etc. After the search, I assign scores to the results based on their clicks. But how can I compare a result with the search text to decide whether it is a 100% match or an 80% match? For example, "Hello Kit" is almost a match for "hello kitty". Is there any way to do this?

Search code:

Directory directory = FSDirectory.open(path);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Match any term in "contents" that contains the search string.
Query query = new WildcardQuery(new Term("contents", "*" + str + "*"));
TopDocs topDocs = indexSearcher.search(query, 1000);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document document = indexSearcher.doc(scoreDoc.doc);
    IndexableField value = document.getField("score");
    // numericValue() returns a Number, so convert it instead of casting to Integer.
    Number stored = (value != null) ? value.numericValue() : null;
    // Documents without a stored numeric score default to 0.
    sortedMap.put(Integer.valueOf(document.get("id")), stored != null ? stored.intValue() : 0);
}
indexSearcher.getIndexReader().close();
directory.close();

Thank you.

Upvotes: 1

Views: 1280

Answers (1)

Mark Kvetny

Reputation: 674

Sounds like you're looking for Dice's coefficient. Here's a Java implementation:

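import java.util.HashSet;
import java.util.Set;

// Dice's coefficient over the sets of character bigrams of s1 and s2.
public static double diceCoefficient(String s1, String s2)
{
    Set<String> nx = new HashSet<String>();
    Set<String> ny = new HashSet<String>();

    // Collect every pair of adjacent characters (bigram) in s1.
    for (int i = 0; i < s1.length() - 1; i++) {
        char x1 = s1.charAt(i);
        char x2 = s1.charAt(i + 1);
        String tmp = "" + x1 + x2;
        nx.add(tmp);
    }
    // Same for s2.
    for (int j = 0; j < s2.length() - 1; j++) {
        char y1 = s2.charAt(j);
        char y2 = s2.charAt(j + 1);
        String tmp = "" + y1 + y2;
        ny.add(tmp);
    }

    // Strings shorter than two characters have no bigrams; avoid 0/0 (NaN).
    if (nx.isEmpty() && ny.isEmpty()) {
        return s1.equals(s2) ? 1.0 : 0.0;
    }

    // Bigrams that occur in both strings.
    Set<String> intersection = new HashSet<String>(nx);
    intersection.retainAll(ny);
    double totcombigrams = intersection.size();

    return (2 * totcombigrams) / (nx.size() + ny.size());
}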

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient#Java

The algorithm compares the sets of adjacent character pairs (bigrams) in the two strings and assigns the pair a number from 0 to 1; the higher the number, the more similar they are. Multiply it by 100 and you have the percentage match you're asking for.
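To make that concrete for the strings in the question, here is a minimal sketch. The class name DiceDemo and the helpers bigrams and matchPercent are made up for illustration, and lower-casing both inputs is an assumption on my part: the implementation above is case-sensitive, so "Hello Kit" against "hello kitty" would score lower there because "He" and "he" count as different bigrams.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class DiceDemo {

    // Collect the distinct two-character substrings (bigrams) of a string.
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < s.length() - 1; i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // Dice's coefficient over bigram sets, scaled to a percentage (0..100).
    // Assumption: case is ignored, so both inputs are lower-cased first.
    static double matchPercent(String query, String candidate) {
        Set<String> a = bigrams(query.toLowerCase(Locale.ROOT));
        Set<String> b = bigrams(candidate.toLowerCase(Locale.ROOT));
        if (a.isEmpty() && b.isEmpty()) {
            // Neither string has a bigram; fall back to plain equality.
            return query.equalsIgnoreCase(candidate) ? 100.0 : 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 100.0 * 2 * common.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        // 8 shared bigrams out of 8 + 10: 2 * 8 / 18
        System.out.printf(Locale.ROOT, "%.1f%n", matchPercent("Hello Kit", "hello kitty"));   // prints 88.9
        System.out.printf(Locale.ROOT, "%.1f%n", matchPercent("hello kitty", "Hello Kitty")); // prints 100.0
    }
}
```

In your search loop you could call something like matchPercent(str, text) for each hit, where text is whatever stored field holds the matched text, and combine that with your click score however suits your ranking.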

Upvotes: 3
