Saif ali Karedia
Saif ali Karedia

Reputation: 912

How to keep the data consistent?

Suppose, I have an ArrayList aList1 with the following entries: Mohandas Gandhi, M Gandhi, Martin Luther King, Kim Jong, Abrahm Lincln, Kim Jng

Suppose, I have another ArrayList aList2 of all correct names. How can I match each item in aList1 with aList2?

I want the final output to be Mohandas Gandhi, Mohandas Gandhi, Martin Luther King, Kim Jong, Abraham Lincoln, Kim Jong.

The output should be without Spelling mistakes. How do I match the two words? If I can match two words, then I can use Edit distance to convert one word to another word.

I need to code this in Java.

Upvotes: 0

Views: 107

Answers (2)

Suparna
Suparna

Reputation: 1172

How to approach such problems?

While I think there may be a number of ways to do it, I think the simplest one is to determine the similarity or Fuzzy test. And using Levenshtein distance you can determine how much % you will need to change one string to convert to the other. You can then apply filters to determine which one is close. In my case, I have used apache commons StringUtils to give me the Edit Distance value.

EDIT: What kind of filtering criteria can be setup?

You have to setup your own rules for filtering.

  1. Ignore results where the percentage match is less than 50%.
  2. Treat the percentages as ordinary numbers, and sum them to create a "total match" between the search terms and document.

In the example code below, I have only used #1. Although, I did not ignore anything as the data points matched.

But imagine, "Money Laundering" and "Money Facts", and the user searches for "Mony Lawndaring".

Word 1 Word 2 Match

Mony Money 80%

Mony Laundering 10%

Lawndaring Money 10%

Lawndaring Laundering 80%

Total match for "Mony Lawndaring" against "Money Laundering" is 160.

Mony Lawndaring vs Money Facts

Word 1 Word 2 Match

Mony Money 80%

Mony Facts 0%

Lawndaring Money 10%

Lawndaring Facts 10%

Total match for "Mony Lawndaring" against "Money Facts" is 80.

Result

Then by simply ordering the search results by the highest total match, we can see the "Money Laundering" document appear before "Money Facts" in the search results list!

package algorithms;

import org.apache.commons.lang.*;
import java.util.ArrayList;
import java.util.Arrays;

public class LevenshteinDistance {

    public static double similarity(String s1, String s2) {
          String longer = s1, shorter = s2;
          if (s1.length() < s2.length()) { // longer should always have greater length
            longer = s2; shorter = s1;
          }
          int longerLength = longer.length();
          if (longerLength == 0) { return 1.0; }
          int distance = StringUtils.getLevenshteinDistance(longer, shorter);
          return (longerLength - distance) / (double) longerLength;
    }

    public static void main(String[] args) {
        ArrayList<String> aList1 = new ArrayList<String>(Arrays.asList("Mohandas Gandhi", "M Gandhi", "Martin Luther King", "Kim Jong", "Abrahm Lincln", "Kim Jng"));
        ArrayList<String> aList2 = new ArrayList<String>(Arrays.asList("Mohandas Gandhi", "Martin Luther King", "Kim Jong", "Abrahm Lincln"));
        ArrayList<String> output = new ArrayList<String>();

        for (String incorrect : aList1){
            for (String correct : aList2){
                if (similarity(incorrect, correct) > 0.5){
                    System.out.println("Match found with similarilty more than 50% between :" + incorrect + " from first list and " + correct + " from second list" );
                    output.add(correct);
                    continue;
                }
            }
        }

        System.out.println("Output: ");

        for (String out : output){
            System.out.println(out);
        }   
    }
}

Result:

Match found with similarilty more than 50% between :Mohandas Gandhi from first list and Mohandas Gandhi from second list

Match found with similarilty more than 50% between :M Gandhi from first list and Mohandas Gandhi from second list

Match found with similarilty more than 50% between :Martin Luther King from first list and Martin Luther King from second list

Match found with similarilty more than 50% between :Kim Jong from first list and Kim Jong from second list

Match found with similarilty more than 50% between :Abrahm Lincln from first list and Abrahm Lincln from second list

Match found with similarilty more than 50% between :Kim Jng from first list and Kim Jong from second list`

Final Output Array:

Mohandas Gandhi
Mohandas Gandhi
Martin Luther King
Kim Jong
Abrahm Lincln
Kim Jong

Upvotes: 1

Alex Hall
Alex Hall

Reputation: 36023

Something like this should get you started:

String[] incorrectNames = "Mohandas Gandhi, M Gandhi, Martin Luther King, Kim Jong, Abrahm Lincln, Kim Jng".split(", ");
String[] dictionary = "Mohandas Gandhi, Martin Luther King, Kim Jong, Abraham Lincoln".split(", ");

List<String> correctedNames = new ArrayList<>();
for (String incorrectName : incorrectNames) {
    int distance = Integer.MAX_VALUE;
    String closestMatch = null;
    for (String correctName : dictionary) {
        int currentDistance = levenshteinDistance(incorrectName, correctName);
        if (distance > currentDistance) {
            distance = currentDistance;
            closestMatch = correctName;
        }
    }
    correctedNames.add(closestMatch);
}

return correctedNames;

You will of course need an implementation of levenshteinDistance. Other caveats: the algorithm is O(m*n) where m is the size of the dictionary and n is the number of names to be corrected, and the Levenshtein distance might be too simple to do this well.

Upvotes: 2

Related Questions