userit1985
userit1985

Reputation: 993

Efficient Way to find most similar value

I have a value such as Color, and a list of String : {Colour,Color,Main Color, Main Colour, Theme, Brand, Subject ..... etc}

I would like to get the most similar string , except the searched string itself. In this example would expect to get Colour. (NOT Color)

I am sorting the list I am using the following rules and ranked the rules :

  1. Filter the same value
  2. check upper lower cases
  3. delete whitespaces. trim
  4. using Levenshtein distance
  5. String order : Main color = Color Main
  6. check for acronym : HP - Hewlett Packard

It takes a lot of time to go over a list of 1000 relevant candidates. Moreover I have lots of candidates to check.

Any other efficient way?

Original Code:

public static List findSimilarity(String word, List candidates) {
    List recommendations = new ArrayList();
    if (!word.equals("")) {
        for (String candidate : candidates) {
            if (!word.equals(candidate)) { //1. same token , lower/upper cases , ignore white spaces
                if (StringUtils.deleteWhitespace(word).equalsIgnoreCase(StringUtils.deleteWhitespace(candidate))) {
                    recommendations.add(candidate);
                }
                //2. same tokens diff order
                else if (candidate.split(" ").length == word.split("     ").length) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    boolean status = true;
                    SortIgnoreCase icc = new SortIgnoreCase();
                    Arrays.sort(candidatearr, icc);
                    Arrays.sort(wordarr, icc);
                    for (int i = 0; i < candidatearr.length; i++) {
                        if (!(candidatearr[i] == null ? wordarr[i] == null : wordarr[i].equalsIgnoreCase(candidatearr[i])))
                            status = false;
                    }

                    if (status) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        //3. distance between words
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    //check for acronym
                    if ((wordarr.length == 1 && candidatearr.length > 1) || (wordarr.length > 1 && candidatearr.length == 1)) {
                        String acronym = "";
                        if (wordarr.length > candidatearr.length) {
                            for (String tmp : wordarr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }

                            if (acronym.equalsIgnoreCase(candidatearr[0])) {
                                recommendations.add(candidate);
                            }
                        } else {
                            for (String tmp : candidatearr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }

                            if (acronym.equalsIgnoreCase(wordarr[0])) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }
        }

        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    int dist = 0;
                    String check = "";
                    if (word.length() > candidate.length()) {
                        check = candidate;
                    } else {
                        check = word;
                    }
                    if (check.length() <= 3) {
                        dist = 0;
                    } else if (check.length() > 3 && check.length() <= 5) {
                        dist = 1;
                    } else if (check.length() > 5) {
                        dist = 2;
                    }

                    if (StringUtils.getLevenshteinDistance(word, candidate) <= dist) {
                        //if(Levenshtein.distance(word,candidate) <= dist){
                        recommendations.add(candidate);
                    }
                }
            }
        }

        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");

                    for (String cand : candidatearr) {
                        for (String wor : wordarr) {
                            if (cand.equals(wor) && cand.length() > 4) {
                                recommendations.add(candidate);

                            }
                        }
                    }
                }
            }//for
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }

        //4. low priority - starts with
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.startsWith(word) || word.startsWith(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }

        //5. low priority - contain word
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.contains(word) || word.contains(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
    }
    return recommendations;
}

Thanks, M.

Upvotes: 0

Views: 808

Answers (2)

nolexa
nolexa

Reputation: 2492

Edited

I wrapped the answer given by Bohemian into the context of your original code for your better understanding.

The line .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) splits multi-word terms, sorts, and joins again to eliminate permutations of same words. This is an answer to the permutational equality challenge of terms like "main color" and "color main".

However, it does not make sense to catch all the business requirements of your task in the context of this question. By this answer you have got an outline of the solution. The problem of efficiency is addressed. You might need more stages in your pipeline, but that's a different story. The strength of the approach is that all stages are detached, so you can ask questions and seek help for each stage independently.

public static String findSimilarity(String word, List<String> candidatesList) {

    // Populating the set with distinct values of the input terms
    Set<String> candidates = candidatesList.stream()
            .map(String::toLowerCase)
            .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) // eliminates permutations
            .collect(Collectors.toSet());

    Map<String, Integer> cache = new ConcurrentHashMap<>();

    return candidates.parallelStream()
            .map(String::trim)
                    // add more mappers if needed
            .filter(s -> !s.equalsIgnoreCase(word))
                    // add more filters if needed
            .min((a, b) -> Integer.compare(
                    cache.computeIfAbsent(a, k -> getLevenshteinDistance(word, k)),
                    cache.computeIfAbsent(b, k -> getLevenshteinDistance(word, k))))
            .get(); // get the closest match
}

Upvotes: 0

Bohemian
Bohemian

Reputation: 424973

Your problem is one of time complexity. Collections.sort() is an O(n log n) operation, and this is how many times the compare method is called. The problem is that Levenshtein is an "expensive" calculation.

You can improve sorting performance by finding a way to calculate it exactly once for each item making the Levenshtein calculation an O(n), operation, then sorting on the stored calculated distance.

I did a test using a variety of list sizes sorting Lists of random integers, and the actual number of times compare() was called was very close to n log2 n, so for a list of about 1000 Strings, it will be about 10 times faster, because log2(1000) is about 10.

You can further improve performance by not sorting, but by just getting the minimum item specifying the same comparator.

Another improvement is to avoid the distinct() call (which is relatively expensive), by using Set (which enforces uniqueness) to hold the candidates.

If you can, populate the candidates with values already trained and lowercased, so you avoid trimming and lowercasing and lowercase every run. Do the same your input, so you can use equals() instead of the slower equalsIgnoreCase().

Here's one way:

import static org.apache.commons.lang.StringUtils.getLevenshteinDistance;

String search; // your input
Set<String> candidates = new HashSet<>(); // populate this with lots of values
Map<String, Integer> cache = new ConcurrentHashMap<>();
String closest = candidates.parallelStream()
    .map(String::trim)
    .filter(s -> !s.equalsIgnoreCase(search))
    .min((a, b) -> Integer.compare(
      cache.computeIfAbsent(a, k -> getLevenshteinDistance(search, k)),
      cache.computeIfAbsent(b, k -> getLevenshteinDistance(search, k))))
    .get();

This code executes in about 50ms for 1000 random candidates, and in about 1 second for 1 million candidates.

Upvotes: 1

Related Questions