Reputation: 993
I have a value such as Color and a list of Strings: {Colour, Color, Main Color, Main Colour, Theme, Brand, Subject, ... etc.}
I would like to get the most similar string, excluding the searched string itself. In this example I would expect to get Colour (not Color).
I rank candidates with the following rules, in priority order: (1) same token ignoring case and whitespace, (2) same tokens in a different order, (3) distance between words (acronym match or small Levenshtein distance), (4) starts-with, (5) contains.
It takes a lot of time to go over a list of 1000 relevant candidates, and I have many values to check.
Is there a more efficient way?
Original Code:
public static List<String> findSimilarity(String word, List<String> candidates) {
    List<String> recommendations = new ArrayList<>();
    if (!word.equals("")) {
        for (String candidate : candidates) {
            if (!word.equals(candidate)) {
                // 1. same token, lower/upper case, ignore whitespace
                if (StringUtils.deleteWhitespace(word).equalsIgnoreCase(StringUtils.deleteWhitespace(candidate))) {
                    recommendations.add(candidate);
                }
                // 2. same tokens, different order
                else if (candidate.split(" ").length == word.split(" ").length) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    boolean status = true;
                    SortIgnoreCase icc = new SortIgnoreCase();
                    Arrays.sort(candidatearr, icc);
                    Arrays.sort(wordarr, icc);
                    for (int i = 0; i < candidatearr.length; i++) {
                        if (!(candidatearr[i] == null ? wordarr[i] == null
                                                      : wordarr[i].equalsIgnoreCase(candidatearr[i]))) {
                            status = false;
                        }
                    }
                    if (status) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        // 3. distance between words
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    // check for acronym (one side is a single token, the other multi-token)
                    if ((wordarr.length == 1 && candidatearr.length > 1)
                            || (wordarr.length > 1 && candidatearr.length == 1)) {
                        String acronym = "";
                        if (wordarr.length > candidatearr.length) {
                            for (String tmp : wordarr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }
                            if (acronym.equalsIgnoreCase(candidatearr[0])) {
                                recommendations.add(candidate);
                            }
                        } else {
                            for (String tmp : candidatearr) {
                                if (!tmp.equals("")) {
                                    acronym = acronym + tmp.substring(0, 1);
                                }
                            }
                            if (acronym.equalsIgnoreCase(wordarr[0])) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }
        }
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    // allowed edit distance scales with the shorter string's length
                    String check = word.length() > candidate.length() ? candidate : word;
                    int dist;
                    if (check.length() <= 3) {
                        dist = 0;
                    } else if (check.length() <= 5) {
                        dist = 1;
                    } else {
                        dist = 2;
                    }
                    if (StringUtils.getLevenshteinDistance(word, candidate) <= dist) {
                        //if (Levenshtein.distance(word, candidate) <= dist) {
                        recommendations.add(candidate);
                    }
                }
            }
        }
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    String[] candidatearr = candidate.split(" ");
                    String[] wordarr = word.split(" ");
                    for (String cand : candidatearr) {
                        for (String wor : wordarr) {
                            if (cand.equals(wor) && cand.length() > 4) {
                                recommendations.add(candidate);
                            }
                        }
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
        // 4. low priority - starts with
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.startsWith(word) || word.startsWith(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
        // 5. low priority - contains word
        if (recommendations.size() == 0) {
            for (String candidate : candidates) {
                if (!word.equals(candidate)) {
                    if (candidate.contains(word) || word.contains(candidate)) {
                        recommendations.add(candidate);
                    }
                }
            }
            if (recommendations.size() > 4) {
                recommendations.clear();
            }
        }
    }
    return recommendations;
}
Thanks, M.
Upvotes: 0
Views: 808
Reputation: 2492
I wrapped the answer given by Bohemian into the context of your original code to make it easier to follow.
The line .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" ")))
splits multi-word terms, sorts the tokens, and joins them again, which eliminates permutations of the same words. This handles the permutation equality of terms like "main color" and "color main".
However, covering all the business requirements of your task is beyond the scope of this question. This answer gives you an outline of the solution, and the efficiency problem is addressed. You might need more stages in your pipeline, but that's a different story. The strength of the approach is that the stages are decoupled, so you can ask questions and seek help for each stage independently.
public static String findSimilarity(String word, List<String> candidatesList) {
    // Populate the set with distinct, normalized values of the input terms
    Set<String> candidates = candidatesList.stream()
            .map(String::toLowerCase)
            .map(term -> Arrays.stream(term.split(" ")).sorted().collect(Collectors.joining(" "))) // eliminates permutations
            .collect(Collectors.toSet());
    Map<String, Integer> cache = new ConcurrentHashMap<>();
    return candidates.parallelStream()
            .map(String::trim)
            // add more mappers if needed
            .filter(s -> !s.equalsIgnoreCase(word))
            // add more filters if needed
            .min((a, b) -> Integer.compare(
                    cache.computeIfAbsent(a, k -> getLevenshteinDistance(word, k)),
                    cache.computeIfAbsent(b, k -> getLevenshteinDistance(word, k))))
            .get(); // get the closest match
}
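To see what the permutation-normalizing `.map(...)` stage does on its own, here is a small standalone sketch (the `canonical` helper name is mine, not part of the answer):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class NormalizeDemo {
    // Canonical form: lowercase, split on spaces, sort the tokens, rejoin.
    static String canonical(String term) {
        return Arrays.stream(term.toLowerCase().split(" "))
                .sorted()
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Both permutations collapse to the same key, so the Set keeps only one.
        System.out.println(canonical("Main Color")); // color main
        System.out.println(canonical("Color Main")); // color main
    }
}
```

Because both spellings map to the same key, the `Collectors.toSet()` step deduplicates them before any distance is computed.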
Upvotes: 0
Reputation: 424973
Your problem is one of time complexity. Collections.sort() is an O(n log n) operation, and that is how many times the compare method is called. The problem is that Levenshtein is an "expensive" calculation.
You can improve sorting performance by calculating the distance exactly once for each item, making the Levenshtein work an O(n) operation, and then sorting on the stored distances.
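A minimal sketch of that compute-once-then-sort idea (the `levenshtein` method here is a plain two-row dynamic-programming implementation standing in for Commons Lang's `getLevenshteinDistance`, so the example is dependency-free):

```java
import java.util.*;

public class SortOnceDemo {
    // Minimal Levenshtein distance (two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String word = "color";
        List<String> candidates = Arrays.asList("colour", "theme", "brand");
        // Decorate: compute each distance exactly once (n expensive calls)...
        Map<String, Integer> dist = new HashMap<>();
        for (String c : candidates) dist.put(c, levenshtein(word, c));
        // ...then sort on the cached values (n log n cheap integer comparisons).
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(dist::get));
        System.out.println(sorted.get(0)); // colour
    }
}
```

The expensive function runs n times instead of roughly n log n times; the comparator only looks up cached integers.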
I did a test sorting lists of random integers across a variety of list sizes, and the actual number of times compare() was called was very close to n log2 n. So for a list of about 1000 Strings, this will be about 10 times faster, because log2(1000) is about 10.
You can further improve performance by not sorting at all, and instead just getting the minimum item using the same comparator.
Another improvement is to avoid the distinct() call (which is relatively expensive) by using a Set, which enforces uniqueness, to hold the candidates.
If you can, populate the candidates with values already trimmed and lowercased, so you avoid trimming and lowercasing on every run. Do the same to your input, so you can use equals() instead of the slower equalsIgnoreCase().
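A sketch of that pre-normalization, done once when the candidate set is populated (the `load` helper name is illustrative, not from the answer):

```java
import java.util.*;

public class PrenormalizeDemo {
    // Normalize once at load time so the hot path can use plain equals()/contains().
    static Set<String> load(List<String> raw) {
        Set<String> candidates = new HashSet<>();
        for (String s : raw) candidates.add(s.trim().toLowerCase());
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> candidates = load(Arrays.asList(" Colour ", "Main Color", "THEME"));
        // The search term gets the same treatment, so no equalsIgnoreCase() is needed.
        String search = " Colour ".trim().toLowerCase();
        System.out.println(candidates.contains(search)); // true
    }
}
```

The trim/lowercase cost is paid once per candidate at load time rather than on every search.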
Here's one way:
import static org.apache.commons.lang.StringUtils.getLevenshteinDistance;

String search; // your input
Set<String> candidates = new HashSet<>(); // populate this with lots of values
Map<String, Integer> cache = new ConcurrentHashMap<>();

String closest = candidates.parallelStream()
        .map(String::trim)
        .filter(s -> !s.equalsIgnoreCase(search))
        .min((a, b) -> Integer.compare(
                cache.computeIfAbsent(a, k -> getLevenshteinDistance(search, k)),
                cache.computeIfAbsent(b, k -> getLevenshteinDistance(search, k))))
        .get();
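One caveat, independent of performance: if the filter removes every candidate (for example, when the set only contains the search term itself), min() returns an empty Optional and get() throws NoSuchElementException. A dependency-free sketch of a safer ending (string length stands in for the Levenshtein comparator here just to keep the example self-contained):

```java
import java.util.*;

public class EmptyMinDemo {
    // Returns the minimal candidate different from search, or null if none remain.
    static String closest(String search, List<String> candidates) {
        return candidates.stream()
                .filter(s -> !s.equalsIgnoreCase(search))
                .min(Comparator.comparingInt(String::length))
                .orElse(null); // fallback instead of get(), which would throw
    }

    public static void main(String[] args) {
        System.out.println(closest("color", Arrays.asList("color")));           // null
        System.out.println(closest("color", Arrays.asList("color", "colour"))); // colour
    }
}
```

Using orElse() (or orElseThrow() with a descriptive exception) makes the empty case explicit instead of surfacing as a surprise at runtime.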
This code executes in about 50ms for 1000 random candidates, and in about 1 second for 1 million candidates.
Upvotes: 1