yazan sayed
yazan sayed

Reputation: 1139

Algorithm for text similarity search

say I have String x= "algorithm" , and Stringy= "mgth"

String x contains all the letters in String y, I can split String y into a list of letters, and loop through this list to see if String x contains the letter y[index] ,, but I wonder If there's a more efficient way

Edit:

in kotlin there's a simple intersect function,, for example:

val x="algorithm".toList()
val y="mgth".toList()
val interesct=x.intersect(y) //returns a Set of matching chars

if (y.size == interesct.size){
    println("match")
}

Upvotes: 0

Views: 604

Answers (3)

Hasasn
Hasasn

Reputation: 908

Try this one

@Test
public void similarity() {
  String x = "algorithm";
  String y = "mgth";
  final boolean ohYes =
      y.chars().filter(yc -> x.chars().anyMatch(xc -> yc == xc)).count() == y.length();
  Assert.assertTrue(ohYes);
}

Upvotes: 1

Ishaan Javali
Ishaan Javali

Reputation: 1713

There is a more efficient way by using a Set.

String x = "algorithm";
String y = "mgth";
Set<Character> set = new HashSet<>();

for(char c: y.toCharArray())
   set.add(c);
for(char c: x.toCharArray())
   set.remove(c);

if(set.size() == 0) 
    System.out.println("X contains Y");
else 
    System.out.println("X does not contain Y");

What the above code does is add the characters in the smaller String to the set. Then, it removes each of the characters in the larger String.

If there are any leftover characters in the Set, then that means that the smaller String contained a letter that was not in the larger String.

Upvotes: 1

uptoyou
uptoyou

Reputation: 1797

Regexp for the rescue:

    String pattern = "mgth".chars()
            .mapToObj(ch -> "(?=.*" + (char) ch + ")")
            .collect(Collectors.joining());

    // ".*(?=.*m)(?=.*g)(?=.*t)(?=.*h).*"
    boolean matches = Pattern.compile(".*"+pattern+".*")
            .matcher("algorithm")
            .matches();

    System.out.println(matches);

This will match only if "algorithm" contains all characters in generated pattern from target string.

EDIT

Also, you can sort both strings, and perform comparison only in interval of [min("mgth"), max("mgth")] char values.

Upvotes: 1

Related Questions