Ariel Netz
Ariel Netz

Reputation: 111

For input string with multiple words - what is the most efficient way to check if any of them start with some other string?

I need to implement a java method that gets set of strings and input string, and returns a subset of the strings, containing all strings from the original set that has any word starts with the input string. For example, if a string is "Stack Overflow", and the input is "Over", it should be in the subset. But if a string is "Stack Overflow", and the input is "flow, it should not be in the subset.

public Set<String> findMatches (Set<String> names, String input);

Since the set size is huge (100 milions) I need to do this in the most efficient way. Three ways I tried so far came with confusing results:

  1. Split each string by a blank space and get array of strings, and then, on each of the items in the array - invoke String's startsWith method.
  2. For each string, check if it starts with the input, of contains " " + input (blank space followed by the input).
  3. Regex.

I tested these methods and measured times, but surprisingly - for different input values (the set of strings and the input string) - I got different results (option 1 got the best results in most cases, but very close to the other options results).

So which one will be the most efficient one? Is there another option I haven't thought of?

Upvotes: 0

Views: 506

Answers (2)

ardenit
ardenit

Reputation: 3890

The data structure you need is trie.

In this explanation I mean that t_i are small strings that should be prefixes of words, and s is the large string that contains many words separated with whitespaces.

Just add all t_i in the trie. Then iterate through s characters:

  • If you meet a whitespace, go to the root of trie.

  • If you meet a letter, go from the current trie node to its child, associated with this letter. If there is no path, just skip all letters until you meet the next whitespace. If you reach a node that is linked to one of t_i, add that string to answer.

This algorithm works in O(sum(length(t_i)) + length(s)). I can write some code if needed.

All your algorithms and algorithm suggested by @DudeDoesThings work in O(sum(length(t_i)) * length(s)) which is far slower, especially when it comes to large inputs.

Upvotes: 4

DudeDoesThings
DudeDoesThings

Reputation: 729

If you indeed have many millions of strings and need efficiency I would advice against using either split or regexes. Perhaps you want to look into the Stream API, particularily the parallel streams if computation speed is what you care about:

public static void main(String[] args) {
    Set<String> s = Arrays.stream(new String[] {
        "Stack Overflow",
        "Flowover Stack",
        "Overflow Stack",
        "Stackover Flow"
    }).collect(Collectors.toSet());
    System.out.println(findMatches(s, "Over"));
}

public static Set<String> findMatches (Set<String> names, String input) {
    int inputLength = input.length();
    return names.stream().parallel().filter(name -> {
        int offset = 0;
        while (offset >= 0 && offset + inputLength < name.length()) {
            if (name.startsWith(input, offset)) {
                return true;
            }
            offset = name.indexOf(" ", offset);
            if (offset != -1) {
                offset++;
            }
        }
        return false;
    }).collect(Collectors.toSet());
}

Upvotes: 2

Related Questions