For input string with multiple words - what is the most efficient way to check if any of them start with some other string?

Question

I need to implement a Java method that takes a set of strings and an input string, and returns the subset containing every string in which any word starts with the input string. For example, if a string is "Stack Overflow" and the input is "Over", it should be in the subset. But if a string is "Stack Overflow" and the input is "flow", it should not be in the subset.

public Set<String> findMatches(Set<String> names, String input);

Since the set is huge (100 million strings), I need to do this in the most efficient way. The three approaches I have tried so far gave confusing results:

1. Split each string on blank spaces to get an array of words, then invoke String's startsWith method on each item in the array.
2. For each string, check whether it starts with the input, or contains " " + input (a blank space followed by the input).
3. Regex.
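
For reference, the three options above might look roughly like this. It is only a minimal sketch: the class name WordPrefixApproaches and the method names are hypothetical, words are assumed to be separated by single spaces, and the regex variant assumes the pattern is compiled once and reused for every string.

```java
import java.util.regex.Pattern;

class WordPrefixApproaches {

    // Option 1: split on a single space and test each word.
    static boolean viaSplit(String name, String input) {
        for (String word : name.split(" ")) {
            if (word.startsWith(input)) {
                return true;
            }
        }
        return false;
    }

    // Option 2: match at the very start, or right after a space.
    static boolean viaContains(String name, String input) {
        return name.startsWith(input) || name.contains(" " + input);
    }

    // Option 3: regex. Pattern.quote keeps the input literal, so
    // characters such as '.' or '*' in the input are not treated as regex syntax.
    static Pattern compile(String input) {
        return Pattern.compile("(^| )" + Pattern.quote(input));
    }

    static boolean viaRegex(String name, Pattern pattern) {
        return pattern.matcher(name).find();
    }
}
```

All three are case-sensitive, so "over" would not match "Stack Overflow".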

I tested these approaches and measured their running times, but surprisingly the results varied with the input values (the set of strings and the input string). Option 1 was the fastest in most cases, but only by a small margin over the other two.

So which one is the most efficient? Is there another option I haven't thought of?

DudeDoesThings · Accepted Answer

If you really do have many millions of strings and need efficiency, I would advise against using either split or regexes. You may want to look into the Stream API, particularly parallel streams, if computation speed is what you care about:

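import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public static void main(String[] args) {
    Set<String> s = Arrays.stream(new String[] {
        "Stack Overflow",
        "Flowover Stack",
        "Overflow Stack",
        "Stackover Flow"
    }).collect(Collectors.toSet());
    // Matches "Stack Overflow" and "Overflow Stack" (set iteration order is unspecified)
    System.out.println(findMatches(s, "Over"));
}

public static Set<String> findMatches(Set<String> names, String input) {
    int inputLength = input.length();
    return names.stream().parallel().filter(name -> {
        int offset = 0; // index at which the current word starts
        // <= (not <) so that a word ending exactly at the end of the string can still match
        while (offset >= 0 && offset + inputLength <= name.length()) {
            if (name.startsWith(input, offset)) {
                return true;
            }
            // Move to the character after the next space, i.e. the start of the next word
            offset = name.indexOf(' ', offset);
            if (offset != -1) {
                offset++;
            }
        }
        return false;
    }).collect(Collectors.toSet());
}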
