Reputation: 111
I need to implement a java method that gets set of strings and input string, and returns a subset of the strings, containing all strings from the original set that has any word starts with the input string. For example, if a string is "Stack Overflow", and the input is "Over", it should be in the subset. But if a string is "Stack Overflow", and the input is "flow, it should not be in the subset.
public Set<String> findMatches (Set<String> names, String input);
Since the set size is huge (100 milions) I need to do this in the most efficient way. Three ways I tried so far came with confusing results:
I tested these methods and measured times, but surprisingly - for different input values (the set of strings and the input string) - I got different results (option 1 got the best results in most cases, but very close to the other options results).
So which one will be the most efficient one? Is there another option I haven't thought of?
Upvotes: 0
Views: 506
Reputation: 3890
The data structure you need is trie.
In this explanation I mean that t_i
are small strings that should be prefixes of words, and s
is the large string that contains many words separated with whitespaces.
Just add all t_i
in the trie. Then iterate through s
characters:
If you meet a whitespace, go to the root of trie.
If you meet a letter, go from the current trie node to its child, associated with this letter. If there is no path, just skip all letters until you meet the next whitespace. If you reach a node that is linked to one of t_i
, add that string to answer.
This algorithm works in O(sum(length(t_i)) + length(s))
. I can write some code if needed.
All your algorithms and algorithm suggested by @DudeDoesThings work in O(sum(length(t_i)) * length(s))
which is far slower, especially when it comes to large inputs.
Upvotes: 4
Reputation: 729
If you indeed have many millions of strings and need efficiency I would advice against using either split or regexes. Perhaps you want to look into the Stream API, particularily the parallel streams if computation speed is what you care about:
public static void main(String[] args) {
Set<String> s = Arrays.stream(new String[] {
"Stack Overflow",
"Flowover Stack",
"Overflow Stack",
"Stackover Flow"
}).collect(Collectors.toSet());
System.out.println(findMatches(s, "Over"));
}
public static Set<String> findMatches (Set<String> names, String input) {
int inputLength = input.length();
return names.stream().parallel().filter(name -> {
int offset = 0;
while (offset >= 0 && offset + inputLength < name.length()) {
if (name.startsWith(input, offset)) {
return true;
}
offset = name.indexOf(" ", offset);
if (offset != -1) {
offset++;
}
}
return false;
}).collect(Collectors.toSet());
}
Upvotes: 2