Reputation: 35
So I have an Array, and I have to search for words
The Array:
0 1 2 3 4 5 6 7 8 9 10 11
text g t c a n d l e t j a q
The Keys:
2 can
2 candle
3 a
3 an
3 and
6 let
10 a
The number is an offset from the start of the array being searched, and the string is a word from the dictionary that was found at that offset. Note that several words can start at the same offset, and the same word can be found in multiple places. Note also that words can overlap.
This is the code I have written:
public ArrayList<Location> findWords(String[] dictionary, String text) {
int keyLength = text.length();
int dtLength = dictionary.length;
ArrayList<Location> results;
results = new ArrayList<>();
for (int k = 0; k < keyLength; k++) {
for (int d = 0; d < dtLength; d++) {
if (dthasKey(dictionary[d], text, k)) {
Location loc = new Location(k, dictionary[d]);
results.add(loc);
}
}
}
return results;
}
private boolean dthasKey(String key, String text, int pos) {
for (int i = 0; i < key.length(); i++) {
if (key.length() >= text.length() - pos)
return false;
while (key.charAt(i) != text.charAt(pos + i)) {
return false;
}
}
return true;
}
I was wondering if there is a better solution to this problem. If you guys may also provide a worst time complexity, that would be great. The one I have written is:
O(k*n*m)
where m is the size of the
dictionary, n is the size of the text, and k is the length of
the longest word.
Upvotes: 2
Views: 2102
Reputation: 134005
The standard solution to your problem is to use the Aho-Corasick string matching algorithm, which builds an automaton from the dictionary and then can quickly find all words in a string that you pass to it. A Google search reveals many Java implementations.
Building the automaton is O(n), where n is the number of characters in all words of the dictionary. But it's a one-time cost. You can use that automaton to search multiple documents for the words.
Searching documents for words is O(m + z), where m is the number of characters in the document, and z is the number of matches found.
I don't know if Aho-Corasick is the fastest algorithm, but it is very fast. And that there are existing Java implementations would be a big plus. But it's not especially difficult to implement. The original paper, Efficient String Matching: an Aid to Bibliographic Search, is very readable although it'll probably take a couple of iterations of reading, pondering, and reading again before it "clicks." And the pseudocode examples are detailed enough that you can use them as the basis of an implementation. I created a C# implementation for an article using that document as my only reference.
Upvotes: 3
Reputation: 8580
You can create an automaton for each of the words (that accepts only that word) and then run the given text through all the automatons simolteanously, this will end up O(m*k^2+n)
Upvotes: 0