An algorithm to search for words in an array of letters

Question

So I have an Array, and I have to search for words

The Array:

        0   1   2   3   4   5   6   7   8   9   10  11
text    g   t   c   a   n   d   l   e   t   j   a   q

The Keys:

2 can
2 candle
3 a
3 an
3 and
6 let
10 a

The number is an offset from the start of the array being searched, and the string is a word from the dictionary that was found at that offset. Note that several words can start at the same offset, and the same word can be found in multiple places. Note also that words can overlap.

This is the code I have written:

public ArrayList findWords(String[] dictionary, String text) {
    int keyLength = text.length();
    int dtLength = dictionary.length;

    ArrayList results;
    results = new ArrayList<>();

    for (int k = 0; k < keyLength; k++) {
        for (int d = 0; d < dtLength; d++) {
            if (dthasKey(dictionary[d], text, k)) {
                Location loc = new Location(k, dictionary[d]);
                results.add(loc);
            }
        }
    }
    return results;
}

private boolean dthasKey(String key, String text, int pos) {
    for (int i = 0; i < key.length(); i++) {
        if (key.length() >= text.length() - pos)
            return false;
        while (key.charAt(i) != text.charAt(pos + i)) {
            return false;
        }
    }
    return true;
}

I was wondering if there is a better solution to this problem. If you guys may also provide a worst time complexity, that would be great. The one I have written is: O(k*n*m) where m is the size of the dictionary, n is the size of the text, and k is the length of the longest word.

Jim Mischel · Accepted Answer

The standard solution to your problem is to use the Aho-Corasick string matching algorithm, which builds an automaton from the dictionary and then can quickly find all words in a string that you pass to it. A Google search reveals many Java implementations.

Building the automaton is O(n), where n is the number of characters in all words of the dictionary. But it's a one-time cost. You can use that automaton to search multiple documents for the words.

Searching documents for words is O(m + z), where m is the number of characters in the document, and z is the number of matches found.

I don't know if Aho-Corasick is the fastest algorithm, but it is very fast. And that there are existing Java implementations would be a big plus. But it's not especially difficult to implement. The original paper, Efficient String Matching: an Aid to Bibliographic Search, is very readable although it'll probably take a couple of iterations of reading, pondering, and reading again before it "clicks." And the pseudocode examples are detailed enough that you can use them as the basis of an implementation. I created a C# implementation for an article using that document as my only reference.

An algorithm to search for words in an array of letters

Answers (2)

Related Questions