what's the best data structure for pattern matching within a vocabulary?

Question

Given a set of vocabulary, what's the best data structure that can be used to find all the words within the vocabulary that matches a given sub string?

Suppose "Ap" is the substring,
"Apple" and "Application" should be returned.
Since in this case, "Ap" is at the start of the two strings, I can think of the use of Tries.

But what if the substring to be matched can be found anywhere in the words of the vocabulary?
Eg: If "ap" is given, "shape" should also be returned since "ap" occurs in "shape".

The vocabulary set is very large.

Ira Baxter · Accepted Answer

What you want is a suffix tree. This stores all suffixes of a (set of) strings in a trie (in your case, the set of words). Each leaf of the trie is associated with the the set of strings that have that suffix.

When searching for a substring, you simply match the substring at the root of trie; your substring must be a prefix of some suffix or there is no match. Discovering the existence of a match is linear time in the length of the substring. To determine all matching words, you have to enumerate all leaves of the trie accessible from the point where the match completes. That is a tree walk problem; if the tree has significant branching, it might be bit expensive.

You could precompute, for each trie node, the set of associated words; this is likely to be pretty big, but now you have an extremely fast determination of matching words.

If you only need the members of the set to examine until you find one with some nice property, I'd stick with the enumeration.

what's the best data structure for pattern matching within a vocabulary?

Answers (1)

Related Questions

what&#39;s the best data structure for pattern matching within a vocabulary?

Answers (1)

Related Questions

what's the best data structure for pattern matching within a vocabulary?