Reputation: 229988
Input:
Note - Only the first input changes, the second one is considered a constant, and is available for preprocessing.
Desired Output:
Identify all the matches of the expressions from item 2 inside the text. If there are match ambiguities, take the greedy match if possible.
Runtime should be relatively fast, though no strict performance requirements. A brute force attempt might suffice here.
What is a good algorithm for this? Are suffix trees useful here? How about going over all the expressions and putting them in a hashtable? Also note that I'm interested in practical solutions, so ease of implementation can be more useful than a super efficient algorithm...
Upvotes: 2
Views: 202
Reputation: 31435
The general "algorithm" assuming unlimited storage, for optimising this is to build up a tree on the data based on characters allowing you to search for your pattern recursively. In the tree index as you build, you traverse down until you reach a "unique" point and the "leaf" gives the location of where that unique occurrence is.
In the above paragraph for example the word "index" appears once. If the tree is built one character at a time then the tree path we follow would start with the "i" character then "in". If it is case sensitive there are just 3 occurrences (assuming, optimising and index). When we next search for 'd' we hit our unique result. Of course we could start our search first with the space, then the i and then the n and we'd follow a different path.
You could also make the tree case-insensitive, and you could use a "nybble" rather than a byte at each branch point.
Upvotes: 1