Reputation: 53
I've written a Java function that implements the Boyer-Moore algorithm to search for a given substring in a char array. It returns a list of every index where the substring is found in the array. For example, if the char array being searched contained the phrase "The Walking Dead" and the substring given as a parameter was "king", a list of size one containing the value 7 would be returned.
I would like to change this function so that only indexes of substrings that are full words in the char array would be returned. So the previous example would return an empty list, but if the substring was changed to "The", "Walking" or "Dead", lists of size 1 would be returned with values 0, 4, and 12 respectively.
Is this sort of functionality possible to implement using the Boyer-Moore algorithm? Are there any other string searching algorithms that would be able to produce these results efficiently?
Upvotes: 0
Views: 702
Reputation: 3141
Just use Java's Pattern - it already implements Boyer Moore internally. Then '\b' matches a word boundary. As in:
Pattern pattern = Pattern.compile("\\b" + Pattern.quote(needle) + "\\b");
Matcher m = pattern.matcher(haystack);
while (m.find()) {
System.out.println(m.start());
}
Upvotes: 0
Reputation: 719299
Yea, you could tweak Boyer-Moore to do that:
After each "match" you could check that the start and end positions for the match are at word boundaries.
You change the search from "king" to 'word-boundary + "king" + word-boundary' where 'word-boundary' is a pseudo-character that your modified B-M matches against any word-boundary character.
You could pre-process the input to replace all white-space, punctuation, etc with a special character that means "word-boundary", and then search that.
Which of these is likely to be better depends on how you implement them ... and whether you are going to repeatedly search the same input text.
Upvotes: 0
Reputation: 5333
This may not be the sort of answer you want, but you can change the arguments instead of the algorithm: Add a space to the beginning and end of your search string, as well as to the beginning and end of your target string (in case the first or last word are the hit). You'll need to treat punctuation and other non-word characters special as well.
Upvotes: 3