Reputation: 8486
I'm building a small filter utility for users to quickly filter a list of items and I want to match the beginnings of words in order, preferably using regular expressions:
Consider a user trying to find the item labeled here is some text
.
her — here is some text — \bher
so — here is some text — \bso
ext — no match — \bext
hist — here is some text — \bh.*?\bi.*?\bs.*?\bt
ht — here is some text — \bh.*?\bt
n
characters of several words:herst — here is some text
iso — here is some text
teh — no match
I'm doing this because my items often contain intialisms, and a user may type usc to try and quickly pull up USA, California
I'm rewriting the pattern for each input, so I can do a little work then, as is necessary in case #2. I'm looking for a solution that will scale linearly with character count, in either pattern complexity or total complexity.
Given these constraints, what is my best option for matching these strings?
Upvotes: 1
Views: 88
Reputation: 2742
It's difficult, if not impossible, to do this in a flexible manner with pure regexps. One possible approach that occurs to me is to first attempt a simple regexp match using word boundaries as you've already done, and then to generate a set of all possible prefix and suffix pairs and match against those. If, however, you want to be able to match against arbitrarily more than two separate words within a string, you should probably write a simple function that walks through the string being searched, attempting to match against the longest prefix from the query string. Once you find that longest prefix, you move on to the next word in the search string, and attempt to match against the remainder of the query (that is, minus the already matched prefix), and continue to do this until either the entire query has been matched, or the searched string ends. This should be fairly easy to implement recursively.
Upvotes: 0
Reputation: 117477
I don't think this is doable with the standard regexp libraries.
But given your constraints, you should be able to write your own parser to do the matching. Keep a stack of the pattern and then scan through the input text from start and forth. The only state you need to track is whether the previous character was a boundary or took an item off the stack. If you reach end of input without emptying the stack, it was a no-match.
In pseudo code:
pattern = "herst"
input = "here is some text"
state = true
until input.empty? or pattern.empty? do
if input[0] == pattern[0] and state
pattern.shift!
else
state = is_boundary(input[0])
endif
input.shift!
done
return pattern.empty?
Upvotes: 2
Reputation: 70460
Monstrosities like:
\bh(.*?\b)?e(.*?\b)?r(.*?\b)?s(.*?\b)?t
Essentially, every letter is either preceded by the previous one, or a random sequence ending with a word boundary (.*?\b)
. So, we make this random sequence + \b optional with ?
. So, breaking it up with (.*?\b)?
between all the letters should work.
Upvotes: 1
Reputation: 670
Try using ^<myregex>
for the beginning of the string and <myregex>$
for the end.
Upvotes: -2