I'm building a small filter utility for users to quickly filter a list of items and I want to match the beginnings of words in order, preferably using regular expressions: Consider a user trying to find the item labeled here is some text . I already know how to make it match the beginning of any one word: her — her e is some text — \bher so — here is so me text — \bso ext — no match — \bext And I know how to make it match first letter of several words: hist — h ere i s s ome t ext — \bh.*?\bi.*?\bs.*?\bt ht — h ere is some t ext — \bh.*?\bt What I need is the ability to match the first n characters of several words: herst — her e is s ome t ext iso — here i s so me text teh — no match I'm doing this because my items often contain intialisms, and a user may type usc to try and quickly pull up US A, C alifornia I'm rewriting the pattern for each input, so I can do a little work then, as is necessary in case #2. I'm looking for a solution that will scale linearly with character count, in either pattern complexity or total complexity. Given these constraints, what is my best option for matching these strings?

Reputation: 8486

Can I match the beginnings of words?

I'm building a small filter utility for users to quickly filter a list of items and I want to match the beginnings of words in order, preferably using regular expressions:

Consider a user trying to find the item labeled here is some text.

I already know how to make it match the beginning of any one word:

her — here is some text — \bher
so — here is some text — \bso
ext — no match — \bext

And I know how to make it match first letter of several words:

hist — here is some text — \bh.*?\bi.*?\bs.*?\bt
ht — here is some text — \bh.*?\bt

What I need is the ability to match the first n characters of several words:

herst — here is some text
iso — here is some text
teh — no match

I'm doing this because my items often contain intialisms, and a user may type usc to try and quickly pull up USA, California

I'm rewriting the pattern for each input, so I can do a little work then, as is necessary in case #2. I'm looking for a solution that will scale linearly with character count, in either pattern complexity or total complexity.

Given these constraints, what is my best option for matching these strings?

Upvotes: 1

Answers (4)

Greg E.

Reputation: 2742

It's difficult, if not impossible, to do this in a flexible manner with pure regexps. One possible approach that occurs to me is to first attempt a simple regexp match using word boundaries as you've already done, and then to generate a set of all possible prefix and suffix pairs and match against those. If, however, you want to be able to match against arbitrarily more than two separate words within a string, you should probably write a simple function that walks through the string being searched, attempting to match against the longest prefix from the query string. Once you find that longest prefix, you move on to the next word in the search string, and attempt to match against the remainder of the query (that is, minus the already matched prefix), and continue to do this until either the entire query has been matched, or the searched string ends. This should be fairly easy to implement recursively.

Upvotes: 0

troelskn

Reputation: 117477

I don't think this is doable with the standard regexp libraries.

But given your constraints, you should be able to write your own parser to do the matching. Keep a stack of the pattern and then scan through the input text from start and forth. The only state you need to track is whether the previous character was a boundary or took an item off the stack. If you reach end of input without emptying the stack, it was a no-match.

In pseudo code:

pattern = "herst"
input = "here is some text"
state = true
until input.empty? or pattern.empty? do
  if input[0] == pattern[0] and state
    pattern.shift!
  else
    state = is_boundary(input[0])
  endif
  input.shift!
done
return pattern.empty?

Upvotes: 2

Wrikken

Reputation: 70460

Monstrosities like:

 \bh(.*?\b)?e(.*?\b)?r(.*?\b)?s(.*?\b)?t

Essentially, every letter is either preceded by the previous one, or a random sequence ending with a word boundary (.*?\b). So, we make this random sequence + \b optional with ?. So, breaking it up with (.*?\b)? between all the letters should work.

Upvotes: 1

korko

Reputation: 670

Try using ^<myregex> for the beginning of the string and <myregex>$for the end.

Upvotes: -2

Can I match the beginnings of words?

Answers (4)

Related Questions