JeffThompson

Reputation: 1600

Regex match back to a period or start of string

I'd like to match a word, then get everything before it, back to the nearest preceding period or the start of the string.

For example, given this string and searching for the word "regex":

s = 'Do not match this. Or this. Or this either. I like regex. It is hard, but regex is also rewarding.'

It should return:

>> I like regex.
>> It is hard, but regex is also rewarding.

I'm trying to get my head around lookaheads and lookbehinds, but it seems you can't easily look back until you hit something; a lookbehind only checks what is immediately next to your pattern. I can get pretty close with this:

pattern = re.compile(r'(?:(?<=\.)|(?<=^))(.*?regex.*?\.)')

But the first match starts at the beginning of the string and runs all the way through the first sentence containing "regex":

>> Do not match this. Or this. Or this either. I like regex.  # no!
>> It is hard, but regex is also rewarding.                   # correct

Upvotes: 6

Views: 3602

Answers (1)

Casimir et Hippolyte

Reputation: 89584

You don't need to use lookarounds to do that. The negated character class is your best friend:

(?:[^\s.][^.]*)?regex[^.]*\.?

or

[^.]*regex[^.]*\.?

This way you match any characters before the word "regex" while forbidding any of them from being a dot.

The first pattern strips whitespace on the left; the second one is more basic.
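To see the difference between the two patterns, here is a minimal Python sketch run against the string from the question. The variable names stripped and basic are only illustrative, and re.findall is just one way to collect the matches.

```python
import re

s = ('Do not match this. Or this. Or this either. '
     'I like regex. It is hard, but regex is also rewarding.')

# First pattern: the optional group must start on a character that is
# neither whitespace nor a dot, so leading spaces are left out.
stripped = re.findall(r'(?:[^\s.][^.]*)?regex[^.]*\.?', s)

# Second pattern: simpler, but keeps the space that follows the previous dot.
basic = re.findall(r'[^.]*regex[^.]*\.?', s)

print(stripped)
print(basic)
```

With this input, stripped holds the two sentences without leading spaces, while each item in basic starts with the space that follows the previous period.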

About your pattern:

Don't forget that a regex engine tries to succeed at each position, working from left to right through the string. That's why something like (?:(?<=\.)|(?<=^)).*?regex doesn't always return the shortest substring between a dot or the start of the string and the word "regex", even if you use a non-greedy quantifier. The leftmost position always wins, and a non-greedy quantifier keeps taking characters until the next subpattern succeeds.
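A small Python sketch of the "leftmost position wins" rule, using the string from the question. The names lazy and negated are illustrative only.

```python
import re

s = ('Do not match this. Or this. Or this either. '
     'I like regex. It is hard, but regex is also rewarding.')

# The lookbehind alternative already succeeds at position 0 (start of string),
# so the lazy .*? has to stretch across earlier sentences to reach "regex".
lazy = re.search(r'(?:(?<=\.)|(?<=^)).*?regex', s)

# [^.]* cannot cross a dot, so every attempt that starts in an earlier
# sentence fails and the match begins just after the last preceding dot.
negated = re.search(r'[^.]*regex', s)

print(lazy.start(), repr(lazy.group()))
print(negated.start(), repr(negated.group()))
```

The lazy match starts at index 0 and spans three unwanted sentences. The negated-class match starts right after the period that precedes "I like regex".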

As an aside, the negated character class is useful here once more:
to shorten (?:(?<=\.)|(?<=^)) you can write (?<![^.]), which means "not preceded by a character that isn't a dot".
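As a quick check, this Python sketch lists the positions where each form matches in a short sample string. The sample string t and the names long_form and short_form are made up for illustration, and the comparison assumes the default flags (no re.MULTILINE).

```python
import re

t = 'a.b.c'  # hypothetical sample string

# Positions preceded by a dot, or at the start of the string.
long_form = [m.start() for m in re.finditer(r'(?:(?<=\.)|(?<=^))', t)]

# Positions NOT preceded by a non-dot character: the same set of positions.
short_form = [m.start() for m in re.finditer(r'(?<![^.])', t)]

print(long_form)
print(short_form)
```

Both lists come out as [0, 2, 4]: the start of the string and the position after each dot.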

Upvotes: 7
