flpoirier

Reputation: 41

How to restrict regular expression to smaller capture

Here's my text:

"A popular resource for the Christian community in the Asheville area."
"I love the acting community in the Orange County area."

I'd like to capture "Asheville" and "Orange County". How can I start capturing from the closest "the" to "area"?

Here's my regex:

/the (.+?) area/

It captures:

"Christian community in the Asheville"
"acting community in the Orange County"

Upvotes: 3

Views: 84

Answers (3)

Mobiusis

Reputation: 21

(?<=in the)(.*)(?=area)

(?<=...) is a lookbehind and (?=...) is a lookahead. Each asserts that the text typed after the = sign is present but excludes it from the match. In this case, 'in the' and 'area' are excluded from the result, although the spaces next to them are still part of the match.

(.*) is used here, which is greedy, but you can use (.*?) to stop at the first 'area' that follows.
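A minimal Ruby sketch of the lookaround approach, using the lazy (.*?) form and the two sentences from the question. The names SENTENCES, LOOKAROUND and place_names are made up for illustration, and the pattern assumes the wanted text is always introduced by the exact phrase 'in the':

```ruby
# Sample sentences from the question.
SENTENCES = [
  'A popular resource for the Christian community in the Asheville area.',
  'I love the acting community in the Orange County area.'
].freeze

# The lookbehind and lookahead assert 'in the' and 'area' without consuming them.
LOOKAROUND = /(?<=in the)(.*?)(?=area)/

# The capture keeps the surrounding spaces, so strip them off.
def place_names(sentences)
  sentences.map { |s| s[LOOKAROUND, 1]&.strip }
end

p place_names(SENTENCES)
# => ["Asheville", "Orange County"]
```

If the text before the place name is not always 'in the', this pattern returns nil for that sentence.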

Upvotes: 2

degant

Reputation: 4981

Use a tempered greedy solution, so that the matched text cannot contain another the. That way the match always starts at the last the before area.

/the (?:(?!the).)+? area/
  • (?:(?!the).)+? represents a tempered greedy dot: it matches one character at a time, but only where that character does not begin the text the. The negative lookahead (?!the) performs that check before each character, so the tempered part of the match never contains the text the.
  • This can be refined by adding a capturing group to extract just the text between the and area. Another way is to turn the and area into a lookbehind and a lookahead, though that will be a bit slower than a capturing group.
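The two refinements from the last bullet might look like this in Ruby. This is only a sketch: the constant names are hypothetical, and the lookaround variant moves the spaces into the assertions so that nothing needs trimming.

```ruby
TEXT = 'A popular resource for the Christian community in the Asheville area.'

# Variant 1: a capturing group around the tempered dot.
WITH_GROUP = /the ((?:(?!the).)+?) area/

# Variant 2: lookbehind and lookahead, so the whole match is the wanted text.
WITH_LOOKAROUND = /(?<=the )(?:(?!the).)+?(?= area)/

puts TEXT[WITH_GROUP, 1]   # => Asheville
puts TEXT[WITH_LOOKAROUND] # => Asheville
```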

Regex101 Demo

Rubular Demo

Read more about tempered greedy solution and when to use it.

Upvotes: 2

Wiktor Stribiżew

Reputation: 626851

Use a (?:(?!the).)+? tempered greedy token:

/the ((?:(?!the).)+?) area/

See the regex demo. It is almost the same as /the ([^t]*(?:t(?!he)[^t]*)*?) area/, but the latter is a bit more efficient since it is an unrolled pattern.
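As a quick check that the two patterns agree on the sentences from the question, here is a small Ruby comparison. The names TEMPERED, UNROLLED and SAMPLES are illustrative only, and no timing is measured here:

```ruby
TEMPERED = /the ((?:(?!the).)+?) area/
UNROLLED = /the ([^t]*(?:t(?!he)[^t]*)*?) area/

SAMPLES = [
  'A popular resource for the Christian community in the Asheville area.',
  'I love the acting community in the Orange County area.'
].freeze

# Print the capture of each pattern side by side.
SAMPLES.each do |s|
  puts "#{s[TEMPERED, 1]} | #{s[UNROLLED, 1]}"
end
# Asheville | Asheville
# Orange County | Orange County
```

One difference to keep in mind: [^t] also matches a newline, while the plain dot in the tempered version does not unless /m is set.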

The (?:(?!the).)+? part matches one or more chars (as few as possible), none of which starts a the character sequence.

To make it safer, add word boundaries to only match whole words:

/\bthe ((?:(?!\bthe\b).)+?) area\b/
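To see why the word boundaries matter, consider a place name that happens to contain the letters the. The sentence below is an invented example, not from the question, and the constant names are hypothetical:

```ruby
SENTENCE = 'I love the food in the Netherlands area.'

LOOSE  = /the ((?:(?!the).)+?) area/
STRICT = /\bthe ((?:(?!\bthe\b).)+?) area\b/

# Without boundaries, the "the" inside "Netherlands" blocks the tempered dot.
p SENTENCE[LOOSE, 1]   # => nil

# With boundaries, only the standalone word "the" is rejected.
p SENTENCE[STRICT, 1]  # => "Netherlands"
```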

Ruby demo:

s = 'I love the acting community in the Orange County area.'
puts s[/the ((?:(?!the).)+?) area/,1]
# => Orange County

NOTE: if you expect the match to span multiple lines, do not forget to add the /m modifier, which lets . match newlines in Ruby:

/the ((?:(?!the).)+?) area/m
                           ^
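A short sketch of the effect of /m, using the question's second sentence with a line break inserted inside the place name (the line break and the constant name TWO_LINES are assumptions added for illustration):

```ruby
TWO_LINES = "I love the acting community in the Orange\nCounty area."

# Without /m the dot cannot cross the newline, so nothing matches.
p TWO_LINES[/the ((?:(?!the).)+?) area/, 1]   # => nil

# With /m the dot also matches "\n", so the capture spans both lines.
p TWO_LINES[/the ((?:(?!the).)+?) area/m, 1]  # => "Orange\nCounty"
```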

Upvotes: 2
