Reputation: 43
Given the following text, what PCRE regular expression would you use to extract the parts marked in bold?
00:20314 lorem ipsum want this kryptonite
00:02314 quux padding dont want this
00:03124 foo neither this
00:01324 foo but we want this stalagmite
00:02134 tralala not this
00:03124 bar foo and we want this kryptonite but not this(!)
00:02134 foo bar and not this either
00:01234 dolor sit amet EOF
IOW, we want to extract sections that start, in regex terms, with "^0" and end with "(kryptonite|stalagmite)".
Been chomping on this for a bit, finding it a hard nut to crack. TIA!
Upvotes: 4
Views: 1918
Reputation: 70732
One way to do this is a negative lookahead combined with the inline (?sm) dotall and multi-line modifiers. The (?!^0) check keeps the lazy scan from running past the start of the next record:
(?sm)^0(?:(?!^0).)*?(?:kryptonite|stalagmite)
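To see what that pattern picks up, here is a minimal sketch in Python, whose re module happens to accept this particular pattern unchanged. The sample text is hypothetical: the line breaks inside each record are an assumption, since the question only makes clear that records begin with 0 at the start of a line.

```python
import re

# Hypothetical sample: the line breaks inside each record are assumed;
# only the record starts ("00:...") are certain from the question.
SAMPLE = "\n".join([
    "00:20314 lorem ipsum",
    "want this",
    "kryptonite",
    "00:02314 quux",
    "padding",
    "dont want this",
    "00:01324 foo",
    "but we want this",
    "stalagmite",
    "00:03124 bar foo",
    "and we want this",
    "kryptonite",
    "but not this(!)",
    "00:01234 dolor sit amet",
    "EOF",
])

# (?s): "." also matches newlines; (?m): "^" matches at every line start.
# (?!^0) refuses to step onto the start of the next record, so a record
# without a keyword cannot borrow one from a later record.
PATTERN = re.compile(r"(?sm)^0(?:(?!^0).)*?(?:kryptonite|stalagmite)")

# No capturing groups, so findall returns the full text of each match.
sections = PATTERN.findall(SAMPLE)
for section in sections:
    print(repr(section))
```

Records with no keyword (the quux one, for instance) simply fail to match and are skipped.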
Upvotes: 4
Reputation: 20486
I believe this will be the most efficient:
^0(?:\R(?!\R)|.)*?\b(?:kryptonite|stalagmite)\b
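Obviously we start with ^0 (this needs the multi-line modifier on, otherwise ^ only matches at the very start of the subject) and then end with either kryptonite or stalagmite (in a non-capturing group, for the heck of it), surrounded by \b word boundaries.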
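(?:\R(?!\R)|.)*? is the interesting part though, so let's break it down. One key concept first is PCRE's \R, which matches any newline sequence (with \r\n treated as a single unit). Note that the "not followed by another newline" check means this stops at a blank line, so it relies on the records being separated by blank lines.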
(?:       (?# start non-capturing group for repetition)
  \R      (?# match a newline sequence)
  (?!\R)  (?# not followed by another newline sequence)
 |        (?# OR)
  .       (?# match any character, except newline)
)*?       (?# lazily repeat this group)
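A rough way to try the idea outside PCRE, sketched in Python. Two assumptions to flag: Python's re has no \R, so a hand-written stand-in (the NEWLINE name is made up here) covers only \r\n, \n and \r; and the sample separates records with blank lines, which is the layout this pattern depends on.

```python
import re

# Stand-in for PCRE's \R, which Python's re module does not support.
# Only the common newline sequences are covered (an assumption).
NEWLINE = r"(?:\r\n|\n|\r)"

# re.MULTILINE lets "^" match at each line start. "." is left in its
# default mode, so it never matches "\n" on its own.
PATTERN = re.compile(
    r"^0(?:" + NEWLINE + r"(?!" + NEWLINE + r")|.)*?\b(?:kryptonite|stalagmite)\b",
    re.MULTILINE,
)

# Hypothetical sample with blank lines between records.
SAMPLE = (
    "00:20314 lorem ipsum\nwant this\nkryptonite\n"
    "\n"
    "00:02314 quux\npadding\ndont want this\n"
    "\n"
    "00:01324 foo\nbut we want this\nstalagmite\n"
)

matches = PATTERN.findall(SAMPLE)
for match in matches:
    print(repr(match))
```

The middle record is dropped because the scan cannot cross the blank line that follows it, and it contains no keyword before that point.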
Upvotes: 2
Reputation:
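This looks like it works. The compact form is on the comment line; below it is the same pattern spread out for readability, so it carries the x (extended) modifier to make the whitespace insignificant.
# (?ms)^0(?:(?!(?:^0|kryptonite|stalagmite)).)*(kryptonite|stalagmite)
(?msx)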
^ 0
(?:
(?!
(?: ^ 0 | kryptonite | stalagmite )
)
.
)*
( kryptonite | stalagmite )
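For anyone who wants to run the expanded layout directly, a sketch using Python's re.VERBOSE, which plays the role of the x modifier. The sample text and its internal line breaks are assumptions, not the asker's actual data.

```python
import re

PATTERN = re.compile(
    r"""
    ^ 0                                      # record starts with 0 at a line start
    (?:
        (?! ^ 0 | kryptonite | stalagmite )  # stop before the next record or a keyword
        .                                    # otherwise consume one character
    )*
    ( kryptonite | stalagmite )              # group 1: keyword that ends the section
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Hypothetical sample; line breaks inside records are assumed.
SAMPLE = "\n".join([
    "00:20314 lorem ipsum",
    "want this",
    "kryptonite",
    "00:02314 quux",
    "padding",
    "dont want this",
    "00:01324 foo",
    "but we want this",
    "stalagmite",
    "00:03124 bar foo",
    "and we want this",
    "kryptonite",
    "but not this(!)",
    "00:01234 dolor sit amet",
    "EOF",
])

# The pattern has a capturing group, so use finditer and take group(0)
# for the whole section and group(1) for the closing keyword.
results = [(m.group(0), m.group(1)) for m in PATTERN.finditer(SAMPLE)]
for section, keyword in results:
    print(keyword, repr(section))
```

Because the lookahead also excludes the keywords themselves, the repetition can be greedy here: it cannot run past the first keyword in a record.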
Upvotes: 3