Reputation: 867
I have some text like this:
Note: this is example text so the content is unimportant
CAT SAT ON A DOG
REASON: No reason
CONCERN: He was cold
BECAUSE: Cold weather

CAT SAT ON A MOUSE
REASON: He eats mice
CONCERN: He was hungry
BECAUSE: Can opener didn't work

CAT SAT ON A HORSE
REASON: He wants to ride
CONCERN: He might fall off
BECAUSE: Saddle is too big
I am trying to write a regular expression that captures only the 'CAT SAT ON A MOUSE' paragraph, but I am having problems capturing the full text.
I have tried:
(\bCAT\sSAT\sON\sA\sMOUSE)(.*)\n{2}
The idea was to match the beginning of the paragraph and then capture everything up to two consecutive line breaks.
\n{2} is there to match the two line breaks.
I have tried many more variations but all I manage to do is to capture the first line only.
Any sort of help would be really appreciated.
Upvotes: 4
Views: 30483
Reputation:
This might work:
(\bCAT[^\S\n]SAT[^\S\n]ON[^\S\n]A[^\S\n]MOUSE\b[\s\S]*?)\n{2}
or
(\bCAT[^\S\n]+SAT[^\S\n]+ON[^\S\n]+A[^\S\n]+MOUSE\b[\s\S]*?)\n{2}
Edit - The quantifier after the first anchor must be held back; otherwise a greedy
match would run past the nearest blank line and stop at the last one instead. This can
be done with a non-greedy quantifier or with a look-ahead assertion (which lets the
quantifier stay greedy, at the cost of a check at every position that largely cancels
out the speed advantage).
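To make the greedy versus non-greedy difference concrete, here is a minimal JavaScript sketch. It assumes LF line endings and a sample text modeled on the question, with a made-up trailing "END" paragraph so that a later blank line exists; the variable names are illustrative only.

```javascript
// Paragraphs modeled on the question's example text (LF line endings assumed).
const mouse = [
  "CAT SAT ON A MOUSE",
  "REASON: He eats mice",
  "CONCERN: He was hungry",
  "BECAUSE: Can opener didn't work",
].join("\n");
const horse = [
  "CAT SAT ON A HORSE",
  "REASON: He wants to ride",
  "CONCERN: He might fall off",
  "BECAUSE: Saddle is too big",
].join("\n");
// "END" is a made-up trailing paragraph so a blank line follows the HORSE block.
const sample = ["CAT SAT ON A DOG\nREASON: No reason", mouse, horse, "END"].join("\n\n");

// Greedy: [\s\S]* grabs everything, then backtracks to the LAST blank line.
const greedy = /(\bCAT[^\S\n]+SAT[^\S\n]+ON[^\S\n]+A[^\S\n]+MOUSE\b[\s\S]*)\n{2}/;
// Non-greedy: [\s\S]*? grows one character at a time and stops at the FIRST blank line.
const lazy = /(\bCAT[^\S\n]+SAT[^\S\n]+ON[^\S\n]+A[^\S\n]+MOUSE\b[\s\S]*?)\n{2}/;

const greedyCapture = sample.match(greedy)[1];
const lazyCapture = sample.match(lazy)[1];

console.log("greedy spans the HORSE paragraph too:", greedyCapture.includes("HORSE"));
console.log("lazy capture:\n" + lazyCapture);
```

The greedy pattern still matches, which is why the problem is easy to miss: it simply captures more than one paragraph.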
Edit2 - Sometimes it may be desirable to match an 'apparent' gap between paragraphs that could include non-newline whitespace.
For example, \n\n will not match an apparent gap like this:
'start ... \nend of paragraph\n \n' when it should.
In that case, replacing \n{2} with \n[^\S\n]*\n will allow it to match.
Furthermore, since the non-greedy quantifier is used (in this case [\s\S]*?), it is possible to account for and match the paragraph end when it is at or near the end of file. Putting this all together yields:
/(\bCAT\s+SAT\s+ON\s+A\s+MOUSE\b[\s\S]*?)($|\n[^\S\n]*\n)/
which now looks pretty complicated, but does the complete job.
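As a quick check of that combined pattern, the following JavaScript sketch runs it against two made-up inputs: one where the "blank" line contains a space and a tab, and one where the paragraph runs to the end of the input. The helper name extractMouseParagraph is hypothetical, and the inputs are shortened versions of the question's text.

```javascript
// The combined pattern from above, unchanged.
const paragraphPattern = /(\bCAT\s+SAT\s+ON\s+A\s+MOUSE\b[\s\S]*?)($|\n[^\S\n]*\n)/;

// Hypothetical helper: returns the captured paragraph, or null when nothing matches.
function extractMouseParagraph(text) {
  const match = paragraphPattern.exec(text);
  return match ? match[1] : null;
}

// The gap after the MOUSE paragraph is "\n \t\n": visually blank, but not "\n\n".
const withWhitespaceGap =
  "CAT SAT ON A DOG\nREASON: No reason\n \n" +
  "CAT SAT ON A MOUSE\nREASON: He eats mice\n \t\n" +
  "CAT SAT ON A HORSE\nREASON: He wants to ride";

// No blank line after the paragraph at all: the $ alternative ends the match.
const atEndOfInput =
  "CAT SAT ON A DOG\nREASON: No reason\n\n" +
  "CAT SAT ON A MOUSE\nREASON: He eats mice";

console.log(extractMouseParagraph(withWhitespaceGap));
console.log(extractMouseParagraph(atEndOfInput));
console.log(extractMouseParagraph("CAT SAT ON A DOG")); // no MOUSE paragraph
```

Note that without the m flag, $ in JavaScript matches only at the very end of the input, so a single trailing newline would end up inside the capture.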
Upvotes: 0
Reputation: 107040
What language are you working with? That'll help a bit. In Perl, you can add the s
modifier so that . also matches newlines, which lets the pattern treat the multi-line string as a single piece of text:
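#! /usr/bin/perl
use strict;
use warnings;
use feature 'say';    # say is not enabled by default

my $string = <<STRING;
CAT SAT ON A MOUSE
REASON: He eats mice
CONCERN: He was hungry
BECAUSE: Can opener didn't work

This is a test, and not part of the string to match.
STRING

# /s lets . match newlines; the non-greedy .*? stops at the first blank line.
if ($string =~ /^(CAT[^\n]+.*?)\n\n/s) {
    say "Match: $1";
}
else {
    say "Didn't match";
}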
In Perl, adding the s on the end treats the entire string as a single line, so . matches newline characters as well.
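For readers not using Perl, here is a rough JavaScript equivalent. It is only a sketch and assumes an engine that supports the s (dotAll) flag, which was added in ES2018. It also shows how s differs from m: s changes what . matches, while m changes where ^ and $ may match. The sample strings are made up.

```javascript
const block =
  "CAT SAT ON A MOUSE\nREASON: He eats mice\n\n" +
  "This is a test, and not part of the string to match.\n";

// Without s, . stops at every newline, so the pattern cannot reach the blank line.
const withoutS = /^(CAT[^\n]+.*?)\n\n/;
// With s, . also matches newlines, and the non-greedy .*? stops at the first blank line.
const withS = /^(CAT[^\n]+.*?)\n\n/s;

const noFlagResult = withoutS.exec(block);
const sFlagResult = withS.exec(block);

// When the paragraph is not at the start of the input, ^ needs the m flag as well.
const prefixed = "intro line\n" + block;
const sOnly = /^(CAT[^\n]+.*?)\n\n/s.exec(prefixed);
const sAndM = /^(CAT[^\n]+.*?)\n\n/ms.exec(prefixed);

console.log("no flag:", noFlagResult);
console.log("s flag:", sFlagResult && sFlagResult[1]);
console.log("s only, prefixed input:", sOnly);
console.log("m and s, prefixed input:", sAndM && sAndM[1]);
```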
Upvotes: 0
Reputation: 75222
I think your main problem is that your text uses \r\n to separate lines, and you're only looking for \n. Try this:
/^(CAT +SAT +ON +A +MOUSE)(?:(?:\r\n|[\r\n])[^\r\n]+)*/m
(?:\r\n|[\r\n]) matches any of the three most common line separators (which I'll call newlines): \r\n, \r, or \n. It matches exactly one newline at a time, no matter which kind it is. Then [^\r\n]+ takes over, so there can only be one line separator per line. Since paragraphs are delimited by two newlines, the match ends there.
I took the liberty of anchoring the first line with a start anchor (^) in multiline mode (m). It's not absolutely necessary to do that, but it helps the regex find a match more quickly and, more importantly, fail more quickly when no match is possible.
(You haven't said which regex flavor you're working with, so I made a wild guess and used JavaScript syntax.)
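Here is a small JavaScript sketch of that diagnosis. It assumes the input really does use \r\n line endings, which the question does not confirm, and the sample text is a shortened version of the question's example.

```javascript
// Sample text with Windows-style \r\n line endings (an assumption about the input).
const crlfText = [
  "CAT SAT ON A DOG", "REASON: No reason", "",
  "CAT SAT ON A MOUSE", "REASON: He eats mice", "CONCERN: He was hungry", "",
  "CAT SAT ON A HORSE", "REASON: He wants to ride",
].join("\r\n");

// A pattern that waits for \n{2} never matches: the blank line is "\r\n\r\n" here,
// so two \n characters are never adjacent.
const lfOnly = /(\bCAT\sSAT\sON\sA\sMOUSE)([\s\S]*?)\n{2}/;
const lfOnlyResult = lfOnly.exec(crlfText);

// The pattern from this answer accepts \r\n, \r, or \n, one newline per line.
const anyNewline = /^(CAT +SAT +ON +A +MOUSE)(?:(?:\r\n|[\r\n])[^\r\n]+)*/m;
const anyNewlineResult = anyNewline.exec(crlfText);

// Split the matched paragraph into lines, again tolerating any newline style.
const paragraphLines = anyNewlineResult[0].split(/\r\n|[\r\n]/);

console.log("\\n{2} pattern:", lfOnlyResult);
console.log("newline-agnostic pattern:", paragraphLines);
```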
Upvotes: 1
Reputation: 9322
You were asking for anything then two line breaks. You needed to ask for a line break followed by anything twice.
Try this one:
(\bCAT\sSAT\sON\sA\sMOUSE)(\n.*){2}
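A short JavaScript sketch of what this pattern captures, assuming LF line endings. Note that {2} takes exactly two lines after the heading, so with the three-line paragraphs from the question the BECAUSE line is left out. The second pattern is my own variation, not part of the answer above: it keeps taking lines until it reaches an empty one.

```javascript
const lfText = [
  "CAT SAT ON A DOG", "REASON: No reason", "CONCERN: He was cold", "BECAUSE: Cold weather", "",
  "CAT SAT ON A MOUSE", "REASON: He eats mice", "CONCERN: He was hungry",
  "BECAUSE: Can opener didn't work", "",
  "CAT SAT ON A HORSE", "REASON: He wants to ride",
].join("\n");

// The pattern from this answer: the heading, then exactly two "\n + rest of line" groups.
const exactlyTwo = /(\bCAT\sSAT\sON\sA\sMOUSE)(\n.*){2}/;
const exactlyTwoLines = exactlyTwo.exec(lfText)[0].split("\n");

// Variation (an assumption, not from the answer): repeat "\n + non-empty line" as often
// as possible. The repetition stops at the blank line because .+ needs at least one character.
const untilBlank = /\bCAT\sSAT\sON\sA\sMOUSE(?:\n.+)*/;
const untilBlankLines = untilBlank.exec(lfText)[0].split("\n");

console.log("{2} version:", exactlyTwoLines);
console.log("until-blank version:", untilBlankLines);
```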
Upvotes: 2