iali

Reputation: 867

Regular expression to capture multiple lines

I have some text like this:

Note: this is example text so the content is unimportant

CAT SAT ON A DOG
REASON:  No reason
CONCERN:  He was cold
BECAUSE:  Cold weather

CAT SAT ON A MOUSE
REASON:  He eats mice
CONCERN:  He was hungry
BECAUSE:  Can opener didn't work

CAT SAT ON A HORSE
REASON:  He wants to ride
CONCERN:  He might fall off
BECAUSE:  Saddle is too big

I am trying to write a regular expression that could capture only the 'CAT SAT ON A MOUSE' part, but am having problems capturing the full text.

I have tried:

(\bCAT\sSAT\sON\sA\sMOUSE)(.*)\n{2}

The idea was to match the beginning of the string and then capture everything up to two line breaks.

The {2} is there to match the two line breaks.

I have tried many more variations but all I manage to do is to capture the first line only.

Any sort of help would be really appreciated.

Upvotes: 4

Views: 30483

Answers (4)

user557597

This might work:

(\bCAT[^\S\n]SAT[^\S\n]ON[^\S\n]A[^\S\n]MOUSE\b[\s\S]*?)\n{2}
or
(\bCAT[^\S\n]+SAT[^\S\n]+ON[^\S\n]+A[^\S\n]+MOUSE\b[\s\S]*?)\n{2}

Edit - The regex must not run greedily after the first anchor; otherwise the nearest
paragraph break could be skipped and the match would only stop at the last one. This can
be prevented with a non-greedy quantifier or a look-ahead assertion (which lets the
quantifier stay greedy, at the cost of a check at every position that basically nullifies its speed).
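
To make the difference concrete, here is a minimal sketch in Python (the question does not name a regex flavor, so Python is an assumption). It compares the non-greedy pattern above with a greedy copy and with one possible reading of the look-ahead alternative. The extra blank line at the end of the sample text is also an assumption, added so that the greedy version has a later paragraph break to run to.

```python
import re

# Sample text from the question. The trailing blank line is an assumption,
# added so there is a second paragraph break after the MOUSE paragraph.
TEXT = (
    "CAT SAT ON A MOUSE\n"
    "REASON:  He eats mice\n"
    "CONCERN:  He was hungry\n"
    "BECAUSE:  Can opener didn't work\n"
    "\n"
    "CAT SAT ON A HORSE\n"
    "REASON:  He wants to ride\n"
    "CONCERN:  He might fall off\n"
    "BECAUSE:  Saddle is too big\n"
    "\n"
)

HEAD = r"\bCAT[^\S\n]+SAT[^\S\n]+ON[^\S\n]+A[^\S\n]+MOUSE\b"

# Non-greedy: stops at the first blank line after the heading.
lazy = re.search("(" + HEAD + r"[\s\S]*?)\n{2}", TEXT).group(1)

# Greedy copy: backtracks from the end, so it stops at the LAST blank line.
greedy = re.search("(" + HEAD + r"[\s\S]*)\n{2}", TEXT).group(1)

# Look-ahead variant (one possible interpretation of the edit above):
# consume a character only if a paragraph break does not start there.
guarded = re.search("(" + HEAD + r"(?:(?!\n{2})[\s\S])*)\n{2}", TEXT).group(1)

print(lazy)
print("HORSE" in greedy)   # the greedy match swallowed the next paragraph
print(guarded == lazy)
```

The non-greedy and look-ahead versions capture the same four lines; only the greedy copy runs on into the HORSE paragraph.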

Edit2 - Sometimes it may be desirable to match an 'apparent' gap between paragraphs that could include non-newline whitespace.

For example \n\n will not match an apparent gap like this:
'start ... \nend of paragraph\n \n' when it should.

In that case, replacing \n{2} with \n[^\S\n]*\n will allow it to match.
Furthermore, since a non-greedy quantifier is used (in this case [\s\S]*?),
it is also possible to match a paragraph that ends at or near the end of the file. Putting this all together yields:

/(\bCAT\s+SAT\s+ON\s+A\s+MOUSE\b[\s\S]*?)($|\n[^\S\n]*\n)/

which now looks pretty complicated, but does the complete job.
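
As a quick check of the two cases this pattern is meant to handle, here is a small Python sketch (Python is an assumption; the sample strings are made up for illustration). One input has a "blank" line that actually contains a space and a tab, the other ends right after the paragraph with no blank line at all.

```python
import re

# The final pattern from above, without the / delimiters.
PARAGRAPH_END = re.compile(
    r"(\bCAT\s+SAT\s+ON\s+A\s+MOUSE\b[\s\S]*?)($|\n[^\S\n]*\n)"
)

# Hypothetical inputs: a gap line holding stray whitespace, and a paragraph
# that runs straight into the end of the text.
with_ws_gap = "CAT SAT ON A MOUSE\nREASON:  He eats mice\n \t\nCAT SAT ON A HORSE\n"
at_eof = "CAT SAT ON A DOG\n\nCAT SAT ON A MOUSE\nREASON:  He eats mice"

gap_result = PARAGRAPH_END.search(with_ws_gap).group(1)
eof_result = PARAGRAPH_END.search(at_eof).group(1)

# A plain \n{2} terminator finds nothing in the first input, because the
# two newlines are separated by whitespace.
plain_result = re.search(r"MOUSE[\s\S]*?\n{2}", with_ws_gap)

print(repr(gap_result))
print(repr(eof_result))
print(plain_result)
```

Both searches should capture the heading plus the REASON line, while the plain \n{2} version returns no match for the whitespace-gap input.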

Upvotes: 0

David W.

Reputation: 107040

What language are you working with? That would help a bit. In Perl, you can add the s modifier so that . also matches newlines, which lets the pattern treat the multi-line string as a single piece of text:

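#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';    # say() is not enabled by default

my $string = <<STRING;
CAT SAT ON A MOUSE
REASON:  He eats mice
CONCERN:  He was hungry
BECAUSE:  Can opener didn't work

This is a test, and not part of the string to match.
STRING

# /s lets . match newlines; the non-greedy .*? stops at the first blank line.
if ($string =~ /^(CAT[^\n]+.*?)\n\n/s) {
    say "Match: $1";
}
else {
    say "Didn't match";
}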

In Perl, adding the s on the end treats the entire string as a single line.
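
If you are not using Perl, most flavors have an equivalent switch. As a rough sketch, assuming Python: re.DOTALL plays the role of Perl's /s. The snippet runs the same pattern with and without the flag so you can see that the flag is what lets the match cross line boundaries.

```python
import re

text = (
    "CAT SAT ON A MOUSE\n"
    "REASON:  He eats mice\n"
    "CONCERN:  He was hungry\n"
    "BECAUSE:  Can opener didn't work\n"
    "\n"
    "This is a test, and not part of the string to match.\n"
)

pattern = r"^(CAT[^\n]+.*?)\n\n"

# Without DOTALL, . cannot cross a newline, so the blank line is never reached.
without_flag = re.search(pattern, text)

# With DOTALL (Python's counterpart to Perl's /s), . matches newlines too.
with_flag = re.search(pattern, text, re.DOTALL)

print(without_flag)
print(with_flag.group(1))
```

Without the flag the search finds nothing; with it, group 1 holds the four lines of the paragraph.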

Upvotes: 0

Alan Moore

Reputation: 75222

I think your main problem is that your text uses \r\n to separate lines, and you're only looking for \n. Try this:

/^(CAT +SAT +ON +A +MOUSE)(?:(?:\r\n|[\r\n])[^\r\n]+)*/m

(?:\r\n|[\r\n]) matches any of the three most common line separators (which I'll call newlines): \r\n, \r, or \n. It matches exactly one newline at a time, no matter which kind it is. Then [^\r\n]+ takes over, so there can only be one line separator per line. Since paragraphs are delimited by two newlines, the match ends there.

I took the liberty of anchoring the first line with a start anchor (^) in multiline mode (m). It's not absolutely necessary to do that, but it helps the regex find a match more quickly and, more importantly, fail more quickly when no match is possible.

(You haven't said which regex flavor you're working with, so I made a wild guess and used JavaScript syntax.)
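
For anyone who wants to try this outside JavaScript, here is the same pattern as a minimal Python sketch (Python is an assumption, and mouse_paragraph is just an illustrative helper name). It runs the regex against the same text once with \n line endings and once with \r\n.

```python
import re

# Same pattern as above; re.MULTILINE corresponds to the m flag.
PARAGRAPH = re.compile(
    r"^(CAT +SAT +ON +A +MOUSE)(?:(?:\r\n|[\r\n])[^\r\n]+)*",
    re.MULTILINE,
)


def mouse_paragraph(text):
    """Return the whole matched paragraph, or None (hypothetical helper)."""
    match = PARAGRAPH.search(text)
    return match.group(0) if match else None


unix_text = (
    "CAT SAT ON A DOG\n"
    "REASON:  No reason\n"
    "\n"
    "CAT SAT ON A MOUSE\n"
    "REASON:  He eats mice\n"
    "CONCERN:  He was hungry\n"
    "\n"
    "CAT SAT ON A HORSE\n"
)
# Same content with Windows line endings.
windows_text = unix_text.replace("\n", "\r\n")

unix_result = mouse_paragraph(unix_text)
windows_result = mouse_paragraph(windows_text)

print(repr(unix_result))
print(repr(windows_result))
```

In both cases the match should stop at the blank line, with the line endings inside the match left as they were in the input.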

Upvotes: 1

Jacob Eggers

Reputation: 9322

You were asking for anything then two line breaks. You needed to ask for a line break followed by anything twice.

Try this one:

(\bCAT\sSAT\sON\sA\sMOUSE)(\n.*){2}
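
Here is a small Python sketch (Python is an assumption) that runs the question's pattern and the one above against the sample text. Note that {2} takes exactly two lines after the heading, so with the three-line sample paragraphs it stops after the CONCERN line. The last pattern, which repeats "line break plus a non-empty line" any number of times, is my own variant for illustration and not part of the answer above.

```python
import re

text = (
    "CAT SAT ON A DOG\n"
    "REASON:  No reason\n"
    "CONCERN:  He was cold\n"
    "BECAUSE:  Cold weather\n"
    "\n"
    "CAT SAT ON A MOUSE\n"
    "REASON:  He eats mice\n"
    "CONCERN:  He was hungry\n"
    "BECAUSE:  Can opener didn't work\n"
    "\n"
    "CAT SAT ON A HORSE\n"
    "REASON:  He wants to ride\n"
    "CONCERN:  He might fall off\n"
    "BECAUSE:  Saddle is too big\n"
)

# The question's attempt: .* cannot cross a newline, so \n{2} would have to
# come directly after the heading line.
original = re.search(r"(\bCAT\sSAT\sON\sA\sMOUSE)(.*)\n{2}", text)

# The pattern above: a line break followed by the rest of a line, twice.
two_lines = re.search(r"(\bCAT\sSAT\sON\sA\sMOUSE)(\n.*){2}", text)

# Assumed variant: repeat for as many non-empty lines as follow.
all_lines = re.search(r"(\bCAT\sSAT\sON\sA\sMOUSE)((?:\n.+)*)", text)

print(original)
print(two_lines.group(0))
print(all_lines.group(0))
```

The variant stops at the blank line on its own, because .+ needs at least one character on the line.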

Upvotes: 2
