Reputation: 10865
I need to write a regex in Perl to do the following.
The starting line contains keyword1 (like "this is keyword1"), and the ending line contains either keyword2 (like "end1 here") or keyword3 (like "end2 here"). For example, the text file may look like:
*********** this is keyword1***********
*****
..
*******apple***********
******
..
*********** this is keyword1***********
*****
..
*******orange***********
******
..
*********** this is keyword1***********
*****
..
*******orange***********
******
..
My task is to match blocks like these:
*********** this is keyword1***********
*****
..(comment: no "this is keyword1" here)
*******apple***********
or
*********** this is keyword1***********
*****
.. (comment: no "this is keyword1" here)
*******orange***********
Appreciate your help!
Upvotes: 1
Views: 309
Reputation: 118158
My previous answer missed your revised requirements. Here is the updated code:
#!/usr/bin/env perl
use 5.012;
use strict;
use warnings;
my $text = do { local $/; <DATA> };
my $pat = qr{
    (
        [^\n]*?
        keyword1
        .*?
        (?:apple|orange)
        [^\n]*?
        \n
    )
}sx;
# Start with an empty string so that say() does not warn when nothing matches.
my $result = '';
while ($text =~ /$pat/g) {
    $result .= "[[[\n$1]]]\n";
}
say $result;
__DATA__
*********** this is keyword1***********
*****
..(comment: no "this is keyword1" here)
*******apple***********
*****
..
*********** this is keyword1***********
*****
..
*******apple***********
******
..
*********** this is keyword1***********
*****
.. (comment: no "this is keyword1" here)
*******orange***********
*****
..
*********** this is keyword1***********
*****
..
*******orange***********
******
..
*********** this is keyword1***********
*****
..
*******orange***********
******
..
Output:
[[[
*********** this is keyword1***********
*****
..(comment: no "this is keyword1" here)
*******apple***********
]]]
[[[
*********** this is keyword1***********
*****
..
*******apple***********
]]]
[[[
*********** this is keyword1***********
*****
.. (comment: no "this is keyword1" here)
*******orange***********
]]]
[[[
*********** this is keyword1***********
*****
..
*******orange***********
]]]
[[[
*********** this is keyword1***********
*****
..
*******orange***********
]]]
The brackets are there to visually verify that correct blocks were matched.
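The question's comments ask that no second "this is keyword1" line appear inside a block. The pattern above does not enforce that by itself: if a start line had no end line before the next start line, the lazy .*? would run on through the second start line. One possible way to rule that out, sketched here as an assumption about the intent rather than as part of the answer, is a negative lookahead that refuses to step onto another keyword1. The helper name strict_blocks and the sample text are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns every block that starts on a keyword1 line,
# ends on an apple/orange line, and has no other keyword1 in between.
sub strict_blocks {
    my ($text) = @_;
    my $pat = qr{
        (
            ^ [^\n]* keyword1        # start line, up to and including the keyword
            (?: (?!keyword1) . )*?   # any text, but never step onto another keyword1
            (?:apple|orange)         # first end marker
            [^\n]* \n                # rest of the end line
        )
    }msx;
    my @blocks = $text =~ /$pat/g;
    return @blocks;
}

# The first start line has no end marker before the next start line,
# so only the block beginning at "B keyword1" should be reported.
my $sample = "A keyword1\nno end here\nB keyword1\nbody\n* apple *\ntail\n";
print "[[[\n$_]]]\n" for strict_blocks($sample);
```

On the made-up sample this should print a single bracketed block that begins with "B keyword1"; the unterminated "A keyword1" line is skipped.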
Upvotes: 0
Reputation: 754570
Note that in the original question, 'apple' was spelled 'end1 here' and 'orange' was spelled 'end2 here'; the first scripts below use those original markers.
#!/usr/bin/env perl
use strict;
use warnings;
my $printing = 0;
while (<>)
{
    $printing = 1 if m/this is keyword1/;
    print if $printing;
    $printing = 0 if m/end[12] here/;
}
If you want to exclude the end lines from the output, move the end test above the print. If you want to exclude the opening lines, move the start test below the print. If you can't combine the two end patterns as easily as in the example, you can simply use two lines:
$printing = 0 if m/the first end pattern/;
$printing = 0 if m/a radically different end marker/;
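As a minimal sketch of the variations described above, the same flag logic can be wrapped in a sub that decides per line whether a boundary line is kept. The helper name collect_blocks and the option names skip_start and skip_end are hypothetical, introduced only for this illustration, and the sample lines are made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns the lines from each start marker to the
# matching end marker. skip_start / skip_end are illustrative option names.
sub collect_blocks {
    my ($lines, %opt) = @_;
    my $printing = 0;
    my @kept;
    for my $line (@$lines) {
        my $is_start = $line =~ m/this is keyword1/;
        my $is_end   = $line =~ m/end[12] here/;
        $printing = 1 if $is_start;
        push @kept, $line
            if $printing
            && !($is_start && $opt{skip_start})
            && !($is_end   && $opt{skip_end});
        $printing = 0 if $is_end;
    }
    return @kept;
}

my @sample = (
    "junk\n",
    "*** this is keyword1 ***\n",
    "body\n",
    "*** end1 here ***\n",
    "more junk\n",
);
# Drop both boundary lines; only "body" should remain.
my @out = collect_blocks(\@sample, skip_start => 1, skip_end => 1);
print @out;
```

Calling collect_blocks(\@sample) with no options keeps the start and end lines, which corresponds to the original loop.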
For the sample data, the output is:
*********** this is keyword1***********
*****
..
*******end1 here***********
*********** this is keyword1***********
*****
..
*******end1 here***********
*********** this is keyword1***********
*****
..
*******end2 here***********
One simple way to meet the revised output requirement is to accumulate lines into a string while the flag is set (it is renamed from $printing to $saving here):
my $saving = 0;
my $result;
while (<>)
{
    $saving = 1 if m/this is keyword1/;
    $result .= $_ if $saving;
    $saving = 0 if m/end[12] here/;
}
However, this doesn't slurp the whole file into memory, nor does it use m//g, so it doesn't use the mechanisms specified in the revised requirements.
With the revised requirements, I think this code does more or less what you want:
#!/usr/bin/env perl
use strict;
use warnings;
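my $file;
{
    local $/;    # undefine the record separator to read the whole file at once
    $file = <>;
}
my $result = '';    # empty string avoids a warning if nothing matches
while ($file =~ m/(^[^\n]*this is keyword1.*?end[12] here[^\n]*$)/gms)
{
    print "Found: $1\n";
    $result .= "$1\n";
}
print "Overall set of matched material:\n";
print $result;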
Clearly, you can omit the printing in the loop if you don't want each paragraph as it is found. Note the use of the non-greedy .*? to stop the scan at the first end marker rather than the last, and the uses of ^ and $ along with the /m (multi-line) modifier to pick up the whole lines. The /s modifier lets . match newlines, so the scan can cross line boundaries.
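To make the effect of the non-greedy quantifier concrete, here is a small sketch that counts matches for the pattern above and for a greedy variant of it. The helper name count_blocks and the sample text are hypothetical, chosen only for this comparison.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Two blocks in one string; the sample text is made up for illustration.
my $text = join '',
    "x this is keyword1 x\n", "a\n", "x end1 here x\n",
    "noise\n",
    "y this is keyword1 y\n", "b\n", "y end2 here y\n";

# Hypothetical helper: counts how often a compiled pattern matches.
sub count_blocks {
    my ($string, $pattern) = @_;
    my $count = () = $string =~ /$pattern/g;
    return $count;
}

# Same pattern twice; only the quantifier between the markers differs.
my $lazy   = qr/^[^\n]*this is keyword1.*?end[12] here[^\n]*$/ms;
my $greedy = qr/^[^\n]*this is keyword1.*end[12] here[^\n]*$/ms;

printf "non-greedy: %d block(s)\n", count_blocks($text, $lazy);
printf "greedy:     %d block(s)\n", count_blocks($text, $greedy);
```

With this sample string the non-greedy pattern should report two blocks and the greedy one a single block, because the greedy .* runs on to the last end marker in the input.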
The output on the sample data is:
Found: *********** this is keyword1***********
*****
..
*******end1 here***********
Found: *********** this is keyword1***********
*****
..
*******end1 here***********
Found: *********** this is keyword1***********
*****
..
*******end2 here***********
Overall set of matched material:
*********** this is keyword1***********
*****
..
*******end1 here***********
*********** this is keyword1***********
*****
..
*******end1 here***********
*********** this is keyword1***********
*****
..
*******end2 here***********
With the markers from the revised question (apple and orange) substituted for 'end1 here' and 'end2 here', the same script reads:
#!/usr/bin/env perl
use strict;
use warnings;
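my $file;
{
    local $/;    # undefine the record separator to read the whole file at once
    $file = <>;
}
my $result = '';    # empty string avoids a warning if nothing matches
# The alternation is non-capturing because only $1 (the whole block) is used.
while ($file =~ m/(^[^\n]*this is keyword1.*?(?:apple|orange)[^\n]*$)/gms)
{
    print "Found: $1\n";
    $result .= "$1\n";
}
print "Overall set of matched material:\n";
print $result;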
Sample data:
*********** this is keyword1***********
*****
..
*******orange***********
******
..
*********** this is keyword1***********
*****
..
*******orange***********
******
..
*********** this is keyword1***********
*****
..
*******apple***********
******
Sample output:
Found: *********** this is keyword1***********
*****
..
*******orange***********
Found: *********** this is keyword1***********
*****
..
*******orange***********
Found: *********** this is keyword1***********
*****
..
*******apple***********
Overall set of matched material:
*********** this is keyword1***********
*****
..
*******orange***********
*********** this is keyword1***********
*****
..
*******orange***********
*********** this is keyword1***********
*****
..
*******apple***********
Upvotes: 1