Reputation: 3494
I need to extract several sections from a multiline string with Perl. I'm applying the same regex in a while loop. My problem is getting the last section, which ends at the end of the file. My workaround is to append the marker, so that the regex always finds an end. Is there a better way to do it?
Example file:
Header
==== /home/src/file1.c#1 ====
content file1
line 1 of file1
line 2 of file1
line 3 of file1
another line of file1
==== /home/src/file2.c#1 ====
content file2
line 1 of file2
line 2 of file2
line 3 of file2
another line of file2
Perl script:
#!/usr/bin/env perl
my $desc = do { local $/ = undef; <> };
$desc .= "\n===="; # set the end marker
while($desc =~ /^==== (?<filename>.*?)#.*?====$(?<content>.*?)(?=^====)/mgsp) {
    print "filename=", $+{filename}, "\n";
    print "content=", $+{content}, "\n";
}
This way the script finds both segments. How can I avoid adding the marker?
Upvotes: 2
Views: 1144
Reputation: 126722
You've made this more awkward by slurping the whole file in the first place. This is relatively simple if you read the file line by line:
use strict;
use warnings 'all';

my $file;

while ( <> ) {
    if ( /^====\s+(.*\S)#\S*\s+====/ ) {
        $file = $1;
        print "filename=$file\n";
        print 'content=';
    }
    elsif ( $file ) {
        print;
    }
}
Output:
filename=/home/src/file1.c
content=content file1
line 1 of file1
line 2 of file1
line 3 of file1
another line of file1
filename=/home/src/file2.c
content=content file2
line 1 of file2
line 2 of file2
line 3 of file2
another line of file2
Alternatively, if you need to store the whole content of each file, perhaps in a hash, it would look like this:
use strict;
use warnings 'all';

my $file;
my %data;

while ( <> ) {
    if ( /^====\s+(.*\S)#\S*\s+====/ ) {
        $file = $1;
    }
    elsif ( $file ) {
        $data{$file} .= $_;
    }
}

for my $file ( sort keys %data ) {
    print "filename=$file\n";
    print "content=$data{$file}";
}
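For this input the output is identical to that of the first version above, although the sections are printed in sorted filename order rather than in the order they appear in the file.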
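If the original section order matters, one option is to record each filename in an array the first time it is seen and iterate over that array instead of over sorted hash keys. The following is only a sketch under that assumption: the helper name read_sections, the sample filenames zeta.c and alpha.c, and the in-memory filehandle are illustrative and not part of the answer above. It also tests the current filename with defined, so a file literally named 0 would not be skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: reads sections from a filehandle and returns
# (\@order, \%data), where @order lists filenames in first-seen order.
sub read_sections {
    my ($fh) = @_;
    my ( $file, @order, %data );

    while ( my $line = <$fh> ) {
        if ( $line =~ /^====\s+(.*\S)#\S*\s+====/ ) {
            $file = $1;
            # Record each filename once, in the order it first appears
            push @order, $file unless exists $data{$file};
            $data{$file} //= '';
        }
        elsif ( defined $file ) {
            # Lines before the first marker (the header) are skipped
            $data{$file} .= $line;
        }
    }

    return ( \@order, \%data );
}

# Sample input (made up for illustration); sorted order would differ
my $input = "Header\n"
          . "==== /home/src/zeta.c#1 ====\nz line\n"
          . "==== /home/src/alpha.c#3 ====\na line\n";

open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";
my ( $order, $data ) = read_sections($fh);

print "filename=$_\ncontent=$data->{$_}" for @$order;
```

With this sample, zeta.c is printed before alpha.c, whereas sort keys would reverse them.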
Upvotes: 1
Reputation: 385789
Use of the non-greediness modifier ? is a giant red flag. You can usually get away with using it once in a pattern, but more than that is usually a bug. If you want to match text that doesn't contain a string, use the following instead:
(?:(?!STRING).)*
So that gets you the following:
/
    ^==== [ ] (?<filename> [^\n]+ ) [ ] ====\n
    (?<content> (?:(?! ^==== ).)* )
/xsmg
Code:
my $desc = do { local $/; <DATA> };

while (
    $desc =~ /
        ^==== [ ] (?<filename> [^\n]+ ) [ ] ====\n
        (?<content> (?:(?! ^==== ).)* )
    /xsmg
) {
    print "filename=<<$+{filename}>>\n";
    print "content=<<$+{content}>>\n";
}

__DATA__
Header
==== /home/src/file1.c#1 ====
content file1
line 1 of file1
line 2 of file1
line 3 of file1
another line of file1
==== /home/src/file2.c#1 ====
content file2
line 1 of file2
line 2 of file2
line 3 of file2
another line of file2
Output:
filename=<</home/src/file1.c#1>>
content=<<content file1
line 1 of file1
line 2 of file1
line 3 of file1
another line of file1
>>
filename=<</home/src/file2.c#1>>
content=<<content file2
line 1 of file2
line 2 of file2
line 3 of file2
another line of file2
>>
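Note that the pattern above leaves the #1 revision suffix in the captured filename, whereas the question's own regex stops the filename at the #. A possible variation, offered only as a sketch, excludes # from the filename capture and consumes the suffix separately. The helper name extract_sections and the shortened sample text are assumptions made for illustration. Because the /x flag treats an unescaped # as the start of a comment, the # is escaped here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [filename, content] pairs.
sub extract_sections {
    my ($text) = @_;
    my @sections;

    while (
        $text =~ /
            ^==== [ ] (?<filename> [^\n\#]+ ) \# [^\n]* [ ] ====\n
            (?<content> (?: (?! ^==== ) . )* )
        /xsmg
    ) {
        push @sections, [ $+{filename}, $+{content} ];
    }

    return @sections;
}

# Shortened sample input; there is no end marker after the last section
my $sample = join '',
    "Header\n",
    "==== /home/src/file1.c#1 ====\n",
    "content file1\n",
    "==== /home/src/file2.c#1 ====\n",
    "content file2\n";

for my $section ( extract_sections($sample) ) {
    my ( $name, $content ) = @$section;
    print "filename=$name\n";
    print "content=$content";
}
```

The content capture stops either just before the next line that starts with ==== or at the end of the string, so the last section is matched without appending a marker.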
Upvotes: 4