Reputation: 4325

Why do I get the first capture group only?

(https://stackoverflow.com/a/2304626/6607497 and https://stackoverflow.com/a/37004214/6607497 did not help me)

While analyzing a problem with /proc/stat on Linux, I started writing a small utility, but I can't get the capture groups the way I want. Here is the code:

#!/usr/bin/perl
use strict;
use warnings;

if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}

For example with these input lines I get the output:

> cat /proc/stat
cpu  2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106  ...

So the match actually works, but I don't get all the captured values into @vals (tested with Perl 5.18.2 and 5.26.1).

Upvotes: 3

Answers (7)

zdim

Reputation: 66881

Only the last of the repeated matches from a single pattern is captured.

Instead, you can just split the line and then check -- and adjust -- the first field:

while (<$fh>) {
    my ($cpu, @vals) = split;
    next if not $cpu =~ s/^cpu//;
    print "$cpu $#vals\n";
}

If the first element of the split's return doesn't start with cpu, the regex substitution fails and so the line is skipped. Otherwise, you get the number following cpu (or an empty string), as in the OP.^†

Or, you can use the particular structure of the line being processed:

while (<$fh>) {
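    # split on an empty string returns an empty list, so pass an empty capture through as is
    if (my ($cpu, @vals) = map { length($_) ? split(' ', $_) : $_ } /^cpu([0-9]*) \s+ (.*)/x) {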
        print "$cpu $#vals\n";
    }
}

The regex returns two items and each is split in the map, except that the first one is just passed as is into $cpu (being either a number or an empty string), while the other yields the numbers.

Both these produce the needed output in my tests.

^† Since we always check for ^cpu (and remove it), it makes sense to do that first and only then split, when needed. However, that gets a little tricky, for the following reason.

A bare split skips leading whitespace (and drops trailing empty fields) by default, so for lines where the cpu string has no trailing digits (cpu 2709779...) we would end up with the next number where the cpu designation should be! A quiet error.

Thus we need to tell split explicitly to split on /\s+/, since leading whitespace then counts as a separator and produces an empty first field:

while (<$fh>) {
    next if not s/^cpu//;
    my ($cpu, @vals) = split /\s+/;  # now $cpu may be an empty string
    print "$cpu $#vals\n";
}

This now works as intended: for the cpu line without trailing digits, $cpu gets an empty string, a case to handle but a clear one. But this is misleading, and an unaware maintainer -- or us, the proverbial six months later -- may be tempted to remove the seemingly "unneeded" /\s+/, introducing an error.
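
The difference between the two forms of split described in this footnote can be checked in isolation. The following is only a minimal sketch: the sample lines are shortened copies of the question's data with the cpu prefix already removed, and the helper names first_field_bare and first_field_regex are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened sample lines in /proc/stat format, "cpu" prefix already removed.
my @lines = ("  2709779 13999 551920\n", "0 677679 3082 124900\n");

# Hypothetical helper: first field using the default (awk-like) split.
sub first_field_bare {
    my ($line) = @_;
    my ($first) = split ' ', $line;    # like a bare split: leading whitespace is skipped
    return $first;
}

# Hypothetical helper: first field using an explicit /\s+/ pattern.
sub first_field_regex {
    my ($line) = @_;
    my ($first) = split /\s+/, $line;  # leading whitespace yields an empty first field
    return $first;
}

for my $line (@lines) {
    printf "bare: '%s'  /\\s+/: '%s'\n",
        first_field_bare($line), first_field_regex($line);
}
```

For the first sample line the default split returns the first counter, while the /\s+/ form returns an empty string; for the second line both return 0.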

Upvotes: 6

hoffmeister

Reputation: 612

Here's my example. I thought I'd add it because I like simple code. It also allows "cpu7" with no trailing digits.

#!/usr/bin/perl
use strict;
use warnings;

my $file = "/proc/stat";
open(my $fh, "<", $file) or die "$file: $!\n";
while (<$fh>) 
{
  if ( /^cpu(\d+)(\s+)?(.*)$/ ) 
  {
    my $cpu = $1; 
    my $vals = scalar split( /\s+/, $3 ) ;
    print "$cpu $vals\n";
  }
}
close($fh);

Upvotes: 0

brian d foy

Reputation: 132822

In an exercise for Learning Perl, we state a problem that's easy to solve with two simple regexes but hard with one (but then in Mastering Perl I pull out the big guns). We don't tell people this because we want to highlight the natural tendency to try to write everything in a single regex. Some of the contortions in other answers remind me of that, and I wouldn't want to maintain any of them.

First, there's the issue of only processing the interesting lines. Then, once we have that line, grab all the numbers. Translating that problem statement into code is very simple and straightforward. No acrobatics here because assertions and anchors do most of the work:

use v5.14;  # say needs v5.10; the /a regex modifier needs v5.14

while( <DATA> ) {
    next unless /\A cpu(\d*) \s /ax;
    my $cpu = $1;
    my @values = / \b (\d+) \b /agx;
    say "$cpu " . @values;
    }

__END__
cpu  2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106  ...

Note that the OP still has to decide how to handle the cpu case with no trailing digits. Don't know what you want to do with the empty string.
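
As one possible way to settle that, here is a minimal sketch that maps the empty string to a label for the aggregate line. The helper name cpu_label and the label 'all' are assumptions made up for this illustration; /proc/stat itself does not define either, and the sample lines are shortened.

```perl
use v5.14;      # the /a regex modifier needs Perl 5.14 or later
use warnings;

# Hypothetical helper: turn the captured id into a display label.
# length() is used rather than truthiness so that cpu "0" is not
# mistaken for the empty (aggregate) case.
sub cpu_label {
    my ($id) = @_;
    return length($id) ? "cpu$id" : 'all';   # 'all' is an assumed name
}

for my $line ("cpu  2709779 13999 551920", "cpu1 775182 3866 147044") {
    next unless $line =~ /\A cpu(\d*) \s /ax;
    my $label  = cpu_label($1);
    my @values = $line =~ / \b (\d+) \b /agx;
    print "$label: ", scalar(@values), " values\n";
}
```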

Upvotes: 2

Georg Mavridis

Reputation: 2341

Just adding to Tim's answer:

You can capture multiple values with one group (using the g-modifier), but then you have to split the statement.

    if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+))+$/) {
        my @vals= /(?:\s+(\d+))/g;
        print "$cpu $#vals\n";
    }

Upvotes: -1

pii_ke

Reputation: 2891

Going by the example input, the following code inside the while loop should work.

if (/^cpu(\d*)/) {
    my $cpu = $1;
    my (@vals) = /\s+(\d+)/g;  # no + on the group, so /g returns every number
    print "$cpu $#vals\n";
}

Upvotes: 2

U. Windl

Reputation: 4325

Replacing

    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {

with

    while (<$fh>) {
        my @vals;
        if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(@vals, $^N) }))+$/) {

does what I wanted (requires perl 5.8 or newer).

Upvotes: 0

Tim Biegeleisen

Reputation: 521457

Perl's regex engine will only remember the last capture of a repeated group. If you want to capture each number in a separate capture group, then one option would be to use an explicit regex pattern:

if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}
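
To see the "last capture only" behavior in isolation, here is a minimal sketch on one shortened sample line. It contrasts the question's repeated group with a two-step match using /g; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $line = "cpu0 677679 3082 124900";

# One match with a repeated group: the group keeps only its final iteration.
my ($cpu, @vals) = $line =~ /^cpu(\d*)(?:\s+(\d+))+$/;
print "repeated group: cpu='$cpu' vals=(@vals)\n";

# Two steps: take the id first, then collect every number with /g.
my ($id) = $line =~ /^cpu(\d*)\s/;
my @all  = $line =~ /\s+(\d+)/g;
print "global match:   cpu='$id' vals=(@all)\n";
```

The first print shows a single value, the last number on the line, while the second shows all three.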

Upvotes: 1
