Reputation: 307

Why is this regex returning more groups than it should?

I was going through a popular book on regex and found this piece of regex, which is supposed to pick out values from a line containing comma-separated values.

It is supposed to handle double-quoted fields, with "" treated as an escaped double quote (the sequence "" is allowed inside a pair of double quotes).

Here's the Perl script I wrote for this:

$str = "Ten Thousand,10000, 2710 ,,\"10,000\",\"It's \"\"10 Grand\"\", baby\",10K";
#$regex = qr"(?:^|,)(?:\"((?:[^\"]|\"\")+)\"|([^\",]+))*";
$regex = qr!
        (?: ^|,)
        (?: 
            "
                ( (?: [^"] | "" )+ )
            "
            |
            ( [^",]+ )
        )
    !x;

@matches = ($str =~ m#$regex#g);
print "\nString : $str\n";
if (scalar(@matches) > 0 ) {
    print "\nMatches\n";
    print "\nNumber of groups: ", scalar(@matches), "\n";
    for ($i=0; $i < scalar(@matches); $i++) {
        print "\nGroup $i - |$matches[$i]|\n";
    }
}
else {
    print "\nDoesnt match\n";
}

This is the output I'm expecting (which is also what's expected by the author, as far as I can make out):

String : Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K
   Matches
   Number of groups: 7
   Group 0 - |Ten Thousand|
   Group 1 - |10000|
   Group 2 - | 2710 |
   Group 3 - |10,000|
   Group 4 - ||
   Group 5 - |It's ""10 Grand"", baby|
   Group 6 - |10K|

This is the output I'm actually getting:

String : Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K
   Matches
   Number of groups: 12
   Group 0 - ||
   Group 1 - |Ten Thousand|
   Group 2 - ||
   Group 3 - |10000|
   Group 4 - ||
   Group 5 - | 2710 |
   Group 6 - |10,000|
   Group 7 - ||
   Group 8 - |It's ""10 Grand"", baby|
   Group 9 - ||
   Group 10 - ||
   Group 11 - |10K|

Could someone please explain why there are empty groups in the actual output (apart from the one before 10,000, which is expected)? I copied the regex directly from the book, so is there something else I'm doing wrong?

TIA

Upvotes: 3

Answers (3)

Ron Bergin

Reputation: 1068

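That regex has 2 capturing groups and several non-capturing groups. When you applied the regex to the string, you used the g modifier, which tells it to keep matching as many times as it can. In list context, each match returns the contents of both capturing groups, and the group that did not take part in that match comes back as undef (printed as an empty string). In this case the pattern matched 6 times, each time returning the 2 captured groups, for a total of 12 elements in the array.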
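To make the counting concrete, here is a minimal sketch (not from the book) that runs the same pattern two ways: once in list context, and once in a while loop that keeps only the group that actually matched. The variable names @all and @fields are just for illustration.

```perl
use strict;
use warnings;

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);

my $regex = qr!
    (?: ^ | , )
    (?:
        " ( (?: [^"] | "" )+ ) "    # group 1: quoted field
        |
        ( [^",]+ )                  # group 2: unquoted field
    )
!x;

# In list context, m//g returns every group of every match,
# so the list holds 2 entries per match (the unused one is undef).
my @all = $str =~ /$regex/g;
print "Elements: ", scalar(@all), "\n";

# Walking the matches one at a time lets us keep only the group that took part.
my @fields;
while ($str =~ /$regex/g) {
    push @fields, defined $1 ? $1 : $2;
}
print "Fields: ", scalar(@fields), "\n";
print "|$_|\n" for @fields;
```

This should report 12 elements and 6 fields. Note that the empty field between the two adjacent commas is not matched at all, because both alternatives require at least one character.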

The regular expression:

(?-imsx:!
        (?: ^|,)

        (?:

            "

                ( (?: [^"] | "" )+ )

            "

            |

            ( [^",]+ )
        )
    !x)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  !                        '!\n        '
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
    ^                        the beginning of the string
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ,                        ','
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
                           '\n\n        '
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
                  "          '\n\n            "\n\n                '
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
      (?:                      group, but do not capture (1 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
                                 ' '
----------------------------------------------------------------------
        [^"]                     any character except: '"'
----------------------------------------------------------------------
                                 ' '
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
         ""                      ' "" '
----------------------------------------------------------------------
      )+                       end of grouping
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
                  "          '\n\n            "\n\n            '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
                             '\n\n            '
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
      [^",]+                   any character except: '"', ',' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
                               ' '
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
                             '\n        '
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
       !x                  '\n    !x'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

TLP already mentioned you could also use the Text::CSV module. Here's that example.

#!/usr/bin/perl

use strict;
use warnings;
use Text::CSV_XS;
use Data::Dumper;

my $csv = Text::CSV_XS->new({binary => 1, eol => $/, allow_whitespace => 1});

while (my $row = $csv->getline (*DATA)) {
    print Dumper $row;
}

__DATA__
Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K;

Outputs:

$VAR1 = [
          'Ten Thousand',
          '10000',
          '2710',
          '',
          '10,000',
          'It\'s "10 Grand", baby',
          '10K;'
        ];

Upvotes: 2

user557597

Reputation:

I concur with @RonBergin. Every capture group is returned for every match,
whether or not it took part in that match. So 2 capture groups times 6 matches
produces an array of 12 elements.

It looks like you also want to trim whitespace. The way to merge the two capture
groups into one is a Branch Reset, (?|...), which numbers the groups in each
alternative from the same starting point, so both alternatives capture into group 1.

I don't want to change your regex outright; however, the example below uses
a branch reset along with a few additions that make it more robust.

 # (?:^|,)(?|\s*"((?:[^"]|"")*)"\s*|\s*([^",]*?)\s*)(?=,|$)

 (?: ^ | , )                     # BOL or comma
 (?|                             # Start Branch Reset
      \s* 
      "
      (                               # (1 start), Quoted content
           (?: [^"] | "" )*
      )                               # (1 end)
      "
      \s* 
   |  
      \s*                             # Whitespace trim
      ( [^",]*? )                     # (1), Optional Non-quoted content
      \s*                             # Whitespace trim
 )                               # End Branch Reset
 (?= , | $ )                     # Lookahead for comma or EOL
                                 # (needed because content is optional)
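As a rough sketch of how that pattern could be used, the script below applies the one-line form to the string from the question. Branch reset needs Perl 5.10 or later; the variable names are only illustrative.

```perl
use strict;
use warnings;
use 5.010;    # branch reset (?|...) needs Perl 5.10 or later

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);

# The one-line form of the pattern above. Both alternatives capture into $1.
my $regex = qr/(?:^|,)(?|\s*"((?:[^"]|"")*)"\s*|\s*([^",]*?)\s*)(?=,|$)/;

# Only one group exists, so list-context m//g yields one element per field.
my @fields = $str =~ /$regex/g;

say scalar(@fields), " fields";
say "|$_|" for @fields;
```

With this pattern the list should hold 7 elements: the empty field is included, and " 2710 " comes back trimmed as "2710". The doubled quotes inside the last quoted field are still left as "", so they would need a separate substitution if you want them unescaped.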

Upvotes: 1

TLP

Reputation: 67918

You might find the Perl 5 core module Text::ParseWords useful. It does everything you are trying to do in just a few lines of code. Also note that you can use q() and qq() in place of single and double quotes, so that you do not have to escape the quotes inside the string. Like most Perl quote-like operators, they accept pretty much any punctuation character as the delimiter.

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

my $str = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K);
my @words = quotewords(',', 1, $str);
print Dumper \@words;

Output:

$VAR1 = [
          'Ten Thousand',
          '10000',
          ' 2710 ',
          '',
          '"10,000"',
          '"It\'s ""10 Grand"", baby"',
          '10K'
        ];

(Note: The escaped single quote in It\'s is from Data::Dumper)

If your data is proper CSV, you can use Text::CSV instead.

Upvotes: 1
