petom

Reputation: 51

Perl regex captures non-capturing groups

I am using Perl to parse a CSV file, with a regex rather than a library. I know there are CSV parser libraries available, more than one in fact, but I decided I want to use a regex.

I created what I think is a quite nice, working regular expression for this. I originally wrote other applications that take only a regular expression to parse files, and I wanted to reuse it for this case.

I want to use the beauty of Perl and put it in one line:

my $text = '"",hi there,"","2018-04-23,\" 13:14:53",,hostname,mac,"ipaddress",199';

my @data = $text =~ m/(?:^|,)(?:"(|.*?[^\\])"|([^,]*))(?:|$)/g;

However, when I do that as a one-liner, the Perl regex seems to capture even the non-capturing groups.

Here is the test code:

my $text = '"",hi there,"","2018-04-23,\" 13:14:53",,hostname,mac,"ipaddress",199';

my @data = $text =~ m/(?:^|,)(?:"(|.*?[^\\])"|([^,]*))(?:|$)/g;
foreach (@data) { print "a --${_}--\n"; }

while ($text =~ m/(?:^|,)(?:"(|.*?[^\\])"|([^,]*))(?:|$)/cg) {
    print "b --${1}${2}--\n";
}

The results of the "a" dump are:

a ----
a ----
a ----
a --hi there--
a ----
a ----
a --2018-04-23,\" 13:14:53--
a ----
a ----
a ----
a ----
a --hostname--
a ----
a --mac--
a --ipaddress--
a ----
a ----
a --199--

You can see extra empty lines there, as opposed to the correct results from the "b" dump:

b ----
b --hi there--
b ----
b --2018-04-23,\" 13:14:53--
b ----
b --hostname--
b --mac--
b --ipaddress--
b --199--

Has anybody run into this issue? Thank you for your answers, ideas, or bug findings.

Upvotes: 1

Views: 1574

Answers (1)

petom

Reputation: 51

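As soon as I posted my question I realised that the issue is not the non-capturing groups but the two capturing groups: for each match only one of them has a value, and the other is undefined. In list context with /g, the match returns every capture group of every match, so each field produces two entries, one of which prints as an empty string.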

The culprit is this section of the regex:

(?:"(|.*?[^\\])"|([^,]*))
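To make the effect easier to see, here is a minimal sketch on a shorter, hypothetical input (the string `"a",b` and the variable names are my own illustration, not part of the original test). It uses the same pattern and prints undefined entries explicitly instead of as empty strings:

```perl
use strict;
use warnings;

# Hypothetical two-field input: one quoted field, one bare field.
my $text = '"a",b';

# Same pattern as in the question: two capturing groups inside one alternation.
my @data = $text =~ m/(?:^|,)(?:"(|.*?[^\\])"|([^,]*))(?:|$)/g;

# Every match contributes both $1 and $2; the group that did not
# take part in the match comes back as undef.
print scalar(@data), " entries\n";
for my $item (@data) {
    print defined $item ? "--$item--\n" : "--(undef)--\n";
}
```

Two fields give four entries here: the quoted field fills the first group and leaves the second undefined, and the bare field does the opposite.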

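Everything worked once I replaced the non-capturing group with a branch-reset group, in which both alternatives number their captures from the same starting point, so either branch fills the same group: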

(?|"(|.*?[^\\])"|([^,]*))

So the final, working one-liner is:

my @data = $text =~ m/(?:^|,)(?|"(|.*?[^\\])"|([^,]*))(?:|$)/g;
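For anyone who wants to try it, here is a small self-contained sketch that runs this one-liner against the test string from the question. The variable name `@fields` is my own choice, and the branch-reset syntax `(?|...)` needs Perl 5.10 or later:

```perl
use strict;
use warnings;

# Test string from the question (single-quoted, so \" stays a literal backslash-quote).
my $text = '"",hi there,"","2018-04-23,\" 13:14:53",,hostname,mac,"ipaddress",199';

# (?|...) is a branch-reset group: both alternatives store their
# capture in group 1, so each match returns exactly one value.
my @fields = $text =~ m/(?:^|,)(?|"(|.*?[^\\])"|([^,]*))(?:|$)/g;

print scalar(@fields), " fields\n";
print "--$_--\n" for @fields;
```

This should yield nine entries, one per field, in the same order as the "b" dump in the question.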

Hopefully someone will find this information useful.

Upvotes: 2
