user
user

Reputation: 25

How to get substring of the line enclosed in double quotes

I have an input string :

ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6

If I use split function, I'm getting weird output.

my ($field1, $field2, $field3, $field4) = "";
while (<DATAFILE>) {
    $row = $_;
    $row =~ s/\r?\n$//;
    ($field1, $field2, $field3, $field4) = split(/,/, $row);
}

output I am getting is:

field1 :: ACC000121
field2 :: 2290
field3 :: "01009900
field4 :: 01009901

Expected output:

field1 = ACC000121
field2 = 2290
field3 = 01009900,01009901,01009902,01009903,01009904
field4 = 4
field5 = 5
field6 = 6

I am quite weak in Perl. Please help me

Upvotes: 0

Views: 134

Answers (2)

Jonathan Leffler
Jonathan Leffler

Reputation: 754010

I agree with Matt Jacob's answer — you should parse CSV with Text::CSV unless you've got a very good reason not to do so.

If you're going to deal with it using regular expressions, I think you'll do better with m// than split. For example, this seems to cover most single line CSV data variants, though it does not remove the quotes around a quoted field as Text::CSV would — that requires a separate post-processing step.

use strict;
use warnings;

sub splitter
{
    my($row) = @_;
    my @fields;
    my $i = 0;

    while ($row =~ m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/g)
    {
        print "Found [$1]\n";
        $fields[$i++] = $1;
    }

    for (my $j = 0; $j < @fields; $j++)
    {
        print "$j = [$fields[$j]]\n";
    }
}

my $row;

$row = q'ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6';
print "Row 1: $row\n";
splitter($row);

$row = q'ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""';
print "Row 2: $row\n";
splitter($row);

Obviously, that has a fair amount of diagnostic code in it. The output (from Perl 5.22.0 on Mac OS X 10.11.1) is:

Row 1: ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
Found [ACC000121]
Found [2290]
Found ["01009900,01009901,01009902,01009903,01009904"]
Found [4]
Found [5]
Found [6]
0 = [ACC000121]
1 = [2290]
2 = ["01009900,01009901,01009902,01009903,01009904"]
3 = [4]
4 = [5]
5 = [6]
Row 2: ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""
Found [ACC000121]
Found [","]
Found [2290]
Found ["01009900,""aux data"",01009902,01009903,01009904"]
Found []
Found [5"abc"]
Found [6]
Found [""]
0 = [ACC000121]
1 = [","]
2 = [2290]
3 = ["01009900,""aux data"",01009902,01009903,01009904"]
4 = []
5 = [5"abc"]
6 = [6]
7 = [""]

In the Perl code, the match is:

m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/

This looks for and captures (in $1) either an empty field followed by a comma, or for something other than a double quote followed by zero or more non-commas, or for a double quote followed by a sequence of zero or more occurrences of "not a double quote or two consecutive double quotes" and another double quote; it then expects either a comma or end of string.

Handling multi-line fields requires a little more work. Removing the escaping double quotes also requires a little more work.

Using Text::CSV is much simpler and much less error prone (and it can handle more variants than this can).

Upvotes: 0

Matt Jacob
Matt Jacob

Reputation: 6553

If you have CSV data, you really want to use Text::CSV to parse it. As you've discovered, parsing CSV data is usually not as trivial as just splitting on commas, and Text::CSV can handle all the edge cases for you.

use strict;
use warnings;

use Data::Dump;
use Text::CSV;

my $csv = Text::CSV->new;

while (<DATA>) {
    $csv->parse($_);
    my @fields = $csv->fields;
    dd(\@fields);
}

__DATA__
ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6

Output:

[
  "ACC000121",
  2290,
  "01009900,01009901,01009902,01009903,01009904",
  4,
  5,
  6,
]

Upvotes: 4

Related Questions