asf107

Reputation: 1146

Parsing strings with quote characters inside fields

Suppose I want to parse the file

$ cat toParse.txt
1 2 3 4 5
1 "2 3" 4 5
1 2" 3 " 4 5 

The first two lines are easy to parse: Text::CSV can handle them. For instance, I tried:

use strict; 
use Text::CSV; 
while (<>) {
    chomp $_; 
    my $csv = Text::CSV->new({ sep_char => ' ', quote_char => '"' , binary => 1});
    $csv->parse($_); 
    my @fields = $csv->fields(); 
    my $badArg = $csv->error_input(); 
    print "fields[1] = $fields[1]\n"; 
    print "Bad argument: $badArg\n\n"; 
}

However, Text::CSV gets confused when the quote character appears partway through a field.

The above program prints out:

fields[1] = 2
Bad argument:

fields[1] = 2 3
Bad argument:

fields[1] =
Bad argument: 1 2" 3 " 4 5

Does anyone have any suggestions? I'd like the final fields[1] to be populated with 2" 3 " ... in other words, I want to split the line on any whitespace that is not contained in a quoted string.

Upvotes: 0

Views: 855

Answers (2)

Ωmega

Reputation: 43703

What you want is not CSV, so you need to code your own parsing.

This should work for your particular case:

use strict;

while (<DATA>) { 
    chomp $_;
    my @fields = /([^\s"]+|(?:[^\s"]*"[^"]*"[^\s"]*)+)(?:\s|$)/g;
    print "$_\n" for @fields;
    print "\n";
}

__DATA__
1 2 3 4 5
1 "2 3" 4 5
1 2" 3 " 4 5 
1 2" 3 "4 5 
1 2" 3 "4" 5" 6
1 2" 3 "4"" 5"" 6

...and its output is:

1
2
3
4
5

1
"2 3"
4
5

1
2" 3 "
4
5

1
2" 3 "4
5

1
2" 3 "4" 5"
6

1
2" 3 "4""
5""
6

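Since the pattern is dense, here is a sketch of the same regex written with the /x flag so each part can carry a comment. The pattern is meant to be identical to the one above; the sub name split_fields is made up for this illustration and is not part of any module.

```perl
use strict;
use warnings;

# Same pattern as above, spelled out with /x so each part can carry a comment.
my $field = qr{
    (                   # capture one field, which is either
        [^\s"]+         #   a run of characters with no whitespace or quotes
      |                 # or
        (?:             #   one or more chunks made of
            [^\s"]*     #     optional unquoted text,
            " [^"]* "   #     a quoted segment that may hold whitespace,
            [^\s"]*     #     optional unquoted text
        )+
    )
    (?: \s | $)         # the field must end at whitespace or end of line
}x;

# split_fields is a hypothetical name used only for this sketch.
sub split_fields {
    my ($line) = @_;
    return $line =~ /$field/g;   # list context: one capture per field
}

print join('|', split_fields('1 2" 3 " 4 5 ')), "\n";   # 1|2" 3 "|4|5
```

The quote characters stay in the captured fields, which is what the question asks for. The pattern assumes the quotes on each line are balanced.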

Upvotes: 1

Tony Hopkinson

Reputation: 20330

Change quote_char to something other than " and the third line would be

1
2"
3
"
4
5

However the second line will now be

1 
"2
3"
4
5

So you would appear to have one line where " is the quote delimiter and one where it isn't.

So the file you are parsing is broken, and you are going to have to get clever.
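One way to "get clever" without Text::CSV is to walk the line character by character and treat " as a toggle, splitting only on whitespace seen while the toggle is off. The following is a minimal sketch under that assumption, not a tested replacement for a real parser; split_outside_quotes is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace that lies outside double quotes.
# Quote characters are kept as part of the field.
sub split_outside_quotes {
    my ($line) = @_;
    my @fields;
    my $current  = '';
    my $in_quote = 0;

    for my $char (split //, $line) {
        if ($char eq '"') {
            $in_quote = !$in_quote;    # toggle quoted state, keep the quote
            $current .= $char;
        }
        elsif ($char =~ /\s/ && !$in_quote) {
            # unquoted whitespace ends the current field, if there is one
            push @fields, $current if length $current;
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    push @fields, $current if length $current;    # flush the last field
    return @fields;
}

print join('|', split_outside_quotes('1 2" 3 " 4 5 ')), "\n";   # 1|2" 3 "|4|5
```

If a line has an unbalanced quote, everything after it ends up in a single field, so you may want to check $in_quote at the end and report such lines.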

Upvotes: 0
