Reputation: 1146
Suppose I want to parse the file
$ cat toParse.txt
1 2 3 4 5
1 "2 3" 4 5
1 2" 3 " 4 5
The first two lines are easy to parse: Text::CSV can handle them. For instance, I tried:
use strict;
use Text::CSV;

while (<>) {
    chomp $_;
    my $csv = Text::CSV->new({ sep_char => ' ', quote_char => '"', binary => 1 });
    $csv->parse($_);
    my @fields = $csv->fields();
    my $badArg = $csv->error_input();
    print "fields[1] = $fields[1]\n";
    print "Bad argument: $badArg\n\n";
}
However, Text::CSV gets very confused if the quote character appears in the middle of a field rather than around the whole field.
The above program prints out:
fields[1] = 2
Bad argument:

fields[1] = 2 3
Bad argument:

fields[1] =
Bad argument: 1 2" 3 " 4 5
Does anyone have any suggestions? I'd like the final fields[1] to be populated with 2" 3 "
In other words, I want to split the line on any whitespace that is not contained in a quoted string.
Upvotes: 0
Views: 855
Reputation: 43703
What you want is not CSV, so you need to code your own parsing.
This should work for your particular case:
use strict;

while (<DATA>) {
    chomp $_;
    my @fields = /([^\s"]+|(?:[^\s"]*"[^"]*"[^\s"]*)+)(?:\s|$)/g;
    print "$_\n" for @fields;
    print "\n";
}
__DATA__
1 2 3 4 5
1 "2 3" 4 5
1 2" 3 " 4 5
1 2" 3 "4 5
1 2" 3 "4" 5" 6
1 2" 3 "4"" 5"" 6
...and its output is:
1
2
3
4
5
1
"2 3"
4
5
1
2" 3 "
4
5
1
2" 3 "4
5
1
2" 3 "4" 5"
6
1
2" 3 "4""
5""
6
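For readers who want to see what the pattern is doing, here is the same regex spelled out with the /x flag and wrapped in a sub. This is only a sketch: the name split_outside_quotes is made up for illustration and is not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): split a line on
# whitespace that lies outside "..." pairs, using the regex from the answer.
sub split_outside_quotes {
    my ($line) = @_;
    return $line =~ /
        (                   # capture one field, which is either:
            [^\s"]+         #   a run with no whitespace and no quotes
          |                 # or
            (?:             #   one or more quoted segments, each made of
                [^\s"]*"    #     optional bare text, then an opening quote
                [^"]*"      #     anything up to the closing quote
                [^\s"]*     #     optional bare text after the closing quote
            )+
        )
        (?:\s|$)            # a field must end at whitespace or end of line
    /gx;
}

my @fields = split_outside_quotes('1 2" 3 " 4 5');
print "fields[1] = $fields[1]\n";
```

Note that the pattern assumes quotes come in balanced pairs. A line with a stray, unmatched quote does not fit either alternative, so some text can be skipped without any error being raised.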
Upvotes: 1
Reputation: 20330
Change quote_char to something other than " and the third line would be
1
2"
3
"
4
5
However the second line will now be
1
"2
3"
4
5
So you appear to have one line where " is the quote delimiter and one where it isn't.
The file you are parsing is broken, and you are going to have to get clever.
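One possible way to "get clever", sketched here rather than offered as the definitive fix, is to drop Text::CSV for this file and scan each line one character at a time, tracking whether you are inside a quoted run. The sub name split_by_scanning and the decision to warn on an unbalanced quote are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): walk the line one
# character at a time, toggling an "inside quotes" flag, and split only on
# whitespace seen outside quotes. Quotes are kept as part of the field.
sub split_by_scanning {
    my ($line) = @_;
    my @fields;
    my $current   = '';
    my $in_quotes = 0;

    for my $char (split //, $line) {
        if ($char eq '"') {
            $in_quotes = !$in_quotes;   # a quote opens or closes a quoted run
            $current .= $char;
        }
        elsif ($char =~ /\s/ && !$in_quotes) {
            # whitespace outside quotes ends the current field, if there is one
            push @fields, $current if length $current;
            $current = '';
        }
        else {
            $current .= $char;          # ordinary character, or quoted whitespace
        }
    }
    push @fields, $current if length $current;

    # Assumption: an unbalanced quote is worth reporting to the caller.
    warn "unbalanced quote in: $line\n" if $in_quotes;
    return @fields;
}

print join('|', split_by_scanning('1 "2 3" 4 5')), "\n";
print join('|', split_by_scanning('1 2" 3 " 4 5')), "\n";
```

This treats " the same way on every line: it never delimits a whole field, it only protects the whitespace between a pair of quotes, which is what both the second and third lines of the sample file need.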
Upvotes: 0