Reputation: 5956
My data has quotes in it, the names of properties in addresses are quoted. e.g.
"21JAN1984:00:00:00","M",""Millfield""," "
PERL Text::CSV dies at this point with an error
CSV_PP ERROR: 2025 - EIQ - Loose unescaped escape
This looks to me like valid CSV, as is a field, "James said "nice". ".
An abbreviated version of the code used is:
my $csv = Text::CSV->new({
binary => 1,
auto_diag => 1,
eol => "\n",
always_quote => 1
}) or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $fh, '<', $ARGV[0] or die $!;
while (my $person = $csv->getline_hr($fh)) {
...
}
Upvotes: 0
Views: 1147
Reputation: 755104
"21JAN1984:00:00:00","M",""Millfield""," "
If you want a double quote before Millfield
and another after it, the correct CSV format is:
"21JAN1984:00:00:00","M","""Millfield"""," "
As written, the CSV data is broken. Or, at any rate, it is not the 'standard' format. You can find a standard specification for CSV as RFC4180. This is not identical to Microsoft's specification; the RFC itself identifies that Excel doesn't use precisely this format.
Since you're using Perl's Text::CSV module, you should read its specification. Note that the allow_loose_quotes
attribute describes input exactly like what you're trying to deal with. It is one of the many attributes you can use to configure the behaviour of Text::CSV
in its new
method.
What was shown in the original version of the question was horribly ill-formed CSV.
21JAN1984:00:00:00","M",""Millfield""," "
The double quote after the 00
has no place in the format. At best, you have to treat it as a regular character at the end of a field delimited by the comma that follows. The "M"
is non-controversial. The ""Millfield""
is malformed; if a string starts with a double quote, it ends at the next double quote unless that is itself followed by another double quote, so the second double quote is erroneous. If a field starts with a double quote, it should be enclosed by double quotes. The best you can do is assume that the field is Millfield""
and stops at the comma, but it is erroneous by any normal rules. Under those error-recovery rules, the " "
at the end is non-controversial.
To be reasonably well-formed and to contain "Millfield"
as a value, you'd need one of these:
"21JAN1984:00:00:00","M","""Millfield"""," "
21JAN1984:00:00:00,"M","""Millfield"""," "
21JAN1984:00:00:00,M,"""Millfield"""," "
21JAN1984:00:00:00,M,"""Millfield""",
The last of those lines has a trailing blank.
Alternatively, if Millfield
should not be surrounded by double quotes when extracted, then all the double quotes are superfluous, though any field could be surrounded by a single pair of double quotes.
Upvotes: 6