Interlated
Interlated

Reputation: 5956

How do I parse CSV in PERL that has quotes in a field?

My data has quotes in it, the names of properties in addresses are quoted. e.g.

"21JAN1984:00:00:00","M",""Millfield""," "

PERL Text::CSV dies at this point with an error

CSV_PP ERROR: 2025 - EIQ - Loose unescaped escape

This looks to me like valid CSV, as is a field, "James said "nice". ".

An abbreviated version of the code used is:

my $csv = Text::CSV->new({
  binary => 1,
  auto_diag => 1,
  eol => "\n",
  always_quote => 1
}) or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, '<',  $ARGV[0] or die $!;
while (my $person = $csv->getline_hr($fh)) {
  ...
}

Upvotes: 0

Views: 1147

Answers (1)

Jonathan Leffler
Jonathan Leffler

Reputation: 755104

Addressing the revised question

"21JAN1984:00:00:00","M",""Millfield""," "

If you want a double quote before Millfield and another after it, the correct CSV format is:

"21JAN1984:00:00:00","M","""Millfield"""," "

As written, the CSV data is broken. Or, at any rate, it is not the 'standard' format. You can find a standard specification for CSV as RFC4180. This is not identical to Microsoft's specification; the RFC itself identifies that Excel doesn't use precisely this format.

Since you're using Perl's Text::CSV module, you should read its specification. Note that the allow_loose_quotes attribute describes input exactly like what you're trying to deal with. It is one of the many attributes you can use to configure the behaviour of Text::CSV in its new method.

Addressing the original question

What was shown in the original version of the question was horribly ill-formed CSV.

21JAN1984:00:00:00","M",""Millfield""," "

The double quote after the 00 has no place in the format. At best, you have to treat it as a regular character at the end of a field delimited by the comma that follows. The "M" is non-controversial. The ""Millfield"" is malformed; if a string starts with a double quote, it ends at the next double quote unless that is itself followed by another double quote, so the second double quote is erroneous. If a field starts with a double quote, it should be enclosed by double quotes. The best you can do is assume that the field is Millfield"" and stops at the comma, but it is erroneous by any normal rules. Under those error-recovery rules, the " " at the end is non-controversial.

To be reasonably well-formed and to contain "Millfield" as a value, you'd need one of these:

"21JAN1984:00:00:00","M","""Millfield"""," "
21JAN1984:00:00:00,"M","""Millfield"""," "
21JAN1984:00:00:00,M,"""Millfield"""," "
21JAN1984:00:00:00,M,"""Millfield""", 

The last of those lines has a trailing blank.

Alternatively, if Millfield should not be surrounded by double quotes when extracted, then all the double quotes are superfluous, though any field could be surrounded by a single pair of double quotes.

Upvotes: 6

Related Questions