Reputation: 13363
I have a large (>1GB) CSV file I'm trying to read into a data frame in R.
The non-numeric fields are enclosed in double quotes so that internal commas are not interpreted as delimiters. That's well and good. However, an entry sometimes contains an unmatched double quote, like "2" Nails".
What is the best way to work around this? My current plan is to use a text processor like awk to change the quoting character from the double quote " to a non-conflicting character like the pipe |. My heuristic for finding quoting characters is a double quote adjacent to a comma (or at the start or end of a line):
gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt
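The same boundary-quote heuristic can be sketched in Python, which may be easier to tweak than the awk one-liner (a minimal illustration; the function name and the sample line are my own, not from the original question):

```python
import re

def pipe_quotes(line):
    # Replace quotes at field boundaries -- line start/end, or adjacent to
    # a comma -- with '|', leaving stray interior quotes like 2" untouched.
    line = re.sub(r'^"|"$', '|', line)   # quote at start or end of line
    line = re.sub(r',"', ',|', line)     # quote right after a comma
    line = re.sub(r'",', '|,', line)     # quote right before a comma
    return line

print(pipe_quotes('"2" Nails",123,"foo, bar"'))
# → |2" Nails|,123,|foo, bar|
```

The stray quote in 2" survives while every field-boundary quote becomes |, so the result could then be read with read.csv(..., quote = "|").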
This question is related, but its solution (the quote="" argument to read.csv) is not viable for me, because my file has non-delimiting commas enclosed in the quotation marks.
Upvotes: 1
Views: 1815
Reputation: 423
This would be a more foolproof variant of Tim's solution, in case non-boundary commas exist inside a cell:
(?<!,\s+)"(?!\s+,$)
I'm not sure whether it has any bugs, though.
Upvotes: 0
Reputation: 336108
Your idea of looking for quotes next to a comma is probably the best thing you can do; you could, however, turn it around and have the regex escape all the quotes that are not next to a comma (or the start/end of the line): search for
(?<!^|,)"(?!,|$)
and replace all matches with "" (doubling a quote is the standard CSV escape for a literal quote).
R might not be the best tool for this because its regex engine doesn't have a multiline mode, but in Perl it's a one-liner:
$subject =~ s/(?<!^|,)"(?!,|$)/""/mg;
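For comparison, here is a hedged Python sketch of the same escape-the-stray-quotes idea. Python's re module rejects the variable-width lookbehind used in the Perl one-liner, so this version inspects the neighbouring characters in a replacement callback instead; the function name and sample line are illustrative, not from the answer:

```python
import csv
import io
import re

def escape_stray_quotes(line):
    """Double every quote that is not at a field boundary."""
    def repl(m):
        i, j = m.start(), m.end()
        boundary = (
            i == 0 or j == len(line)   # quote at start/end of line
            or line[i - 1] == ','      # quote right after a comma
            or line[j] == ','          # quote right before a comma
        )
        return '"' if boundary else '""'  # "" is the CSV escape for a literal quote
    return re.sub('"', repl, line)

escaped = escape_stray_quotes('"2" Nails",123,"foo, bar"')
print(escaped)                                 # "2"" Nails",123,"foo, bar"
print(next(csv.reader(io.StringIO(escaped))))  # ['2" Nails', '123', 'foo, bar']
```

After the stray quote is doubled, a standard CSV parser recovers the intended three fields, including the literal quote in 2" Nails.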
Upvotes: 2