Reputation: 13363
I have a large (>1GB) CSV file I'm trying to read into a data frame in R.
The non-numeric fields are enclosed in double quotes so that internal commas are not interpreted as delimiters. That's well and good. However, an entry sometimes contains an unmatched double quote, like "2" Nails".
What is the best way to work around this? My current plan is to use a text processor like awk to change the quoting character from the double quote " to a non-conflicting character like the pipe |. My heuristic for finding quoting characters is a double quote adjacent to a comma (or at the start or end of a line):
gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt
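The same boundary-quote heuristic can be sketched in Python, which may be easier to tweak than the awk one-liner (a minimal illustration; the function name and the sample line are my own, not from the original question):

```python
import re

def pipe_quotes(line):
    # Replace quotes at field boundaries -- line start/end, or adjacent to
    # a comma -- with '|', leaving stray interior quotes like 2" untouched.
    line = re.sub(r'^"|"$', '|', line)   # quote at start or end of line
    line = re.sub(r',"', ',|', line)     # quote right after a comma
    line = re.sub(r'",', '|,', line)     # quote right before a comma
    return line

print(pipe_quotes('"2" Nails",123,"foo, bar"'))
# → |2" Nails|,123,|foo, bar|
```

The stray quote in 2" survives while every field-boundary quote becomes |, so the result could then be read with read.csv(..., quote = "|").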
This question is related, but its solution (the quote="" argument to read.csv) is not viable for me, because my file has non-delimiting commas enclosed in the quotation marks.
Upvotes: 1
Views: 1815
Reputation: 423
This would be a more foolproof variant of Tim's solution, in case non-boundary commas exist inside a cell:
(?<!,\s+)"(?!\s+,$)
I'm not sure whether it has any bugs, though.
Upvotes: 0
Reputation: 336108
Your idea of looking for quotes next to a comma is probably the best thing you can do; you could, however, turn it around and have the regex escape all the quotes that are not next to a comma (or the start/end of the line): search for
(?<!^|,)"(?!,|$)
and replace all matches with "" (doubling a quote is the standard CSV escape for a literal quote).
R might not be the best tool for this because its regex engine doesn't have a multiline mode, but in Perl it's a one-liner:
$subject =~ s/(?<!^|,)"(?!,|$)/""/mg;
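For comparison, here is a hedged Python sketch of the same escape-the-stray-quotes idea. Python's re module rejects the variable-width lookbehind used in the Perl one-liner, so this version inspects the neighbouring characters in a replacement callback instead; the function name and sample line are illustrative, not from the answer:

```python
import csv
import io
import re

def escape_stray_quotes(line):
    """Double every quote that is not at a field boundary."""
    def repl(m):
        i, j = m.start(), m.end()
        boundary = (
            i == 0 or j == len(line)   # quote at start/end of line
            or line[i - 1] == ','      # quote right after a comma
            or line[j] == ','          # quote right before a comma
        )
        return '"' if boundary else '""'  # "" is the CSV escape for a literal quote
    return re.sub('"', repl, line)

escaped = escape_stray_quotes('"2" Nails",123,"foo, bar"')
print(escaped)                                 # "2"" Nails",123,"foo, bar"
print(next(csv.reader(io.StringIO(escaped))))  # ['2" Nails', '123', 'foo, bar']
```

After the stray quote is doubled, a standard CSV parser recovers the intended three fields, including the literal quote in 2" Nails.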
Upvotes: 2