Blue Magister
Blue Magister

Reputation: 13363

Reading in text file with unmatched quotes

I have a large (>1GB) CSV file I'm trying to read into a data frame in R.

The non-numeric fields are enclosed in double-quotes so that internal commas are not interpreted as delimiters. That's well and good. However, there are also sometimes unmatched double-quotes in an entry, like "2" Nails".

What is the best way to work around this? My current plan is to use a text processor like awk to relabel the quoting character from the double-quote " to a non-conflicting character like pipe |. My heuristic for finding quoting characters would be double-quotes next to a comma:

gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt 

This question is related, but the solution (argument in read.csv of quote="") is not viable for me because my file has non-delimiting commas enclosed in the quotation marks.

Upvotes: 1

Views: 1815

Answers (2)

ptrcao
ptrcao

Reputation: 423

This would be a more foolproof variant of Tim's solution, in case non-boundary commas exist inside the cell:

(?<!,\s+)"(?!\s+,$)

I'm not sure if it would have any bugs though.

Upvotes: 0

Tim Pietzcker
Tim Pietzcker

Reputation: 336108

Your idea of looking for quotes next to a comma is probably the best thing you can do; you could however try to turn it around and have the regex escape all the quotes that are not next to a comma (or start/end of line):

Search for

(?<!^|,)"(?!,|$)

and replace all the matches with "".

R might not be the best tool for this because its regex engine doesn't have a multiline mode, but in Perl it would be a one-liner:

$subject =~ s/(?<!^|,)"(?!,|$)/""/mg;

Upvotes: 2

Related Questions