Jay Gattuso
Jay Gattuso

Reputation: 4130

CSV files with quote and comma chars inside fields

I have a stack of CSV files I want to parse - the problem is half of the have quote marks used as quote marks, and commas inside main field. They are not really CSV, but they do have a fixed number of fields that are identifiable. The dialect=csv."excel" setting works perfectly on files with out the extra " and , chars inside the field.

This data is old/unsupported. I am trying to push some life into it.

e.g.

"AAAAA
AAAA
AAAA
AAAA","AAAAAAAA


AAAAAA
AAAAA "AAAAAA" AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAA, AAAAA
AAAAAAAAA AAAAA AAAAAAAAAA
AAAAA, "AAAAA", AAAAAAAAA
AAAAAAAA AAAAAAAA
AAAAAAA
"

This is tripping the file parser, and throws an error _csv.Error: newline inside string. I narrrowed it down to this being the issue by removing the quote marks from inside the 2nd field and the csv.reader module parses the file OK.

Some of the fields are multi line - I'm not sure if thats important to know.

I have been poking around at the dialect settings, and whilst I can find 'skipinitialspace', this doesn't seem to solve the problem.

To be clear - this is not valid 'CSV', its data objects that loosely follow a CSV structure, but have , and " chars inside the field test.

The lineterminator is \x0d\x0a

I have tried a number of goes at differnt permuations of doublequote and the quoting variable in the dialect module, but I can't get this parse correctly.

I can not be confident that a ," or ", combination exists only on field boundaries.

This problem only exists for one (the last) of several fields in the file, and there are several thousand files.

Upvotes: 8

Views: 28550

Answers (3)

Joshua
Joshua

Reputation: 26732

I would write a converter that would parse the initial csv and output a valid one. You can possibly use the ", or "\n as the mechanism for determining the the delimiting.

Upvotes: 2

Herbie
Herbie

Reputation: 141

I'm not allowed to comment quite yet, so I'll post as an answer...

Assuming you are using commas as your delimiter, are there any commas within your data? If not, then you could do a massive find and replace to double all the quote characters after the first and before the last characters of the field prior to CSV processing.

Upvotes: 4

senderle
senderle

Reputation: 150977

Have you tried passing csv.QUOTE_NONE via the quoting keyword arg? Without having some code or data to test this on, I have no way to know whether this actually works on your data, but it seems to work with the fragment you provided.

>>> import csv
>>> r = csv.reader(open('foo.csv', 'rb'), quoting=csv.QUOTE_NONE)
>>> for row in r: print row
... 
['"A"', '"B"', '"ccc "ccccccc" cccccc"']

Upvotes: 11

Related Questions