Reputation: 798
This is more of a request for an explanation of how the csv module really understands data coming in from a file. I don't quite understand a quirk I've come across a few times, tied specifically to newline characters.
With the following code:
import csv

with open(filename, 'rbU') as inCsvFile:
    for row in csv.reader(inCsvFile):
        print repr(row)
I get the following lines printed:
['ID', 'AFTADM', 'AFTORD', 'AFTREM', 'AFTOTH', 'AFTOTX', 'COMM']
['3', '2', '2', '2', '2', '', 'New, had to look up']
['3', '2', '2', '2', '2', '', 'Percieved complexity']
['3', '2', '2', '2', '2', '', 'Stress per se']
['3', '2', '2', '2', '2', '', 'no records']
['10', '2', '1', '2', '2', '', 'Please ignore the crossouts!\nMix-up, very late start\nCould not get database to work -- unable to view data']
['10', '2', '2', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['10', '1', '1', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['11', '1', '1', '1', '1', 'viewed results', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['12', '2', '1', '2', '2', '', '']
['12', '', '', '', '', '', '']
['12', '2', '2', '2', '1', 'emailed', '']
...
Specifically the row I'm interested in is:
['10', '2', '1', '2', '2', '', 'Please ignore the crossouts!\nMix-up, very late start\nCould not get database to work -- unable to view data']
If I try to handle this without the csv module using code such as:
with open(filename, 'rbU') as inFile:
    for row in inFile:
        print repr(row)
This treats it as a text stream and I wind up with that one row turning into three:
'10,2,1,2,2,,"Please ignore the crossouts!\n'
'Mix-up, very late start\n'
'Could not get database to work -- unable to view data"\n'
I know it's the newline characters doing it, but how does the csv module recognize the end of a row without getting tripped up by the newline characters?
Is there specific code I could use (without reinventing the csv module) that would allow me to deal with this sort of scenario without using csv.reader?
Contents of file for anyone who wants to see the data file I'm discussing: https://dl.dropboxusercontent.com/u/19141650/test.csv
Upvotes: 2
Views: 1291
Reputation: 1123620
The column is quoted. The column starts with a " and ends with another. Everything between those two quotes is recognized as one column:
"Please ignore the crossouts!\n
Mix-up, very late start\n
Could not get database to work -- unable to view data"
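You can see this with a small in-memory sample. The snippet below is a minimal sketch (a hard-coded list of lines standing in for your file) showing csv.reader stitching the quoted, newline-containing field back into one logical row:

import csv

# Three physical lines standing in for the file; the quoted last field
# spans all of them.
sample = [
    '10,2,1,2,2,,"Please ignore the crossouts!\n',
    'Mix-up, very late start\n',
    'Could not get database to work -- unable to view data"\n',
]

for row in csv.reader(sample):
    print repr(row)
# -> one row; the embedded newlines stay inside the last field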
When writing CSV files, the module will also use quoting where needed, unless you tell it to do otherwise with the quoting argument; you can set it to any of the csv.QUOTE_ALL, csv.QUOTE_MINIMAL, csv.QUOTE_NONNUMERIC or csv.QUOTE_NONE constants:
>>> import csv
>>> import sys
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
"5","spam","foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
5,spam,"foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
5,"spam","foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONE)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: need to escape, but no escapechar set
Note how the module actually issues an error when disabling quoting and trying to write a value with a newline.
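If you really do want QUOTE_NONE with data like this, a minimal sketch (my addition, not part of the original example) is to supply an escapechar so the writer escapes the newline instead of quoting the field:

import csv
import sys

# With quoting disabled, an escapechar lets the writer emit special
# characters (delimiter, quote, newline) escaped rather than raising.
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow([5, 'spam', 'foo\nbar'])
# output is roughly: 5,spam,foo\   followed by the literal newline, then bar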
The same constants apply when reading; the default is taken from the dialect chosen, and the default Excel dialect has quoting set to csv.QUOTE_MINIMAL.
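For the reading side, here is a minimal sketch (again using a hard-coded two-line sample) of the difference: with the default QUOTE_MINIMAL the quoted field spans both lines, while QUOTE_NONE treats the quote characters as ordinary data and the newline ends the row:

import csv

sample = ['5,spam,"foo\n', 'bar"\n']

print list(csv.reader(sample))
# [['5', 'spam', 'foo\nbar']]

print list(csv.reader(sample, quoting=csv.QUOTE_NONE))
# [['5', 'spam', '"foo'], ['bar"']]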
Upvotes: 3
Reputation: 308422
The csv module interprets everything between quote marks (") as part of the string, even newlines.
Upvotes: 1