Reputation: 798
This is more of a request for an explanation of how the csv module really understands data coming in from a file. I don't quite understand a quirk I've come across a few times, tied specifically to newline characters.
With the following code:
import csv

with open(filename, 'rbU') as inCsvFile:
    for row in csv.reader(inCsvFile):
        print repr(row)
I get the following lines printed:
['ID', 'AFTADM', 'AFTORD', 'AFTREM', 'AFTOTH', 'AFTOTX', 'COMM']
['3', '2', '2', '2', '2', '', 'New, had to look up']
['3', '2', '2', '2', '2', '', 'Percieved complexity']
['3', '2', '2', '2', '2', '', 'Stress per se']
['3', '2', '2', '2', '2', '', 'no records']
['10', '2', '1', '2', '2', '', 'Please ignore the crossouts!\nMix-up, very late start\nCould not get database to work -- unable to view data']
['10', '2', '2', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['10', '1', '1', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['11', '1', '1', '1', '1', 'viewed results', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['12', '2', '1', '2', '2', '', '']
['12', '', '', '', '', '', '']
['12', '2', '2', '2', '1', 'emailed', '']
...
Specifically the row I'm interested in is:
['10', '2', '1', '2', '2', '', 'Please ignore the crossouts!\nMix-up, very late start\nCould not get database to work -- unable to view data']
If I try to handle this without the csv module using code such as:
with open(filename, 'rbU') as inFile:
    for row in inFile:
        print repr(row)
This treats it as a text stream and I wind up with that one row turning into three:
'10,2,1,2,2,,"Please ignore the crossouts!\n'
'Mix-up, very late start\n'
'Could not get database to work -- unable to view data"\n'
I know it's the newline characters doing it, but how does the csv module recognize the end of a row without getting tripped up by the newline characters?
Is there specific code I could use (without reinventing the csv module) that would allow me to deal with this sort of scenario without using csv.reader?
Contents of file for anyone who wants to see the data file I'm discussing: https://dl.dropboxusercontent.com/u/19141650/test.csv
Upvotes: 2
Views: 1291
Reputation: 1123620
The column is quoted. The column starts with a " and ends with another. Everything between those two quotes is recognized as one column:
"Please ignore the crossouts!\n
Mix-up, very late start\n
Could not get database to work -- unable to view data"
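You can see this with a small in-memory sample. The snippet below is a minimal sketch (a hard-coded list of lines standing in for your file) showing csv.reader stitching the quoted, newline-containing field back into one logical row:

import csv

# Three physical lines standing in for the file; the quoted last field
# spans all of them.
sample = [
    '10,2,1,2,2,,"Please ignore the crossouts!\n',
    'Mix-up, very late start\n',
    'Could not get database to work -- unable to view data"\n',
]

for row in csv.reader(sample):
    print repr(row)
# -> one row; the embedded newlines stay inside the last field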
When writing CSV files, the module will also use quoting where needed, unless you tell it to do otherwise with the quoting argument; you can set it to any of the csv.QUOTE_ALL, csv.QUOTE_MINIMAL, csv.QUOTE_NONNUMERIC or csv.QUOTE_NONE constants:
>>> import csv
>>> import sys
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
"5","spam","foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
5,spam,"foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
5,"spam","foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONE)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: need to escape, but no escapechar set
Note how the module actually issues an error when disabling quoting and trying to write a value with a newline.
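If you really do want QUOTE_NONE with data like this, a minimal sketch (my addition, not part of the original example) is to supply an escapechar so the writer escapes the newline instead of quoting the field:

import csv
import sys

# With quoting disabled, an escapechar lets the writer emit special
# characters (delimiter, quote, newline) escaped rather than raising.
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow([5, 'spam', 'foo\nbar'])
# output is roughly: 5,spam,foo\   followed by the literal newline, then bar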
The same constants apply when reading; the default is taken from the dialect chosen, and the default Excel dialect has quoting set to csv.QUOTE_MINIMAL.
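For the reading side, here is a minimal sketch (again using a hard-coded two-line sample) of the difference: with the default QUOTE_MINIMAL the quoted field spans both lines, while QUOTE_NONE treats the quote characters as ordinary data and the newline ends the row:

import csv

sample = ['5,spam,"foo\n', 'bar"\n']

print list(csv.reader(sample))
# [['5', 'spam', 'foo\nbar']]

print list(csv.reader(sample, quoting=csv.QUOTE_NONE))
# [['5', 'spam', '"foo'], ['bar"']]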
Upvotes: 3
Reputation: 308422
The csv module interprets everything between quote marks (") as part of the string, even newlines.
Upvotes: 1