stoves

Reputation: 798

How does the Python csv module understand end of row vs. newline?

This is more a request for an explanation of how the csv module interprets data coming in from a file. I don't quite understand a quirk I've come across a few times, tied specifically to newline characters.

With the following code:

import csv

with open(filename, 'rbU') as inCsvFile:
    for row in csv.reader(inCsvFile):
        print repr(row)

I get the following lines printed:

['ID', 'AFTADM', 'AFTORD', 'AFTREM', 'AFTOTH', 'AFTOTX', 'COMM']
['3', '2', '2', '2', '2', '', 'New, had to look up']
['3', '2', '2', '2', '2', '', 'Percieved  complexity']
['3', '2', '2', '2', '2', '', 'Stress per se']
['3', '2', '2', '2', '2', '', 'no records']
['10', '2', '1', '2', '2', '', 'Please ignore the crossouts!\nMix-up, very late start\nCould not get database to work -- unable to view  data']
['10', '2', '2', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['10', '1', '1', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['10', '2', '2', '2', '2', '', '']
['11', '1', '1', '1', '1', 'viewed  results', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['11', '2', '2', '2', '2', '', '']
['12', '2', '1', '2', '2', '', '']
['12', '', '', '', '', '', '']
['12', '2', '2', '2', '1', 'emailed', '']
...

Specifically the row I'm interested in is:

['10', '2', '1', '2', '2', '', 'Please ignore the crossouts!\nMix-up, very late start\nCould not get database to work -- unable to view  data']

If I try to handle this without the csv module using code such as:

with open(filename, 'rbU') as inFile:
    for row in inFile:
        print repr(row)

This treats it as a text stream and I wind up with that one row turning into three:

'10,2,1,2,2,,"Please ignore the crossouts!\n'
'Mix-up, very late start\n'
'Could not get database to work -- unable to view  data"\n'

I know it's the newline characters doing it, but how does the csv module understand end of row and not get tripped up by the newline characters?

Is there specific code I could use (without reinventing the csv module) that would allow me to deal with this sort of scenario without using the csv.reader?

Contents of file for anyone who wants to see the data file I'm discussing: https://dl.dropboxusercontent.com/u/19141650/test.csv

Upvotes: 2

Views: 1291

Answers (2)

Martijn Pieters

Reputation: 1123620

The column is quoted. The column starts with a " and ends with another. Everything between those two quotes, newlines included, is read as a single column value:

"Please ignore the crossouts!\n
Mix-up, very late start\n
Could not get database to work -- unable to view  data"
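As a rough illustration of the point above, the sketch below feeds csv.reader a small in-memory sample (made up for this example, not the asker's file) in which one quoted field spans three physical lines. It is written for Python 3, so it uses io.StringIO and print() rather than the Python 2 idioms in the question.

```python
import csv
import io

# Hypothetical sample: a header line plus one record whose last field
# is quoted and spans three physical lines.
data = (
    'ID,COMM\n'
    '10,"Please ignore the crossouts!\n'
    'Mix-up, very late start\n'
    'Could not get database to work"\n'
)

# newline='' keeps embedded line breaks untouched for the csv module.
rows = list(csv.reader(io.StringIO(data, newline='')))

print(len(data.splitlines()))  # 4 physical lines
print(len(rows))               # 2 logical CSV rows
print(repr(rows[1][1]))        # the quoted field, newlines preserved
```

The reader only ends a row at a newline that falls outside an open quoted field, which is why four physical lines come back as two rows.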

When writing CSV files, the module will also use quoting where needed, unless you tell it to do otherwise with the quoting argument; you can set it to any of the csv.QUOTE_ALL, csv.QUOTE_MINIMAL, csv.QUOTE_NONNUMERIC or csv.QUOTE_NONE constants:

>>> import csv
>>> import sys
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
"5","spam","foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
5,spam,"foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
5,"spam","foo
bar"
>>> writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONE)
>>> writer.writerow([5, 'spam', 'foo\nbar'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: need to escape, but no escapechar set

Note how the module raises an error when quoting is disabled and you try to write a value containing a newline: without quoting it would have to escape the newline, and no escapechar has been set.

The same constants apply when reading; the default is taken from the dialect chosen, and the default Excel dialect has quoting set to csv.QUOTE_MINIMAL.
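To see the effect of the quoting setting on the reading side, here is a small sketch (Python 3, with a made-up one-record input) that parses the same text twice: once with the default dialect and once with quoting=csv.QUOTE_NONE.

```python
import csv
import io

# Hypothetical input: one record whose second field is quoted and
# contains a newline.
text = 'a,"b\nc"\n'

# Default (Excel dialect, QUOTE_MINIMAL): the quotes delimit one field.
default_rows = list(csv.reader(io.StringIO(text, newline='')))

# QUOTE_NONE: quote characters get no special treatment, so the
# newline inside the quotes ends the row.
no_quote_rows = list(
    csv.reader(io.StringIO(text, newline=''), quoting=csv.QUOTE_NONE)
)

print(default_rows)   # [['a', 'b\nc']]
print(no_quote_rows)  # [['a', '"b'], ['c"']]
```

With QUOTE_NONE the reader behaves much like the plain line-by-line loop in the question: the quotes are kept as literal characters and the record falls apart at the embedded newline.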

Upvotes: 3

Mark Ransom

Reputation: 308422

The csv module treats everything between a pair of quote marks (") as part of a single field, even newlines.
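For the "without csv.reader" part of the question, one simplified approach is to keep joining physical lines until the number of quote characters seen is even. This is only a sketch under stated assumptions, not how the csv module itself works (it parses character by character): it assumes the default " quote character, no escapechar, and that literal quotes inside fields are doubled (""). The function name logical_rows is made up for this example, and the code is Python 3.

```python
def logical_rows(lines, quotechar='"'):
    """Yield logical CSV records by joining physical lines until quotes balance."""
    pending = ''
    for line in lines:
        pending += line
        # An odd quote count means a quoted field is still open.
        # Doubled quotes ("") add two at a time, so parity is unaffected.
        if pending.count(quotechar) % 2 == 0:
            yield pending
            pending = ''
    if pending:
        # Unterminated quote at end of input: hand back what is left.
        yield pending


# The three physical lines from the question, plus one ordinary record.
physical = [
    '10,2,1,2,2,,"Please ignore the crossouts!\n',
    'Mix-up, very late start\n',
    'Could not get database to work -- unable to view  data"\n',
    '10,2,2,2,2,,""\n',
]

records = list(logical_rows(physical))
print(len(records))  # 2
```

This only regroups lines into records; splitting each record into fields still needs quote-aware logic, which is the part csv.reader already provides.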

Upvotes: 1
