Reputation: 4385
Python's csv
module is hard-coded (in C) to immediately recognize a carriage return/line feed as an end of a row when using the reader.
In many cases, I've had to write a script (in Python, because tools like sed
do not treat newlines/carriage returns normally) to replace in-cell newlines byte-by-byte with an unused text character (vertical tab) and then replace that when parsing it again with a csv reader.
There are two primary cases I have had to deal with:
The true "end-of-line" was always indicated by \r\n
and in-cell newlines were simply \n
The end-of-lines were all \n
, except when within a quoted field. (e.g., val1,"first line of cell\nsecond line of cell",val3
)
The files I am dealing with are too large to efficiently process in memory, so I would like to know if there's an alternative parser that doesn't automatically terminate a row after a carriage return or newline is encountered.
Upvotes: 4
Views: 2630
Reputation: 5603
It's not the csv
module that's getting your line endings wrong - it's the file
object. The csv reader just treats its first argument as an iterator. The file
object treats all styles of line endings as line terminators by default. For Python 3, you can solve your issue by opening your csv file with the argument newline='\r\n'
:
reader = csv.DictReader(open('weird.csv', newline='\r\n'))
With Python 2, you can use the io
module:
import io
reader = csv.DictReader(io.open('weird.csv', newline='\r\n'))
Note that both these approaches count on the newline-containing field being properly quoted. If it's not, you can try wrapping your open file (with either approach above) with a generator that quotes the newline-containing fields, like so:
import re
def quote(fh, delimiter='\t'):
for line in fh:
# using line.strip('\r\n') could improperly remove a
# leading or trailing \n character
fields = re.sub('\r\n$', '', line).split(delimiter)
quoted = []
for field in fields:
if '\n' in field and not re.match('".*"$', field, re.MULTILINE):
# must quote any existing quotes!
field = '"{}"'.format(field.replace('"', '""')
quoted.append(field)
yield delimiter.join(quoted) + '\r\n'
reader = csv.DictReader(quote(io.open('weird.csv', newline='\r\n')), delimiter='\t')
Upvotes: 1