Max Candocia
Max Candocia

Reputation: 4385

Are there any alternatives to the Python csv module that can quote newlines and use arbitrary characters as newlines for reading?

Python's csv module is hard-coded (in C) to immediately recognize a carriage return/line feed as an end of a row when using the reader.

In many cases, I've had to write a script (in Python, because tools like sed do not treat newlines/carriage returns normally) to replace in-cell newlines byte-by-byte with an unused text character (vertical tab) and then replace that when parsing it again with a csv reader.

There are two primary cases I have had to deal with:

The files I am dealing with are too large to efficiently process in memory, so I would like to know if there's an alternative parser that doesn't automatically terminate a row after a carriage return or newline is encountered.

Upvotes: 4

Views: 2630

Answers (1)

Nathan Vērzemnieks
Nathan Vērzemnieks

Reputation: 5603

It's not the csv module that's getting your line endings wrong - it's the file object. The csv reader just treats its first argument as an iterator. The file object treats all styles of line endings as line terminators by default. For Python 3, you can solve your issue by opening your csv file with the argument newline='\r\n':

reader = csv.DictReader(open('weird.csv', newline='\r\n'))

With Python 2, you can use the io module:

import io
reader = csv.DictReader(io.open('weird.csv', newline='\r\n'))

Note that both these approaches count on the newline-containing field being properly quoted. If it's not, you can try wrapping your open file (with either approach above) with a generator that quotes the newline-containing fields, like so:

import re
def quote(fh, delimiter='\t'):
    for line in fh:
        # using line.strip('\r\n') could improperly remove a
        # leading or trailing \n character
        fields = re.sub('\r\n$', '', line).split(delimiter)
        quoted = []
        for field in fields:
            if '\n' in field and not re.match('".*"$', field, re.MULTILINE):
                # must quote any existing quotes!
                field = '"{}"'.format(field.replace('"', '""')
            quoted.append(field)
        yield delimiter.join(quoted) + '\r\n'

reader = csv.DictReader(quote(io.open('weird.csv', newline='\r\n')), delimiter='\t')

Upvotes: 1

Related Questions