Mew

Reputation: 415

CSV line continuation character to ignore newlines

I'm using Python to parse a .csv file that contains line breaks in most values. This isn't an issue, since values are delimited by ".

However, I've noticed that at some point while the .csv file was being built, long values were split across multiple lines (but kept within the same value), with an = character put at the end of a line to signify "the following line break is actually a concatenation". A minimal working example: the value

Hello, world!
How are you today?

could be represented as

"Hello, world!\n
How are you t=\n
oday?"

where \n denotes the one-byte line break character.

Does CSV have the concept of "line continuation characters"? The documentation of Python's csv library does not mention anything about it under the formatting section, so I wonder whether this is common practice and whether Python supports it anyway. I know how to write a parser that concatenates these lines (a simple v.replace("=\n", "") probably suffices), but I'm curious whether this is an idiosyncrasy of my file.

Upvotes: 1

Views: 183

Answers (1)

Mew

Reputation: 415

This seems to be a feature of MIME rather than of CSV (and since my dataset consists of e-mails, this answers my question).

This usage of equals characters is part of quoted-printable encoding, and can be handled by the quopri Python module. See this answer for more details.

Using this module is better than a simple v.replace("=\n", ""), because e-mails can contain other quoted-printable tokens that need decoding and do not appear at line ends (e.g. =09 to represent a horizontal tab). With quopri, you would write:

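import quopri

v = ...  # a quoted-printable string read from the CSV
# decodestring() works on bytes, so encode first and decode the result.
original = quopri.decodestring(v.encode("utf-8")).decode("utf-8")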
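
For completeness, here is a rough sketch of how this could be combined with the csv module, using the example value from the question. The sample data, the decode_qp helper and the assumption that every cell is UTF-8 are mine for illustration, not something taken from the real dataset:

```python
import csv
import io
import quopri

# Hypothetical sample: a header row plus one row whose quoted field contains
# a real line break, a soft line break ("=" at end of line) and "=09" (a tab).
raw = 'id,body\r\n1,"Hello, world!\nHow are you t=\noday?=09(tab)"\r\n'


def decode_qp(value: str, charset: str = "utf-8") -> str:
    """Decode quoted-printable text; the charset is an assumption."""
    return quopri.decodestring(value.encode(charset)).decode(charset)


# The csv module keeps line breaks inside quoted fields, so each cell can be
# decoded after parsing. newline="" leaves the line endings untouched.
rows = list(csv.reader(io.StringIO(raw, newline="")))
decoded = [[decode_qp(cell) for cell in row] for row in rows]

print(repr(decoded[1][1]))
```

The real line break after "Hello, world!" is kept, the soft break inside "today" is removed, and =09 becomes a tab. If the e-mails use different character sets, the charset would have to be chosen per message rather than hard-coded.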

Upvotes: 2
