0xCAFEBABE

Reputation: 5666

Parsing a CSV string with the csv module

I'm getting CSV-formatted data piped from an external source, and the data should not be written to a file, as that would create a number of maintenance tasks I would like to avoid. I am getting the data as a string.

So now I want to interpret the data as CSV. The Python csv module is suited for that, so I am using it.

However, constructing a CSV parser using csv.reader (or csv.DictReader) requires the data to support iteration. A list or a file-like reader would already do that. However, my data contains line breaks (\n) inside quoted strings (which is perfectly fine for the CSV format), as well as line breaks acting as record end indicators/line terminators.

If I construct it like this:

csvreader = csv.reader(data.split('\n'))

then I have already broken the valid CSV formatting by interpreting the line breaks as record end indicators, even when they are inside quoted strings. To my mind, the interpretation of the line breaks should be done by the CSV parser, as it is aware of quoted fields containing CSV-relevant formatting characters like line breaks.

Do I have a misunderstanding of the way the csv package's parser works? Or how would I parse a CSV string with the csv parser when the data contains line breaks that do not indicate the end of a record?

Edit 1: It might not have been wise to add context to my question, as the context seems to have overshadowed my actual problem. To clarify, my problem is not reading CSV from memory; I have already found StringIO. When I use StringIO, all that does is treat the whole string/file as the first record.

Edit 2: Maybe sample data will help. My data looks like this:

"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"

So the third entry in the first record contains a line break character, which should be interpreted as part of that entry.
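
To make the problem concrete, here is (roughly) the split-based approach with the sample above inlined as a string:

import csv

data = '''"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"'''

# split('\n') removes the newline characters before the parser ever sees
# them, so the line break inside the third field of the first record is
# lost no matter how the reader pieces the fragments back together
csvreader = csv.reader(data.split('\n'), delimiter=';')
print(list(csvreader))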

Edit 3: The way I currently use the parser is this:

csvreader = csv.reader(StringIO(result), quotechar='"', delimiter=';', lineterminator='\n', escapechar='"')

because the spec says that a double quote (") inside a quoted field is escaped by doubling it (""), so the data might look like this:

"value1";"value2";"value3
stillvalue3
a ""quote"" inside the quote"
"value4";"value5";"value6"

It seems that using these settings, the StringIO input always results in one record containing all the data from the string. Removing the

escapechar='"'

fixes this problem; however, then I cannot support escaped double quote characters inside quoted fields. So the nature of my question has changed. I'll keep this open in case anybody has an idea, and will accept Martin Evans's answer later.
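
To make this reproducible, a minimal snippet with the second sample inlined as a string should show the behaviour described above:

from io import StringIO
import csv

data = '''"value1";"value2";"value3
stillvalue3
a ""quote"" inside the quote"
"value4";"value5";"value6"'''

# escapechar is the same character as quotechar here, so the quote that
# should close a field seemingly gets consumed as an escape instead and,
# as described above, the whole input ends up in a single record
csvreader = csv.reader(StringIO(data), quotechar='"', delimiter=';',
                       lineterminator='\n', escapechar='"')
print(list(csvreader))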

Upvotes: 7

Views: 2890

Answers (1)

Martin Evans

Reputation: 46759

You can use Python's StringIO() to give your data string a stream interface.

Do not attempt to use data.split('\n'), as this will fail to split the lines correctly; csv.reader() will read the lines in correctly, even when cells contain newlines:

from io import StringIO
import csv

data = '''"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"'''

# newline='' keeps the embedded newlines intact for the csv module
csv_input = csv.reader(StringIO(data, newline=''), delimiter=';')
rows = list(csv_input)
print("Rows", rows)

Giving you:

Rows [['value1', 'value2', 'value3\nstillvalue3'], ['value4', 'value5', 'value6']]

This could then be loaded into Pandas, either from the already parsed rows or directly via StringIO():

import pandas as pd

columns = ['col1', 'col2', 'col3']

# build the DataFrame from the rows already parsed by csv.reader
df = pd.DataFrame(rows, columns=columns)

# or let Pandas read the string directly
df = pd.read_csv(StringIO(data, newline=''), names=columns, sep=';')

print(df)

Giving:

     col1    col2                 col3
0  value1  value2  value3\nstillvalue3
1  value4  value5               value6

Your second example gives:

Rows [['value1', 'value2', 'value3\nstillvalue3\na "quote" inside the quote'], ['value4', 'value5', 'value6']]

     col1    col2                                             col3
0  value1  value2  value3\nstillvalue3\na "quote" inside the quote
1  value4  value5                                           value6
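
Something along these lines produces the second example's output; the doubled quotes are handled by the csv module's default doublequote behaviour, so no escapechar setting is needed:

from io import StringIO
import csv

data = '''"value1";"value2";"value3
stillvalue3
a ""quote"" inside the quote"
"value4";"value5";"value6"'''

# with the default doublequote=True, "" inside a quoted field is read as a
# literal double quote, and newlines inside quotes stay part of the field
csv_input = csv.reader(StringIO(data, newline=''), delimiter=';')
print(list(csv_input))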

Upvotes: 8
