Reputation: 5666
I'm getting CSV formatted data piped from an external source, and the data should not be written to a file, that would open a number of different maintenance tasks I would like to avoid. I am getting the data as a string.
So now I want to interpret the data as CSV. The python module csv
is suited for that, so I am using it.
However, constructing a CSV parser using csv.reader
(or dictreader
) requires the data to support an iterator. A list or a reader would already do that. However, my data contains line breaks (\n
) inside quotes strings (which is totally fine for the CSV format), and also line breaks as record end indicators/line terminators.
If I construct it like this:
csvreader = csv.reader(data.split('\n'))
then I have already broken the valid CSV formatting by interpreting the line breaks as record end indicators, even if they were inside quoted strings. For me, the interpretation of the line breaks should be done by the CSV parser, as it is aware of quotes fields containing CSV relevant formatting characters like line breaks.
Do I have a misunderstanding with the way the csv
package parser works? Or how would I parse a CSV type string using the CSV parser while the data contains line breaks that do not indicate the end of a record?
Edit 1: It might not have been wise to add context to my question, as the context seems to have overtaken my actual question/problem. To clarify, my problem is not parsing CSV from memory. I have found StringIO
already. When using StringIO
, all that does is use the whole string/file as the first record.
Edit 2: Maybe sample data will help. My data looks like this:
"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"
So inside the 3rd entry in the first record I have a line break character, which should be interpreted as part of the third entry in the first record.
Edit 3: The way I currently use the parser is this:
csvreader = csv.reader(StringIO(result), quotechar='"', delimiter=';', lineterminator='\n', escapechar='"')
because the spec says that a double quote (") inside a field is escaped by a quote char (""), so the data might look like this:
"value1";"value2";"value3
stillvalue3
a ""quote"" inside the quote"
"value4";"value5";"value6"
It seems that using these settings the StringIO input always results in one record with all the data from the string inside it. Removing the
escapechar='"'
fixes this problem, however, I cannot support escaped double quote characters inside the quotes. So the nature of my question changed. I'll keep this open if anybody has an idea and will accept the answer of Martin Evans later.
Upvotes: 7
Views: 2890
Reputation: 46759
You can use Python's StringIO()
to give your data string a stream interface.
Do not attempt to use data.split('\n')
as this will fail to split the lines correctly, the csv.reader()
will correctly read the lines in, even if cells contain newlines:
from io import StringIO
import csv
data = '''"value1";"value2";"value3
stillvalue3"
"value4";"value5";"value6"'''
csv_input = csv.reader(StringIO(data, newline=''), delimiter=';')
rows = list(csv_input)
print("Rows", rows)
Giving you:
Rows [['value1', 'value2', 'value3\nstillvalue3'], ['value4', 'value5', 'value6']]
This could then be loaded into Pandas, or directly using StringIO()
:
columns = ['col1', 'col2', 'col3']
df = pd.DataFrame(rows, columns=columns)
df = pd.read_csv(StringIO(data, newline=''), names=columns, sep=';')
print(df)
Giving:
col1 col2 col3
0 value1 value2 value3\nstillvalue3
1 value4 value5 value6
Your second example gives:
Rows [['value1', 'value2', 'value3\nstillvalue3\na "quote" inside the quote'], ['value4', 'value5', 'value6']]
col1 col2 col3
0 value1 value2 value3\nstillvalue3\na "quote" inside the quote
1 value4 value5 value6
Upvotes: 8