Reputation: 6586
From mysql I am generating a tab-separated output file using outfile. I then use python to load the tsv and process it. I feel like I'm missing something, but I cannot figure out how to get csv.reader
to accept data where quoted fields can contain \t
tabs, \n
newlines, \r
carriage returns, etc. The csv.reader
keeps breaking the rows on all newline characters, not just the \n
newline characters outside of my quoted fields.
with open('/path/to/file.tsv', 'rbU') as f:
reader = csv.reader(
f,
delimiter='\t',
lineterminator='\n',
quoting=csv.QUOTE_ALL
)
for line in reader:
# do something
In the example below, \r
is an actual carriage return, \n
is an actual newline, and \N
is what mysql is outputting for a null value.
"4256996" "[email protected]" "Y " "98230\r" "2012-07-10T12:00:00" "some location" \N \N "false" "aaa" "another-field" "true" 1
The resulting output:
['4256996', '[email protected]', 'Y\t', '98230'], ['2012-07-10T12:00:00', 'some location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']
Is there a way to get the csv.reader
to read this input data properly, or is this some sort of limitation with the csv.reader
object?
Note: If you try to replicate this, make sure you replace \r
with an actual carriage return, \n
with an actual newline, etc.
Upvotes: 0
Views: 575
Reputation: 1121884
You need to open your file in binary mode only. By adding in 'U'
(universal newline mode) you are instead instructing Python to replace any \r
with \n
.
with open('/path/to/file.tsv', 'rb') as f:
Once reading just binary data your sample input works:
>>> import csv
>>> from io import BytesIO
>>> sample = BytesIO('''\
... "4256996"\t"[email protected]"\t"Y "\t"98230\r"\t"2012-07-10T12:00:00"\t"some location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n''')
>>> sample.readline()
'"4256996"\t"[email protected]"\t"Y "\t"98230\r"\t"2012-07-10T12:00:00"\t"some location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
>>> sample.seek(0)
0L
>>> reader = csv.reader(sample, delimiter='\t',
... lineterminator='\n',
... quoting=csv.QUOTE_ALL
... )
>>> next(reader)
['4256996', '[email protected]', 'Y ', '98230\r', '2012-07-10T12:00:00', 'some location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']
To illustrate, reading a line with the U
mode set Python reads the data incorrectly:
>>> sample.seek(0)
0L
>>> open('/tmp/test.csv', 'wb').write(sample.read())
>>> f = open('/tmp/test.csv', 'rbU')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y "\t"98230\n'
>>> f = open('/tmp/test.csv', 'rb')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y "\t"98230\r"\t"2012-07-10T12:00:00"\t"some location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
Upvotes: 1