python tab separated file parsing problems

Question

I am generating a tab-separated output file from MySQL using INTO OUTFILE, then loading and processing the TSV in Python. I feel like I'm missing something, but I cannot figure out how to get csv.reader to accept data where quoted fields can contain tabs, newlines, carriage returns, etc. The csv.reader keeps breaking the rows on every newline character, not just the newline characters outside my quoted fields.

Settings:

with open('/path/to/file.tsv', 'rbU') as f:
    reader = csv.reader(
        f,
        delimiter='\t',
        lineterminator='\n',
        quoting=csv.QUOTE_ALL
    )
    for line in reader:
        #  do something

Example:

In the example below, \r is an actual carriage return, \n is an actual newline, and \N is what MySQL outputs for a null value.

"4256996"   "test@gmail.com"    "Y  "   "98230\r"   "2012-07-10T12:00:00"   "some  location"    \N  \N  "false" "aaa"   "another-field" "true"  1\n

The resulting output:

['4256996', 'test@gmail.com', 'Y	', '98230'], ['2012-07-10T12:00:00', 'some  location', '\N', '\N', 'false', 'aaa', 'another-field', 'true', '1']

Is there a way to get the csv.reader to read this input data properly, or is this some sort of limitation with the csv.reader object?

Note: If you try to replicate this, make sure you replace \r with an actual carriage return, \n with an actual newline, etc.

Martijn Pieters · Accepted Answer

You need to open your file in binary mode only. By adding 'U' (universal newline mode) you are instead instructing Python to replace any \r with \n before csv.reader sees the data.

with open('/path/to/file.tsv', 'rb') as f:

Once you read just binary data, your sample input works:

>>> import csv
>>> from io import BytesIO
>>> sample = BytesIO('''\
... "4256996"\t"test@gmail.com"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\N\t\N\t"false"\t"aaa"\t"another-field"\t"true"\t1
... ''')
>>> sample.readline()
'"4256996"\t"test@gmail.com"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\n'
>>> sample.seek(0)
0L
>>> reader = csv.reader(sample, delimiter='\t',
...         lineterminator='\n',
...         quoting=csv.QUOTE_ALL
...     )
>>> next(reader)
['4256996', 'test@gmail.com', 'Y  ', '98230\r', '2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

To illustrate, with the U mode set, Python reads the same line incorrectly:

>>> sample.seek(0)
0L
>>> open('/tmp/test.csv', 'wb').write(sample.read())
>>> f = open('/tmp/test.csv', 'rbU')
>>> f.readline()
'"4256996"\t"test@gmail.com"\t"Y  "\t"98230\n'
>>> f = open('/tmp/test.csv', 'rb')
>>> f.readline()
'"4256996"\t"test@gmail.com"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\n'
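
The sessions above are Python 2. On Python 3, csv.reader expects text rather than bytes, so 'rb' is no longer an option; the csv documentation's advice there is to open the file with newline='' so that Python leaves newline characters untranslated. A minimal sketch of that approach follows. The SAMPLE row is a shortened, hypothetical stand-in for the data in the question, and read_tsv is just an illustrative helper name, not anything from the csv module.

```python
import csv
import os
import tempfile

# Hypothetical, shortened version of the sample row: a carriage return sits
# inside the quoted "98230\r" field and the row itself ends with "\n".
SAMPLE = '"4256996"\t"test@gmail.com"\t"98230\r"\t\\N\t"true"\t1\n'


def read_tsv(path):
    """Read a tab-separated file without letting Python translate newlines."""
    # newline='' turns off newline translation, so the "\r" inside the quoted
    # field reaches csv.reader unchanged.
    with open(path, newline='') as f:
        return list(csv.reader(f, delimiter='\t'))


# Write the sample to a temporary file, also with translation turned off.
with tempfile.NamedTemporaryFile('w', newline='', suffix='.tsv', delete=False) as tmp:
    tmp.write(SAMPLE)

try:
    rows = read_tsv(tmp.name)
finally:
    os.remove(tmp.name)

print(rows)
```

Here rows should hold a single six-field row whose third field still ends in the carriage return. The MySQL null marker arrives as the literal two-character string \N, so mapping it to None is left to the caller.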
