JesseBuesking
JesseBuesking

Reputation: 6586

python tab separated file parsing problems

From mysql I am generating a tab-separated output file using outfile. I then use python to load the tsv and process it. I feel like I'm missing something, but I cannot figure out how to get csv.reader to accept data where quoted fields can contain \t tabs, \n newlines, \r carriage returns, etc. The csv.reader keeps breaking the rows on all newline characters, not just the \n newline characters outside of my quoted fields.

Settings:

with open('/path/to/file.tsv', 'rbU') as f:
    reader = csv.reader(
        f,
        delimiter='\t',
        lineterminator='\n',
        quoting=csv.QUOTE_ALL
    )
    for line in reader:
        #  do something

Example:

In the example below, \r is an actual carriage return, \n is an actual newline, and \N is what mysql is outputting for a null value.

"4256996"   "[email protected]"    "Y  "   "98230\r"   "2012-07-10T12:00:00"   "some  location"    \N  \N  "false" "aaa"   "another-field" "true"  1

The resulting output:

['4256996', '[email protected]', 'Y\t', '98230'], ['2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

Is there a way to get the csv.reader to read this input data properly, or is this some sort of limitation with the csv.reader object?

Note: If you try to replicate this, make sure you replace \r with an actual carriage return, \n with an actual newline, etc.

Upvotes: 0

Views: 575

Answers (1)

Martijn Pieters
Martijn Pieters

Reputation: 1121884

You need to open your file in binary mode only. By adding in 'U' (universal newline mode) you are instead instructing Python to replace any \r with \n.

with open('/path/to/file.tsv', 'rb') as f:

Once reading just binary data your sample input works:

>>> import csv
>>> from io import BytesIO
>>> sample = BytesIO('''\
... "4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n''')
>>> sample.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
>>> sample.seek(0)
0L
>>> reader = csv.reader(sample, delimiter='\t',
...         lineterminator='\n',
...         quoting=csv.QUOTE_ALL
...     )
>>> next(reader)
['4256996', '[email protected]', 'Y  ', '98230\r', '2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

To illustrate, reading a line with the U mode set Python reads the data incorrectly:

>>> sample.seek(0)
0L
>>> open('/tmp/test.csv', 'wb').write(sample.read())
>>> f = open('/tmp/test.csv', 'rbU')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\n'
>>> f = open('/tmp/test.csv', 'rb')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'

Upvotes: 1

Related Questions