Reputation: 41
I'm trying to load the following test.csv file:
R1C1 R1C2 R1C3
R2C1 R2C2 R2C3
R3C1 "R3C2 R3C3
R4C1 R4C2 R4C3
... Using this Python script :
import csv
with open("test.csv") as f:
for row in csv.reader(f, delimiter='\t'):
print(row)
The result I got was the following :
['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2\tR3C3\nR4C1\tR4C2\tR4C3\n']
It turns out that when Python finds a field whose first character is a quotation mark and there is no closing quotation mark, it will include all of the following content as part of the same field.
My question: What is the best approach for all rows in the file to be read properly? Please consider I'm using Python 3.8.5 and the script should be able to read huge files (2gb or more), so memory usage and performance issues should be also considered.
Upvotes: 0
Views: 461
Reputation: 8952
Honestly, if you're dealing with that much data, it'd be best to go in and clean it first. And if possible, fix whatever process is producing your bad data in the first place.
I haven't tested with a large file, but you may just be able to replace "
characters as you read lines, assuming there's never a case where they're valid characters:
import csv
with open("test.csv") as f:
line_generator = (line.replace('"', '') for line in f)
for row in csv.reader(line_generator, delimiter='\t'):
print(row)
Output:
['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2', 'R3C3']
['R4C1', 'R4C2', 'R4C3']
Upvotes: 1