Reputation: 2044
I am trying to import csv data from files where sometimes the enclosing char " is missing.
So I have rows like this:
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
# In the second row the closing " after 2200.00 is missing
# also the closing " before EUR" is missing
Now I am reading the csv data with this:
csv.reader(
codecs.open(filename, 'r', encoding='latin-1'),
delimiter=";",
dialect=csv.excel_tab)
And the data I get for the second row is this:
["MacBookPro", "2200.00;EUR"]
Aside from pre-processing my csv files with a unix command like sed and removing all closing chars " and relying on the semicolon to seperate the columns, what else can I do?
Upvotes: 0
Views: 1746
Reputation: 51877
This might work:
import csv
import io
file = io.StringIO(u'''
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
'''.strip())
reader = csv.reader((line.replace('"', '') for line in file), delimiter=';', quotechar='"')
for row in reader:
print(row)
The problem is that if there are any legitimate quoted line, e.g.
"MacBookPro;Awesome Edition";"2200.00";"EUR"
Or, worse:
"MacBookPro:
Description: Awesome Edition";"2200.00";"EUR"
Your output is going to produce too few/many columns. But if you know that's not a problem then it will work fine. You could pre-screen the file by adding this before the read part, which would give you the malformed line:
for line in file:
if line.count(';') != 2:
raise ValueError('No! This file has broken data on line {!r}'.format(line))
file.seek(0)
Or alternatively you could screen as you're reading:
for row in reader:
if any(';' in _ for _ in row):
print('Error:')
print(row)
Ultimately your best option is to fix whatever is producing your garbage csv file.
Upvotes: 2
Reputation: 1875
If you're looping through all the lines/rows of the file, you can use string's .replace() function to get rid off the quotes (if you don't need them later-on for other purposes.).
>>> import csv
>>> with open('eggs.csv', 'rb') as csvfile:
... my_file = csv.reader(codecs.open(filename, 'r', encoding='latin-1')
... delimiter=";",
... dialect=csv.excel_tab)
... )
... for row in my_file:
... (model,price,currency) = row
... model.replace('"','')
... price.replace('"','')
... currency.replace('"','')v
... print 'Model is: %s (costs %s%s).' % (model,price,currency)
>>>
Model is: MacBookPro (costs 2200.00EUR).
Upvotes: 0