Reputation: 1709
I have some lines in a CSV file like this:
1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2
If you notice, some numbers are enclosed in " " and has a thousand separator ",". I want to remove the thousand separator and the double quote enclosure. For the qoute enclosure, I'm thinking of using string.replace() but how about the comma inside the quote marks?
What's the best way of doing this in Python?
Upvotes: 1
Views: 3097
Reputation: 96702
You should absolutely use the csv
module. If you use a csv.reader
, you only have one very small problem: testing fields to see if they're numbers, and stripping commas if they are. I've packaged it as a generator:
import csv
def read_and_fix_numbers(f):
"""Iterate over a file object that returns CSV data, stripping commas out of numbers."""
for row in csv.reader(f):
for field in row:
try:
x = float(field)
field.replace(",", "")
except ValueError:
pass
fixed.append(field)
yield fixed
Usage:
>>> data = '1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2'
>>> import StringIO
>>> f = StringIO.StringIO(data)
>>> for row in read_and_fix_numbers(f):
print row
['1000001234', 'Account Name', '0', '0', '3711.32', '0', '0', '18629.64', '22340.96', 'COD', '20000.00', 'Some string', 'Some string 2']
Upvotes: 1
Reputation: 3231
Use the csv module. It has all sorts of constants and parameters to help you set the delimiters, quotes, and everything else for the type of file you are working with. It even has a Sniffer that can help you identify the csv format of the file. In fact this is the only module I have found that can properly and easily work with csv files.
http://docs.python.org/library/csv.html
Upvotes: 1
Reputation: 123821
Here is something I just tested, you may not need pprint, I just want to use for clear output.
test.csv
1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2
1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2
Code, use csv reader, and pass each item to parseNum function to check valid digit or not.
from pprint import pprint
import csv
def parseNum(x):
xx=x.replace(",","")
if not xx.replace(".","").isdigit(): return x
return "." in xx and float(xx) or int(xx)
x=[map(parseNum,line) for line in csv.reader(open("test.csv"))]
pprint(x)
Output
[[1000001234,
'Account Name',
0,
0,
3711.3200000000002,
0,
0,
18629.639999999999,
22340.959999999999,
'COD',
20000.0,
'Some string',
'Some string 2'],
[1000001234,
'Account Name',
0,
0,
3711.3200000000002,
0,
0,
18629.639999999999,
22340.959999999999,
'COD',
20000.0,
'Some string',
'Some string 2']]
Note: If you need good precision on float numbers, replace float with Decimal
Upvotes: 1
Reputation: 885
Here is a bit of regular expression fiddling that will do the trick:
>>> import re
>>> p = re.compile('["]([^"]*)["]')
>>> x = """1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2"""
>>> p.sub(lambda m: m.groups()[0].replace(',',''), x)
'1000001234,Account Name,0,0,3711.32,0,0,18629.64,22340.96,COD,20000.00,Some string,Some string 2'
Removes the commas from the parts of the string that is between pairs of quotes.
Upvotes: 2
Reputation: 881575
If all you want is to remove double quotes and commas from a string, a couple of replaces will do it:
s = s.replace('"','').replace(',','')
A faster way is to use s.translate
, but that requires a minimum of preparation:
import string
identity = string.maketrans('', '')
...
s = s.translate(identity, '",')
This removes any occurrence of double quotes or commas, and does it pretty fast too. In general, the .translate
method of string objects is the best way to remove certain kinds of characters from a string (as well as possibly performing some character-to-character translation, but, by using a translate table such as the identity
one I show here, the translation part may in fact be easily bypassed). Note that .translate
works a bit differently for Unicode objects (and therefore for Python 3 strings, too) -- I'm giving the approach that's suitable for plain Python 2 string objects.
Upvotes: 1
Reputation: 3446
You could simply parse the CSV, make the necessary changes and then write it again.
(I haven't tested this code but it should be something like this)
import csv
reader = csv.reader(open('IN.csv', 'r'))
writer = csv.writer(open('OUT.csv', 'w')
for row in reader:
# do stuff to the row here
# row is just a list of items
writer.writerow(row)
Upvotes: 2