Reputation: 615
I have this read function where it reads a csv file using csv.DictReader. The file.csv is separated by commas and it fully reads. However, this part of my file has a column that contains multiple commas. My question is, how can I make sure that comma is counted as part of a column? I cannot alter my csv file to meet the criteria.
Text File:
ID,Name,University,Street,ZipCode,Country
12,Jon Snow,U of Winterfell,Winterfell #45,60434,Westeros
13,Steve Rogers,NYU,108, Chelsea St.,23333,United States
20,Peter Parker,Yale,34, Tribeca,32444,United States
34,Tyrion Lannister,U of Casterly Rock,Kings Landing #89, 43543,Westeros
The desired output is this:
{'ID': '12', 'Name': 'Jon Snow', 'University': 'U of Winterfell', 'Street': 'Winterfell #45', 'ZipCode': '60434', 'Country': 'Westeros'}
{'ID': '13', 'Name': 'Steve Rogers', 'University': 'NYU', 'Street': '108, Chelsea St.', 'ZipCode': '23333', 'Country': 'United States'}
{'ID': '20', 'Name': 'Peter Parker', 'University': 'Yale', 'Street': '34, Tribeca', 'ZipCode': '32444', 'Country': 'United States'}
{'ID': '34', 'Name': 'Tyrion Lannister', 'University': 'U of Casterly Rock', 'Street': 'Kings Landing #89', 'ZipCode': '43543', 'Country': 'Westeros'}
As you can tell the 'Street' has at least two commas due to the numbers:
13,Steve Rogers,NYU,108, Chelsea St.,23333,United States
20,Peter Parker,Yale,34, Tribeca,32444,United States
Note: Most of the columns being read splits by a str,str BUT under the 'Street' column it is followed by a str, str (there is an extra space after the comma). I hope this makes sense.
The options I tried looking out is using re.split, but I don't know how to implement it on my read file. I was thinking re.split(r'(?!\s),(?!\s)',x[:-1])
? How can I make sure the format from my file will count as part of any column? I can't use pandas.
My current output looks like this right now:
{'ID': '12', 'Name': 'Jon Snow', 'University': 'U of Winterfell', 'Street': 'Winterfell #45', 'ZipCode': '60434', 'Country': 'Westeros'}
{'ID': '13', 'Name': 'Steve Rogers', 'University': 'NYU', 'Street': '108', 'ZipCode': 'Chelsea St.', 'Country': '23333', None: ['United States']}
{'ID': '20', 'Name': 'Peter Parker', 'University': 'Yale', 'Street': '34', 'ZipCode': 'Tribeca', 'Country': '32444', None: ['United States']}
{'ID': '34', 'Name': 'Tyrion Lannister', 'University': 'U of Casterly Rock', 'Street': 'Kings Landing #89', 'ZipCode': '43543', 'Country': 'Westeros'}
This is my read function:
import csv
list = []
with open('file.csv', mode='r') as csv_file:
csv_reader = csv.DictReader(csv_file, delimiter=",", skipinitialspace=True)
for col in csv_reader:
list.append(dict(col))
print(dict(col))
Upvotes: 1
Views: 1686
Reputation: 364
The actual solution for the problem is modifying the script that generates the csv file.
If you have a chance to modify that output you can do 2 things
|
symbol or ;
whatever you believe it doesn't exist in the string. "
so you'll be able to split them by ,
which are actual separators. If you don't have a chance to modify the output.
And if you are sure about that multiple commas are only in the street column; then you should use csv.reader
instead of DictReader
this way you can get the columns by Indexes that you are already sure. for instance row[0]
will be ID
row[1]
will be Name
and row[-1]
will be Country
row[-2]
will be ZipCode
so row[2:-2]
would give you what you need i guess. Indexes can be arranged but the idea is clear I guess.
Hope that helps.
Edit:
import csv
list = []
with open('file.csv', mode='r') as csv_file:
csv_reader = csv.reader(csv_file, delimiter=",", skipinitialspace=True)
# pass the header row
next(csv_reader)
for row in csv_reader:
list.append({"ID": row[0],
"Name": row[1],
"University": row[2],
"Street": ' '.join(row[3:-2]),
"Zipcode": row[-2],
"Country": row[-1]})
print(list)
-- Here is the output (with pprint)
[{'Country': 'Westeros',
'ID': '12',
'Name': 'Jon Snow',
'Street': 'Winterfell #45',
'University': 'U of Winterfell',
'Zipcode': '60434'},
{'Country': 'United States',
'ID': '13',
'Name': 'Steve Rogers',
'Street': '108 Chelsea St.',
'University': 'NYU',
'Zipcode': '23333'},
{'Country': 'United States',
'ID': '20',
'Name': 'Peter Parker',
'Street': '34 Tribeca',
'University': 'Yale',
'Zipcode': '32444'},
{'Country': 'Westeros',
'ID': '34',
'Name': 'Tyrion Lannister',
'Street': 'Kings Landing #89',
'University': 'U of Casterly Rock',
'Zipcode': '43543'}]
-- second edit edited the index on the street. Regards.
Upvotes: 0
Reputation: 781340
You can't use csv
if the file isn't valid CSV format.
You need to call re.split()
on ordinary lines, not on dictionaries.
list = []
with open('file.csv', mode='r') as csv_file:
keys = csv_file.readline().strip().split(',') # Read header line
for line in csv_file:
line = line.strip()
row = re.split(r'(?!\s),(?!\s)',line)
list.append(dict(zip(keys, row)))
Upvotes: 1