Reputation: 543
An extension to Problems reading CSV file with commas and characters in pandas
Solution provided in the above link works if and only if one column which contains commas as its characters and rest of the columns are well behaved.
What if there are more than one column with above issue?
Example CSV content with additional commas issue :
Name,Age,Address,Phone,Qualification
Suresh,28,Texas,3334567892,B.Tech
Ramesh,24,NewYork, NY,8978974040,9991111234,Ph.D
Mukesh,26,Dallas,4547892345,Ph.D
Required Output Pandas DataFrame:
Name Age Address Phone Qualification
Suresh 28 Texas 3334567892 B.Tech
Ramesh 24 NewYork, NY 8978974040,9991111234 Ph.D
Mukesh 26 Dallas 4547892345 Ph.D
Edited :
Input file with commas as characters in successive columns :
Name,Age,Address,Qualification,Grade
Suresh,28,Texas,B.Tech,Ph.D,A
Ramesh,24,NewYork, NY,B.Tech,A+
Mukesh,26,Dallas,B.Tech,Ph.D,A
Required Output Pandas DataFrame:
Name Age Address Qualification Grade
Suresh 28 Texas B.Tech,Ph.D A
Ramesh 24 NewYork, NY B.Tech A+
Mukesh 26 Dallas B.Tech,Ph.D A
Can I get any suggestions to solve this issue?
Thanks in Advance!!!
Upvotes: 0
Views: 2839
Reputation: 46759
Your data appears fixed for the first two columns and also the last, so these can be removed and the remaining values could be processed using itertools.groupby()
to group the remaining columns into numeric or non-numeric groups. The resulting data could then be loaded into pandas:
import pandas as pd
from itertools import groupby
import csv
data = []
with open('input.csv', newline='') as f_input:
csv_input = csv.reader(f_input)
header = next(csv_input)
for row in csv_input:
addr_phone = [','.join(g) for k, g in groupby(row[2:-1], lambda x: x.isdigit())]
data.append(row[:2] + addr_phone + [row[-1]])
df = pd.DataFrame(data, columns=header)
print(df)
Giving you:
Name Age Address Phone Qualification
0 Suresh 28 Texas 3334567892 B.Tech
1 Ramesh 24 NewYork, NY 8978974040,9991111234 Ph.D
2 Mukesh 26 Dallas 4547892345 Ph.D
To work with your second example, you would have to decide on a way to split the two columns. I would suggest you create a list of possible qualifications. When there is a match, you would be able to split at that point. For example:
import pandas as pd
import csv
def find_split(data):
for index, v in enumerate(data):
if v.lower() in ['b.tech', 'ph.d']:
return [', '.join(data[:index]), ', '.join(data[index:])]
return [', '.join(data), '']
data = []
with open('input.csv', newline='') as f_input:
csv_input = csv.reader(f_input, skipinitialspace=True)
header = next(csv_input)
for row in csv_input:
data.append(row[:2] + find_split(row[2:-1]) + [row[-1]])
df = pd.DataFrame(data, columns=header)
print(df)
Giving you:
Name Age Address Qualification Grade
0 Suresh 28 Texas B.Tech, Ph.D A
1 Ramesh 24 NewYork, NY B.Tech A+
2 Mukesh 26 Dallas B.Tech, Ph.D A
You could create a list of qualifications by first creating a set()
based on the contents of row[2]
(lowercased). Print the contents of the set and then add that to the script and rerun it.
Upvotes: 2
Reputation: 9081
One way to do this would be to have "
to clearly separate your data -
Name,Age,Address,Phone,Qualification
Suresh,28,Texas,3334567892,B.Tech
Ramesh,24,"NewYork, NY","8978974040,9991111234",Ph.D
Mukesh,26,Dallas,4547892345,Ph.D
If this isn't there, pandas
will struggle to read it right.
Copy the above data, do a pd.read_clipboard(sep=',')
and it will yield -
Name Age Address Phone Qualification
0 Suresh 28 Texas 3334567892 B.Tech
1 Ramesh 24 NewYork, NY 8978974040,9991111234 Ph.D
2 Mukesh 26 Dallas 4547892345 Ph.D
If modifying the source data as a whole is not within your means-
A practical approach would be to do a usual read_csv
with error_bad_lines=False
. Once done, look through the logs and make a note of the lines that pandas
is struggling to read and modify only those lines accordingly.
Hope this helps.
Upvotes: 3