Reputation: 35
I have a dataset like this:
customer_id customer_name order_id
1 0000A1 CompanyA 7e2e3978
2 0000A1 CompanyA 7e2e3de2
3 0000A1 CompanyA 7e2e3efa
4 0000B1 CompanyB 7e2e3fc2
5 0000B1 CompanyA 7e2e408a
6 0000B1 CompanyB 7e2e4148
7 0000C1 CompanC 7e2e4206
8 0000C1 CompanyC 7e2e42c4
9 0000C1 CompanyC 7e2e4512
The dataset is sorted using the customer_id. There are a number of ids(customer_id) and values(customer_name) that is supposed to be corresponding to each other (the same id should have the same value). But there are some rows with incorrect data (row 5 and 7 in this case). I want to use pandas to find out these rows.
Now I'm writing my code using some if else loop:
xlsx = pandas.ExcelFile('order-table.xlsx')
df = pandas.read_excel(xlsx, 'Sheet1')
previous_id = "0000A1"
previous_value = "CompanyA"
for (idx, row) in df.iterrows():
current_id = row.loc['customer_id']
current_value = row.loc['customer_name']
if current_id == previous_id:
if current_value == previous_value:
df.loc[idx, "same"] = "true"
else:
df.loc[idx, "same"] = "false"
else:
previous_id = current_id
previous_value = current_value
df.to_excel("order-table-marked.xlsx")
This can generate a column that can mark out the rows with incorrect data. But I assume it's not the best approach. Is there a better way of doing this in pandas? Is it faster to do this using groupby() or drop_duplicate() and how to do it?
Upvotes: 1
Views: 1620
Reputation: 171
This is a tough problem because you have to determine which customer_name is the correct one, and which is are errors. Your attempt at a solution assumes the first instance is the correct one, but that isn't necessarily always true.
The best way would be to build a dictionary directly mapping the customer_id to the customer_name, it would then be easy enough to backfill and/or check which answers are correct. But building the dictionary manually could be very difficult and/or time consuming.
One way of getting around this is to assume that for each customer_id, the customer_name that occurs most often is the correct one. In which case it should be as simple as this:
df = pd.DataFrame({'customer_id': ['0000A1', '0000A1', '0000A1', '0000B1', '0000B1', '0000B1'],
'customer_name': ['CompanyA', 'CompanyB', 'CompanyA', 'CompanB', 'CompanyB', 'CompanyB']})
for id in df['customer_id'].unique():
correct_name = df[df['customer_id'] == id]['customer_name'].value_counts().idxmax()
df.loc[df['customer_id'] == id, 'same'] = (df[df['customer_id'] == id]['customer_name'] == correct_name)
customer_id customer_name same
0 0000A1 CompanyA True
1 0000A1 CompanyB False
2 0000A1 CompanyA True
3 0000B1 CompanB False
4 0000B1 CompanyB True
5 0000B1 CompanyB True
This might not work 100% of the time, if the wrong customer_name is used more often or the same amount as the correct one then we can't programmatically figure out which one is the right one to use (unless there are some overarching rules like length or format that you can condition on).
I hope this gives you something you can work from at least.
Upvotes: 2