Nublia

Reputation: 35

pandas: check if the same id has the same value in a dataset

I have a dataset like this:

  customer_id customer_name order_id
1 0000A1       CompanyA     7e2e3978
2 0000A1       CompanyA     7e2e3de2
3 0000A1       CompanyA     7e2e3efa
4 0000B1       CompanyB     7e2e3fc2
5 0000B1       CompanyA     7e2e408a
6 0000B1       CompanyB     7e2e4148
7 0000C1       CompanC      7e2e4206
8 0000C1       CompanyC     7e2e42c4
9 0000C1       CompanyC     7e2e4512

The dataset is sorted by customer_id. Each id (customer_id) is supposed to correspond to exactly one value (customer_name), i.e. the same id should always have the same value. But some rows contain incorrect data (rows 5 and 7 in this case), and I want to use pandas to find those rows.

Right now my code uses an if/else loop:

import pandas

xlsx = pandas.ExcelFile('order-table.xlsx')
df = pandas.read_excel(xlsx, 'Sheet1')
previous_id = "0000A1"
previous_value = "CompanyA"
for (idx, row) in df.iterrows():
    current_id = row.loc['customer_id']
    current_value = row.loc['customer_name']
    if current_id == previous_id:
        # same customer_id as the previous row: compare the names
        if current_value == previous_value:
            df.loc[idx, "same"] = "true"
        else:
            df.loc[idx, "same"] = "false"
    else:
        # new customer_id: remember it as the new reference
        previous_id = current_id
        previous_value = current_value

df.to_excel("order-table-marked.xlsx")

This generates a column that marks the rows with incorrect data, but I assume it's not the best approach. Is there a better way of doing this in pandas? Would groupby() or drop_duplicates() be faster, and how would I use them here?

Upvotes: 1

Views: 1620

Answers (1)

Joel Leeb-du Toit

Reputation: 171

This is a tough problem because you have to determine which customer_name is the correct one and which ones are errors. Your attempted solution assumes the first instance is the correct one, but that isn't necessarily always true.

The best way would be to build a dictionary that directly maps each customer_id to its correct customer_name; it would then be easy enough to backfill and/or check which values are correct. But building that dictionary manually could be very difficult and/or time consuming.
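For illustration, here is a minimal sketch of that approach, assuming you already have such a mapping (the id_to_name dictionary below is hypothetical and just mirrors the example data in the question):

import pandas as pd

df = pd.read_excel('order-table.xlsx', sheet_name='Sheet1')

# Hypothetical hand-built mapping of each customer_id to its correct customer_name
id_to_name = {'0000A1': 'CompanyA', '0000B1': 'CompanyB', '0000C1': 'CompanyC'}

# A row is correct when its customer_name matches the mapped name for its customer_id
df['same'] = df['customer_name'] == df['customer_id'].map(id_to_name)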

One way of getting around this is to assume that, for each customer_id, the customer_name that occurs most often is the correct one. In that case it should be as simple as this:

import pandas as pd

df = pd.DataFrame({'customer_id': ['0000A1', '0000A1', '0000A1', '0000B1', '0000B1', '0000B1'],
                   'customer_name': ['CompanyA', 'CompanyB', 'CompanyA', 'CompanB', 'CompanyB', 'CompanyB']})

for cid in df['customer_id'].unique():
    # take the most frequent customer_name for this id as the "correct" one
    correct_name = df[df['customer_id'] == cid]['customer_name'].value_counts().idxmax()
    # mark rows whose customer_name matches that reference value
    df.loc[df['customer_id'] == cid, 'same'] = (df[df['customer_id'] == cid]['customer_name'] == correct_name)

    customer_id customer_name   same
0   0000A1      CompanyA        True
1   0000A1      CompanyB        False
2   0000A1      CompanyA        True
3   0000B1      CompanB         False
4   0000B1      CompanyB        True
5   0000B1      CompanyB        True

This might not work 100% of the time: if the wrong customer_name is used more often than, or as often as, the correct one, then we can't programmatically figure out which one is right (unless there are some overarching rules, like length or format, that you can condition on).
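Since you asked about groupby(): if you want to avoid the explicit loop over ids, the same majority-vote check can be written with groupby() and transform(). This is a sketch under the same assumption that the most frequent name per id is the correct one:

import pandas as pd

df = pd.DataFrame({'customer_id': ['0000A1', '0000A1', '0000A1', '0000B1', '0000B1', '0000B1'],
                   'customer_name': ['CompanyA', 'CompanyB', 'CompanyA', 'CompanB', 'CompanyB', 'CompanyB']})

# for each customer_id, broadcast its most frequent customer_name back onto its rows
reference = df.groupby('customer_id')['customer_name'].transform(lambda s: s.value_counts().idxmax())

# a row is marked True when its customer_name matches the per-id reference
df['same'] = df['customer_name'] == reference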

I hope this gives you something you can work from at least.

Upvotes: 2
