GGS
GGS

Reputation: 165

Resolve duplicates after Pandas merge

I have two Pandas Dataframes, named table_A and table_B:

table_A

ID COST DATE MONTH
CLSLT88150 655.118 2019-12-30 1
CLSLT88150 652.758 2019-12-30 1

table_B

ID NUMBER ORDER DEVICE
CLSLT88150 4.45882e+07 WO-SLT-GLI-REC-0009-5067 Apple iPhone XR
CLSLT88150 4.45882e+07 WO-SLT-GLI-REC-0009-5786 Apple iPhone XR

I want to merge these two tables on column ID to obtain a result similar to this table:

table_merge

ID COST DATE MONTH ORDER NUMBER DEVICE
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 652.758 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR

Each ID can have multiple ORDER. Each ORDER can only appear once and has a unique COST associated. It is indifferent which COST is associated with which ORDER, so this table could also be an acceptable result:

ID COST DATE MONTH ORDER NUMBER DEVICE
CLSLT88150 652.758 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR

I have used the following script to merge table_A with table_B:

table_merge = table_A.merge(table_B)

Obtaining this undesired result:

ID COST DATE MONTH ORDER NUMBER DEVICE
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR
CLSLT88150 652.758 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 652.758 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR

I tried and failed to remove the undesired rows with this script:

table_merge = table_merge.drop_duplicates(subset=['ORDER','ID'])

After dropping duplicates table_merge looks like this:

ID COST DATE MONTH ORDER NUMBER DEVICE
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR

As it can be appreciated the COST 655.118 is repeated and the other COST of 652.758 is missing.

How can I obtain the desired result?

Thank you.

Upvotes: 0

Views: 72

Answers (1)

GGS
GGS

Reputation: 165

All undesired repetitions appear in groups of 4 rows. By simply selecting the first and last row the duplicates issue is resolved. The code looks like this:

def resolve_duplicates(df):
    indexes = df.index
    return df[(df.index == indexes[0]) | (df.index == indexes[-1])]

Using as input:

ID COST DATE MONTH ORDER NUMBER DEVICE
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR
CLSLT88150 652.758 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 652.758 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR

The output of the function is:

ID COST DATE MONTH ORDER NUMBER DEVICE
CLSLT88150 655.118 2019-12-30 1 WO-SLT-GLI-REC-0009-5067 4.45882e+07 Apple iPhone XR
CLSLT88150 652.758 2019-12-30 1 WO-SLT-GLI-REC-0009-5786 4.45882e+07 Apple iPhone XR

Upvotes: 2

Related Questions