Reputation: 165
I have two Pandas Dataframes, named table_A and table_B:
table_A
ID | COST | DATE | MONTH |
---|---|---|---|
CLSLT88150 | 655.118 | 2019-12-30 | 1 |
CLSLT88150 | 652.758 | 2019-12-30 | 1 |
table_B
ID | NUMBER | ORDER | DEVICE |
---|---|---|---|
CLSLT88150 | 4.45882e+07 | WO-SLT-GLI-REC-0009-5067 | Apple iPhone XR |
CLSLT88150 | 4.45882e+07 | WO-SLT-GLI-REC-0009-5786 | Apple iPhone XR |
I want to merge these two tables on column ID to obtain a result similar to this table:
table_merge
ID | COST | DATE | MONTH | ORDER | NUMBER | DEVICE |
---|---|---|---|---|---|---|
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 652.758 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
Each ID can have multiple ORDER. Each ORDER can only appear once and has a unique COST associated. It is indifferent which COST is associated with which ORDER, so this table could also be an acceptable result:
ID | COST | DATE | MONTH | ORDER | NUMBER | DEVICE |
---|---|---|---|---|---|---|
CLSLT88150 | 652.758 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
I have used the following script to merge table_A with table_B:
table_merge = table_A.merge(table_B)
Obtaining this undesired result:
ID | COST | DATE | MONTH | ORDER | NUMBER | DEVICE |
---|---|---|---|---|---|---|
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 652.758 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 652.758 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
I tried and failed to remove the undesired rows with this script:
table_merge = table_merge.drop_duplicates(subset=['ORDER','ID'])
After dropping duplicates table_merge looks like this:
ID | COST | DATE | MONTH | ORDER | NUMBER | DEVICE |
---|---|---|---|---|---|---|
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
As it can be appreciated the COST 655.118 is repeated and the other COST of 652.758 is missing.
How can I obtain the desired result?
Thank you.
Upvotes: 0
Views: 72
Reputation: 165
All undesired repetitions appear in groups of 4 rows. By simply selecting the first and last row the duplicates issue is resolved. The code looks like this:
def resolve_duplicates(df):
indexes = df.index
return df[(df.index == indexes[0]) | (df.index == indexes[-1])]
Using as input:
ID | COST | DATE | MONTH | ORDER | NUMBER | DEVICE |
---|---|---|---|---|---|---|
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 652.758 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 652.758 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
The output of the function is:
ID | COST | DATE | MONTH | ORDER | NUMBER | DEVICE |
---|---|---|---|---|---|---|
CLSLT88150 | 655.118 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5067 | 4.45882e+07 | Apple iPhone XR |
CLSLT88150 | 652.758 | 2019-12-30 | 1 | WO-SLT-GLI-REC-0009-5786 | 4.45882e+07 | Apple iPhone XR |
Upvotes: 2