Combining three datasets removing duplicates

Question

I've three datasets:

dataset 1

Customer1            Customer2          Exposures  + other columns
Nick McKenzie       Christopher Mill      23450
Nick McKenzie       Stephen Green         23450
Johnny Craston      Mary Shane              12
Johnny Craston      Stephen Green           12
Molly John          Casey Step          1000021

dataset2 (unique Customers: Customer 1 + Customer 2)

Customer            Age
Nick McKenzie       53
Johnny Craston      75
Molly John          34
Christopher Mill    63
Stephen Green       65
Mary Shane          54
Casey Step          34
Mick Sale

dataset 3

Customer1            Customer2          Exposures  + other columns
Mick Sale           Johnny Craston           
Mick Sale           Stephen Green

Exposures refers to Customer 1 only.

There are other columns omitted for brevity. Dataset 2 is built by getting unique customer 1 and unique customer 2: no duplicates are in that dataset. Dataset 3 has the same column of dataset 1.

I'd like to add the information from dataset 1 into dataset 2 to have

Final dataset

Customer            Age    Exposures   +   other columns
Nick McKenzie       53       23450
Johnny Craston      75       12
Molly John          34      1000021
Christopher Mill    63
Stephen Green       65
Mary Shane          54
Casey Step          34
Mick Sale

The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates. I have tried to combine them as follows

result = pd.concat([df2,df1,df3], axis=1)

but the result is not that one I'd expect. Something wrong is in my way of concatenating the datasets and I'd appreciate it if you can let me know what is wrong.

user2736738 · Accepted Answer

After concatenating the dataframe df1 and df2 (assuming they have same columns), we can remove the duplicates using df1.drop_duplicates(subset=['customer1']) and then we can join with df2 like this

df1.set_index('Customer1').join(df2.set_index('Customer'))

In case df1 and df2 has different columns based on the primary key we can join using the above command and then again join with the age table.

This would give the result. You can concatenate dataset 1 and datatset 3 because they have same columns. And then run this operation to get the desired result. I am joining specifying the respective keys.

Note: Though not related to the question but for the concatenation one can use this code pd.concat([df1, df3],ignore_index=True) (Here we are ignoring the index column)

Combining three datasets removing duplicates

Answers (1)

Related Questions