Joining Two Dataframes while combining two identical columns into one

Question

I am trying to join two dataframes by a buying group ID key. The first dataframe contains the buying group code, customer ID and start/end dates for each customer on a buying group.
The head looks like:

Buying_group, Customer_ID, BG_Start_Dt, BG_End_Dt
123           0001         2020-1-1     2025-1-1      
123           0002         2021-1-1     2024-1-1      
132           0033         2019-1-1     2023-1-1
132           0044         2019-1-1     2023-1-1

The other contains the buying group code, a contract number, customer ID and start/end dates for the contract.
The head looks like:

Buying_group, Contract, ContractStDt, ContractEnDt, Customer_ID 
123           1             2020-1-1      2025-1-1      0001
123           2             2021-1-1      2024-1-1      0002
156           4             2019-1-1      2023-1-1      0003
156           4             2019-1-1      2023-1-1      0004

I am trying to join each df so I have a master df that combines the buying group dates from the first df. The problem I am running in to is when I do an outer or inner join of the two df's by the Buying_group key, that the resulting df has two columns for Customer ID:

Customer_IDx, Customer_IDy

And the dataframe looks like this:

Buying_group, Contract, ContractStDt, ContractEnDt, BG_Start_Dt, BG_End_Dt Customer_IDx, Customer_IDy 
123           1             2020-1-1      2025-1-1  2020-1-1     2025-1-1     0001        0001
123           2             2021-1-1      2024-1-1  2021-1-1     2024-1-1     0002        0002
156           4             2019-1-1      2023-1-1  2022-1-1     2029-1-1     null        0003
156           4             2019-1-1      2023-1-1  2022-1-1     2029-1-1     null        0004
132           5             2019-1-1      2023-1-1  2019-1-1     2023-1-1     0033        null
132           5             2019-1-1      2023-1-1  2019-1-1     2023-1-1     0044        null

The problem here is that each row corresponds to one unique contract for each customer and the merged dataframe needs to have a row for each unique customer ID. I'm really new to these SQL type joins and am wondering if there is a way to do this. Combine both customer ID columns for each joining DF into one column while preserving their unique rows. I tried doing both inner and outer joins but its the same result.

The ideal dataframe should look like this:

Buying_group, Contract, ContractStDt, ContractEnDt, BG_Start_Dt, BG_End_Dt, Customer_ID 
123           1             2020-1-1      2025-1-1  2020-1-1     2025-1-1    0001       
123           2             2021-1-1      2024-1-1  2021-1-1     2024-1-1    0002        
156           4             2019-1-1      2023-1-1  2022-1-1     2029-1-1    0003
156           4             2019-1-1      2023-1-1  2022-1-1     2029-1-1    0004
132           5             2019-1-1      2023-1-1  2019-1-1     2023-1-1    0033        
132           5             2019-1-1      2023-1-1  2019-1-1     2023-1-1    0044

Rishin Rahim · Accepted Answer

Try this

df_3 = pd.concat([df1,df2]).drop_duplicates(subset=['Buying_group','Customer_ID'], 
                                            keep="last").reset_index(drop=True)
df_3

Output

    Buying_group    Customer_ID BG_Start_Dt BG_End_Dt   contract
0   132                00033    2019-1-1    2023-1-1    NaN
1   132                00044    2019-1-1    2023-1-1    NaN
2   123                0001     2020-1-1    2025-1-1    1.0
3   123                0002     2021-1-1    2024-1-1    2.0
4   156                0003     2019-1-1    2023-1-1    4.0
5   156                0004     2019-1-1    2023-1-1    4.0

Joining Two Dataframes while combining two identical columns into one

Answers (1)

Related Questions