khaoula
khaoula

Reputation: 67

Merge two dataframes without duplicate rows

I have a dataframeA

ID-A  ID'-A   Type   Date 
1      2       A   08/2020
3      2       A   09/2020
3      6       A   09/2020
2      4       A   09/2020
2      8       A   12/2020

and a dataframeB

ID-B  ID'-B   Type   Date 
5      2       B   09/2020
6      3       B   10/2020
8      1       B   10/2020
4      10      B   11/2020

i want this output : iterate over DataframeB by date and if ID'-B = ID-A copy the row next

ID-A  ID'-A   Type   Date   ID-B  ID'-B  Type   Date
1      2       A   08/2020  8      1       B   10/2020
3      2       A   09/2020  6      3       B   10/2020
3      6       A   09/2020  Nan    Nan    Nan  Nan
2      4       A   09/2020  5      2       B   09/2020
2      8       A   12/2020  Nan    Nan     Nan  Nan
Nan    Nan    Nan   Nan     4      10      B   11/2020

I tried to use the merge with keys :

out = pd.merge(DataframeA ,DataframeB ,  how='left' , left_on =['ID-A'] ,  right_on =['ID'-B'])

but it give me a duplicate rows

ID-A  ID'-A   Type   Date   ID-B  ID'-B  Type   Date
1      2       A   08/2020  8      1       B   10/2020
3      2       A   09/2020  6      3       B   10/2020
3      6       A   09/2020  6      3       B   10/2020
2      4       A   09/2020  5      2       B   09/2020
2      8       A   12/2020  5      2       B   09/2020
Nan    Nan    Nan   Nan     4      10      B   11/2020

Upvotes: 0

Views: 314

Answers (1)

Chris
Chris

Reputation: 16147

You're going to have to rename your second dataframes columns, otherwise you're going to end up with duplicate columns and the merge with suffix them with _x and _y.

So you need to rename your dfb column first.

Then you can use duplicated to select only the first records from dfa to merge on, then concat to add the duplicates back after the merge.

import pandas as pd

dfa = pd.DataFrame({'ID-A': {0: 1, 1: 3, 2: 3, 3: 2, 4: 2},
 "ID'-A": {0: 2, 1: 2, 2: 6, 3: 4, 4: 8},
 'Type': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A'},
 'Date': {0: '08/2020',
  1: '09/2020',
  2: '09/2020',
  3: '09/2020',
  4: '12/2020'}})


dfb = pd.DataFrame({'ID-B': {0: 5, 1: 6, 2: 8, 3: 4},
 "ID'-B": {0: 2, 1: 3, 2: 1, 3: 10},
 'Type': {0: 'B', 1: 'B', 2: 'B', 3: 'B'},
 'Date': {0: '09/2020', 1: '10/2020', 2: '10/2020', 3: '11/2020'}})

dfb.columns = ['ID-B',"ID'-B",'Type-B','Date-B']


pd.concat([dfa[~dfa.duplicated(subset=['ID-A'])].merge(dfb, left_on=['ID-A'], right_on=["ID'-B"], how='left'),
           dfa[dfa.duplicated(subset=['ID-A'])]],axis=0)

Output

    ID-A    ID'-A   Type    Date    ID-B    ID'-B   Type-B  Date-B
0   1           2   A     08/2020   8.0      1.0    B       10/2020
1   3           2   A     09/2020   6.0      3.0    B       10/2020
2   2           4   A     09/2020   5.0      2.0    B       09/2020
2   3           6   A     09/2020   NaN      NaN    NaN     NaN
4   2           8   A     12/2020   NaN      NaN    NaN     NaN

Upvotes: 1

Related Questions