user2748930
user2748930

Reputation: 93

Pandas | merge rows with same id

Here is the example data set

id         firstname    lastname      email               update date
A1         wendy         smith         [email protected]        2018-01-02
A1         wendy         smith         [email protected]     2019-02-03 
A2         harry         lynn          [email protected]      2016-04-03
A2         harry                       [email protected]  2019-03-12
A3         tinna         dickey        [email protected]     2016-04-03
A3         tinna         dickey        [email protected]     2013-06-12
A4         Tom           Lee           [email protected]       2012-06-12
A5         Ella                        [email protected]      2019-07-12
A6         Ben           Lang          [email protected]       2019-03-12

I have sorted the data set by id and update date, I want to merge the rows with same id, if one row with empty value, fill the other one with same id, if confilct, use the latest one. For rows with no duplicate id leave the empty cell as it is.

the output should be:

id         firstname    lastname      email               update date
A1         wendy         smith         [email protected]     2019-02-03 
A2         harry         lynn          [email protected]  2019-03-12
A3         tinna         dickey        [email protected]     2019-03-12
A4         Tom           Lee           [email protected]       2012-06-12
A5         Ella                        [email protected]      2019-07-12
A6         Ben           Lang          [email protected]       2019-03-12

my attempt was using ffill() to merge rows with empty and keep last duplicate, but the result seems to affect other cells which should have empty values(like lastname in A5 should be empty ).

df=df.ffill().drop_duplicates('id',keep='last')

Upvotes: 5

Views: 11898

Answers (4)

bharatk
bharatk

Reputation: 4315

Use

Ex.

df = df.replace('',np.nan, regex=True)
df1 = df.groupby('id',as_index=False,sort=False).last()
print(df1)

   id firstname lastname              email  updatedate
0  A1     wendy    smith     [email protected]  2019-02-03
1  A2     harry     lynn  [email protected]  2019-03-12
2  A3     tinna   dickey     [email protected]  2013-06-12
3  A4       Tom      Lee       [email protected]  2012-06-12
4  A5      Ella      NaN      [email protected]  2019-07-12
5  A6       Ben     Lang       [email protected]  2019-03-12

Upvotes: 2

adrianp
adrianp

Reputation: 1019

Use a combination of groupby, apply, and iloc:

df.groupby('id', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])

   id firstname lastname              email  update date
0  A1     wendy    smith     [email protected]  2019-02-03
1  A2     harry     lynn  [email protected]  2019-03-12
2  A3     tinna   dickey     [email protected]  2019-03-12
3  A4       Tom      Lee       [email protected]  2019-06-12
4  A5      Ella      NaN      [email protected]  2019-07-12
5  A6       Ben     Lang       [email protected]  2019-03-12
  • groupby groups the dataframe by unique ids
  • fillna fills all the NaN values with the row with non-NaN values
  • iloc[-1] gets you the row with the latest data

Upvotes: 0

Divya Dass
Divya Dass

Reputation: 166

Try this:

df.groupby('id').ffill().drop_duplicates('id', keep='last')

output:

   id firstname lastname              email  update date
1  A1     wendy    smith     [email protected]  2019-02-03 
3  A2     harry     lynn  [email protected]   2019-03-12
5  A3     tinna   dickey     [email protected]   2013-06-12
6  A4       Tom      Lee       [email protected]   2012-06-12
7  A5      Ella      NaN      [email protected]   2019-07-12
8  A6       Ben     Lang       [email protected]   2019-03-12

Upvotes: 1

Erfan
Erfan

Reputation: 42886

Use GroupBy.ffill to only forward fill for the same group. Then use drop_duplicates:

df['lastname'] = df.groupby('id')['lastname'].ffill()
df = df.drop_duplicates('id', keep='last')

Or in one line (but less readable in my opinion), using assign:

df.assign(lastname=df.groupby('id')['lastname'].ffill()).drop_duplicates('id', keep='last')

Output

   id firstname lastname              email update date
1  A1     wendy    smith     [email protected]  2019-02-03
3  A2     harry     lynn  [email protected]  2019-03-12
5  A3     tinna   dickey     [email protected]  2013-06-12
6  A4       Tom      Lee       [email protected]  2012-06-12
7  A5      Ella      NaN      [email protected]  2019-07-12
8  A6       Ben     Lang       [email protected]  2019-03-12

Upvotes: 4

Related Questions