Reputation: 93
Here is the example data set
id firstname lastname email update date
A1 wendy smith [email protected] 2018-01-02
A1 wendy smith [email protected] 2019-02-03
A2 harry lynn [email protected] 2016-04-03
A2 harry [email protected] 2019-03-12
A3 tinna dickey [email protected] 2016-04-03
A3 tinna dickey [email protected] 2013-06-12
A4 Tom Lee [email protected] 2012-06-12
A5 Ella [email protected] 2019-07-12
A6 Ben Lang [email protected] 2019-03-12
I have sorted the data set by id
and update date
, I want to merge the rows with same id
, if one row with empty value, fill the other one with same id
, if confilct, use the latest one. For rows with no duplicate id
leave the empty cell as it is.
the output should be:
id firstname lastname email update date
A1 wendy smith [email protected] 2019-02-03
A2 harry lynn [email protected] 2019-03-12
A3 tinna dickey [email protected] 2019-03-12
A4 Tom Lee [email protected] 2012-06-12
A5 Ella [email protected] 2019-07-12
A6 Ben Lang [email protected] 2019-03-12
my attempt was using ffill()
to merge rows with empty and keep last duplicate, but the result seems to affect other cells which should have empty values(like lastname in A5 should be empty ).
df=df.ffill().drop_duplicates('id',keep='last')
Upvotes: 5
Views: 11898
Reputation: 4315
Use
DataFrame.groupby
- Group DataFrame or Series using a mapper or by a Series of columns..groupby.GroupBy.last
- Compute last of group values.DataFrame.replace
- Replace values given in to_replace with value.Ex.
df = df.replace('',np.nan, regex=True)
df1 = df.groupby('id',as_index=False,sort=False).last()
print(df1)
id firstname lastname email updatedate
0 A1 wendy smith [email protected] 2019-02-03
1 A2 harry lynn [email protected] 2019-03-12
2 A3 tinna dickey [email protected] 2013-06-12
3 A4 Tom Lee [email protected] 2012-06-12
4 A5 Ella NaN [email protected] 2019-07-12
5 A6 Ben Lang [email protected] 2019-03-12
Upvotes: 2
Reputation: 1019
Use a combination of groupby
, apply
, and iloc
:
df.groupby('id', as_index=False).apply(lambda x: x.fillna(method='ffill').iloc[0])
id firstname lastname email update date
0 A1 wendy smith [email protected] 2019-02-03
1 A2 harry lynn [email protected] 2019-03-12
2 A3 tinna dickey [email protected] 2019-03-12
3 A4 Tom Lee [email protected] 2019-06-12
4 A5 Ella NaN [email protected] 2019-07-12
5 A6 Ben Lang [email protected] 2019-03-12
groupby
groups the dataframe by unique idsfillna
fills all the NaN values with the row with non-NaN valuesiloc[-1]
gets you the row with the latest dataUpvotes: 0
Reputation: 166
Try this:
df.groupby('id').ffill().drop_duplicates('id', keep='last')
output:
id firstname lastname email update date
1 A1 wendy smith [email protected] 2019-02-03
3 A2 harry lynn [email protected] 2019-03-12
5 A3 tinna dickey [email protected] 2013-06-12
6 A4 Tom Lee [email protected] 2012-06-12
7 A5 Ella NaN [email protected] 2019-07-12
8 A6 Ben Lang [email protected] 2019-03-12
Upvotes: 1
Reputation: 42886
Use GroupBy.ffill
to only forward fill for the same group. Then use drop_duplicates
:
df['lastname'] = df.groupby('id')['lastname'].ffill()
df = df.drop_duplicates('id', keep='last')
Or in one line (but less readable in my opinion), using assign
:
df.assign(lastname=df.groupby('id')['lastname'].ffill()).drop_duplicates('id', keep='last')
Output
id firstname lastname email update date
1 A1 wendy smith [email protected] 2019-02-03
3 A2 harry lynn [email protected] 2019-03-12
5 A3 tinna dickey [email protected] 2013-06-12
6 A4 Tom Lee [email protected] 2012-06-12
7 A5 Ella NaN [email protected] 2019-07-12
8 A6 Ben Lang [email protected] 2019-03-12
Upvotes: 4