dingo

Reputation: 323

Pandas replace last item in groupby if NaN with another column

I'm trying to replace the value in the last row of each group with the value from another column, but only if it is null. I can do each of these pieces separately but can't seem to combine them. Does anyone have any ideas?

These are the separate pieces:

# replace any NaN values with values from 'target'
df.loc[df['target'].isnull(),'target'] = df['value']

# replace last value in groupby with value from 'target'
df.loc[df.groupby('id').tail(1).index,'target'] = df['value']

Original Data:

    date        id      value       target
0   2020-08-07  id01    0.100775    NaN
1   2020-08-08  id01    0.215885    0.215885
2   2020-08-09  id01    0.012154    0.012154
3   2020-08-10  id01    0.374503    NaN
4   2020-08-07  id02    0.369707    0.369707
5   2020-08-08  id02    0.676743    0.676743
6   2020-08-09  id02    0.659521    0.659521
7   2020-08-10  id02    0.799071    NaN

Desired result: in the last row of each groupby('id') group, replace 'target' with what is in 'value' if it is NaN:

    date        id      value       target
0   2020-08-07  id01    0.100775    NaN
1   2020-08-08  id01    0.215885    0.215885
2   2020-08-09  id01    0.012154    0.012154
3   2020-08-10  id01    0.374503    0.374503
4   2020-08-07  id02    0.369707    0.369707
5   2020-08-08  id02    0.676743    0.676743
6   2020-08-09  id02    0.659521    0.659521
7   2020-08-10  id02    0.799071    0.799071

Upvotes: 4

Views: 563

Answers (4)

wwnde

Reputation: 26676

With combine_first, you are spoiled for choice.

First option

Use .groupby() with nth(value)

m = df.groupby('id', as_index=False).nth(-1).index  # index of the last row in each group
df.loc[m, 'target'] = df['target'].combine_first(df['value'])  # fill NaN targets from 'value'

Second Option

Use combine_first with the .iloc accessor to take the last target and the last value in each group, in a single groupby expression

# column positions 3 and 2 are 'target' and 'value' in the sample frame
df.groupby('id').apply(lambda x: x.iloc[-1:, 3].combine_first(x.iloc[-1:, 2]))\
  .reset_index(level=0).combine_first(df)

Third Option

Select the last row in each group, fill target there only where it is null, and update df using combine_first

g = df.groupby('id').apply(lambda x: x.iloc[-1:]).reset_index(level=0, drop=True)
g['target'] = g['target'].fillna(g['value'])  # fill only where target is NaN
g.combine_first(df)



   date    id     value    target
0  2020-08-07  id01  0.100775       NaN
1  2020-08-08  id01  0.215885  0.215885
2  2020-08-09  id01  0.012154  0.012154
3  2020-08-10  id01  0.374503  0.374503
4  2020-08-07  id02  0.369707  0.369707
5  2020-08-08  id02  0.676743  0.676743
6  2020-08-09  id02  0.659521  0.659521
7  2020-08-10  id02  0.799071  0.799071

Upvotes: 1

Terry

Reputation: 2811

Find the index of the last row in each group with groupby().tail(1), then replace only the null values there using .combine_first()

indexes = df.groupby('id').tail(1).index
df.loc[indexes, 'target'] = df['target'].combine_first(df['value'])
#result
    date        id      value       target
0   2020-08-07  id01    0.100775    NaN
1   2020-08-08  id01    0.215885    0.215885
2   2020-08-09  id01    0.012154    0.012154
3   2020-08-10  id01    0.374503    0.374503
4   2020-08-07  id02    0.369707    0.369707
5   2020-08-08  id02    0.676743    0.676743
6   2020-08-09  id02    0.659521    0.659521
7   2020-08-10  id02    0.799071    0.799071
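
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The frame is rebuilt from the question's sample data; the names last_rows and filled are only illustrative. The point it tries to show is that assignment through .loc aligns on the index, so passing a full-length Series on the right-hand side still writes only the selected rows.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "date": ["2020-08-07", "2020-08-08", "2020-08-09", "2020-08-10"] * 2,
    "id": ["id01"] * 4 + ["id02"] * 4,
    "value": [0.100775, 0.215885, 0.012154, 0.374503,
              0.369707, 0.676743, 0.659521, 0.799071],
    "target": [np.nan, 0.215885, 0.012154, np.nan,
               0.369707, 0.676743, 0.659521, np.nan],
})

# Index labels of the last row in each 'id' group.
last_rows = df.groupby("id").tail(1).index

# combine_first keeps 'target' where present and falls back to 'value'.
filled = df["target"].combine_first(df["value"])

# .loc assignment aligns on the index, so only the rows in last_rows
# are written; the NaN in row 0 is left alone.
df.loc[last_rows, "target"] = filled

print(df)
```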

Upvotes: 1

ALollz

Reputation: 59569

Use fillna on the entire column, then mask back to NaN any originally missing value that isn't in the last row for its 'id'.

m = df['target'].isnull() & df['id'].duplicated(keep='last')
df['target'] = df['target'].fillna(df['value']).mask(m)

         date    id     value    target
0  2020-08-07  id01  0.100775       NaN
1  2020-08-08  id01  0.215885  0.215885
2  2020-08-09  id01  0.012154  0.012154
3  2020-08-10  id01  0.374503  0.374503
4  2020-08-07  id02  0.369707  0.369707
5  2020-08-08  id02  0.676743  0.676743
6  2020-08-09  id02  0.659521  0.659521
7  2020-08-10  id02  0.799071  0.799071
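
A runnable sketch of the same idea with the intermediate mask spelled out, in case it helps to see what duplicated(keep='last') marks. The data is the question's sample without the date column; not_last is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Question's sample data, minus the 'date' column.
df = pd.DataFrame({
    "id": ["id01"] * 4 + ["id02"] * 4,
    "value": [0.100775, 0.215885, 0.012154, 0.374503,
              0.369707, 0.676743, 0.659521, 0.799071],
    "target": [np.nan, 0.215885, 0.012154, np.nan,
               0.369707, 0.676743, 0.659521, np.nan],
})

# True for every row that is NOT the last occurrence of its id.
not_last = df["id"].duplicated(keep="last")

# Rows whose NaN must survive: missing target and not last in the group.
m = df["target"].isnull() & not_last

# Fill every NaN from 'value', then put NaN back where m is True.
df["target"] = df["target"].fillna(df["value"]).mask(m)

print(df)
```

Because duplicated works on the raw column, this version does not build a groupby object at all. It relies on "last" meaning the last occurrence of each id in the current row order.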

Upvotes: 3

Juan C

Reputation: 6132

This should do it. I added the tail variable just to make the syntax easier to read:

tail = df.groupby('id').tail(1)
df.loc[tail.index, 'target'] = df.loc[tail.index, 'target'].fillna(tail['value'])

Output:

         date    id     value    target
0  2020-08-07  id01  0.100775       NaN
1  2020-08-08  id01  0.215885  0.215885
2  2020-08-09  id01  0.012154  0.012154
3  2020-08-10  id01  0.374503  0.374503
4  2020-08-07  id02  0.369707  0.369707
5  2020-08-08  id02  0.676743  0.676743
6  2020-08-09  id02  0.659521  0.659521
7  2020-08-10  id02  0.799071  0.799071
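
If this is needed in more than one place, the same two lines can be wrapped in a small helper. This is only a sketch: fill_last_in_group and its parameter names are made up for illustration, and the demo frame is hypothetical data chosen so that one group ends in NaN and the other ends in a real value, to check that non-null last rows are not overwritten.

```python
import numpy as np
import pandas as pd


def fill_last_in_group(df, key="id", src="value", dst="target"):
    """Return a copy in which the last row of each `key` group has a
    NaN in `dst` replaced by `src`; all other rows are left untouched."""
    out = df.copy()
    tail = out.groupby(key).tail(1)
    # fillna only touches NaN entries, so a non-null last value is kept.
    out.loc[tail.index, dst] = tail[dst].fillna(tail[src])
    return out


# Hypothetical data: group 'a' ends in NaN, group 'b' ends in a real value.
demo = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
    "target": [np.nan, np.nan, 3.0, 9.0],
})

result = fill_last_in_group(demo)
print(result)
```

Working on a copy keeps the caller's frame unchanged, which is a design choice rather than a requirement; drop the copy if in-place modification is preferred.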

Upvotes: 3
