Reputation: 2552
I have a dataframe of the following form: (many more columns than just these - removed for brevity)
import pandas as pd
headers = ['A','B','C']
data = [['p1','','v1'],
['p2','','ba'],
['p3',9,'fg'],
['p1',1,'fg'],
['p2',45,'af'],
['p3',1,'fg'],
['p1',1,'hf']
]
df = pd.DataFrame(data,columns=headers)
A B C
0 p1 v1
1 p2 ba
2 p3 9 fg
3 p1 1 fg
4 p2 45 af
5 p3 1 fg
6 p1 1 hf
Column B has duplicates, whereby the latest value should be non-NA (but may not be)
I want to replace col B values with the latest non-NA value. Something like this:
unique_people = df['A'].unique()
for person in unique_people:
sub_df = df[df['A'] == person]
val = sub_df['B'].tail(1).values
df['A'][df['A'] == person] = val # this also doesnt work because its not inplace
Im sure theres a better way to do it, but im not sure how. Could anyone point a better way?
Thanks!
Upvotes: 1
Views: 626
Reputation: 863216
First replace empty strings to missing values and then use GroupBy.transform
with GroupBy.last
for last non missing values per groups:
headers = ['A','B','C']
data = [['p1','','v1'],
['p2','','ba'],
['p3',9,'fg'],
['p1',1,'fg'],
['p2',45,'af'],
['p3',1,'fg'],
['p1','','hf']
]
df = pd.DataFrame(data,columns=headers)
df['B'] = df['B'].replace('', np.nan)
df['B'] = df.groupby('A')['B'].transform('last')
print (df)
A B C
0 p1 1.0 v1
1 p2 45.0 ba
2 p3 1.0 fg
3 p1 1.0 fg
4 p2 45.0 af
5 p3 1.0 fg
6 p1 1.0 hf
Upvotes: 1