Reputation: 7703
df_new = pd.DataFrame(
{
'person_id': [1, 1, 3, 3, 5, 5],
'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA', '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})
It looks as shown below.
What I would like to do is replace the NA-type rows with the actual date value from the same person_id group. For that I tried the following:
m1 = df_new['obs_date'].str.contains(r'^\d')
df_new['obs_date'] = df_new.groupby((m1).cumsum())['obs_date'].transform('first')
But this gives an unexpected output, as shown below.
For the row at index 2, the value should have been 11/25/2009 from person_id = 3; instead it is taken from the first group, person_id = 1.
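The cause becomes visible when the grouping key is printed next to person_id. A minimal sketch, reusing the sample frame from above (the column name `key` is only for illustration):

```python
import pandas as pd

df_new = pd.DataFrame({
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA',
                 '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA'],
})

# True where the value starts with a digit, i.e. looks like a date
m1 = df_new['obs_date'].str.contains(r'^\d')

# cumsum opens a new group at every date-like row and ignores person_id
key = m1.cumsum()
print(df_new.assign(key=key))
```

The row at index 2 belongs to person_id 3 but shares its key with the rows of person_id 1, so 'first' picks the date of person 1 for it.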
How can I get the expected output shown below?
Any elegant and efficient solution would be helpful, as I am dealing with more than a million records.
Upvotes: 3
Views: 127
Reputation: 750
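# The per-group string minimum is the date here, because digits sort before 'N'.
# This assumes at most one real date per person_id.
df_new = df_new.join(df_new.groupby('person_id')['obs_date'].min(),
                     on='person_id',
                     rsuffix='_clean')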
Output:
   person_id           obs_date obs_date_clean
0          1         12/31/2007     12/31/2007
1          1  NA-NA-NA NA:NA:NA     12/31/2007
2          3  NA-NA-NA NA:NA:NA     11/25/2009
3          3         11/25/2009     11/25/2009
4          5         10/15/2019     10/15/2019
5          5  NA-NA-NA NA:NA:NA     10/15/2019
Upvotes: 1
Reputation: 862571
First use to_datetime with errors='coerce' to convert non-datetime values to missing values (NaT), then use GroupBy.first inside GroupBy.transform to get the first non-missing value per group, broadcast to every row of the column:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('first')
#alternative - minimal value per group
#df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('min')
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
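The two steps can also be checked in isolation on a single group. A small sketch, assuming the same placeholder string as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'person_id': [3, 3],
    'obs_date': ['NA-NA-NA NA:NA:NA', '11/25/2009'],
})

# Strings that do not match the format become NaT instead of raising
parsed = pd.to_datetime(df['obs_date'], format='%m/%d/%Y', errors='coerce')
print(parsed.isna().tolist())

# 'first' returns the first non-missing value of the group,
# and transform broadcasts it back to every row of that group
filled = parsed.groupby(df['person_id']).transform('first')
print(filled.tolist())
```

Because 'first' skips NaT, the position of the real date inside the group does not matter.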
Another idea is to sort with DataFrame.sort_values, which places missing values last within each person_id, and then forward-fill per group with GroupBy.ffill:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = (df_new.sort_values(['person_id','obs_date'])
.groupby('person_id')['obs_date']
.ffill())
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
Upvotes: 2
Reputation: 75080
You can use pd.to_datetime(.., errors='coerce') to turn non-date values into NaT, and then apply ffill and bfill after the groupby:
# transform keeps the original index, so the result aligns with df_new
df_new['obs_date'] = (df_new.assign(obs_date=pd.to_datetime(df_new['obs_date'], errors='coerce'))
                            .groupby('person_id')['obs_date']
                            .transform(lambda x: x.ffill().bfill()))
print(df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
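With more than a million rows, the Python-level lambda may become a bottleneck. A possible variant, sketched here without any timing behind it, calls the built-in group-wise ffill and bfill directly (the names `dates` and `filled` are only for illustration):

```python
import pandas as pd

df_new = pd.DataFrame({
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA',
                 '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA'],
})

# Non-date strings become NaT
dates = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')

# Forward-fill within each person, then back-fill whatever is still missing
filled = dates.groupby(df_new['person_id']).ffill()
filled = filled.groupby(df_new['person_id']).bfill()

df_new['obs_date'] = filled
print(df_new)
```

Both calls return a Series on the original index, so the assignment lines up row by row.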
Upvotes: 1