The Great

Reputation: 7703

Efficient and elegant way to fill values in a pandas column based on each group

import pandas as pd

df_new = pd.DataFrame({
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA',
                 '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})

It looks like this:

   person_id           obs_date
0          1         12/31/2007
1          1  NA-NA-NA NA:NA:NA
2          3  NA-NA-NA NA:NA:NA
3          3         11/25/2009
4          5         10/15/2019
5          5  NA-NA-NA NA:NA:NA

What I would like to do is replace/fill the NA-type rows with the actual date value from the same person_id group, for which I tried the below:

m1 = df_new['obs_date'].str.contains(r'^\d')
df_new['obs_date'] = df_new.groupby(m1.cumsum())['obs_date'].transform('first')

But this gives an unexpected output:

   person_id    obs_date
0          1  12/31/2007
1          1  12/31/2007
2          3  12/31/2007
3          3  11/25/2009
4          5  10/15/2019
5          5  10/15/2019

Here the row at index 2 should have been 11/25/2009 (from person_id = 3), but instead it was filled from the first group (person_id = 1).
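The misbehaviour can be seen by inspecting the group labels the mask builds: the cumulative sum only increments on rows that start with a digit, so a leading NA row is lumped into the previous person's group (a small sketch reusing the sample frame from the question):

```python
import pandas as pd

df_new = pd.DataFrame({
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA',
                 '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})

m1 = df_new['obs_date'].str.contains(r'^\d')
# Row 2 (person 3's NA) gets label 1, the same group as person 1's rows:
print(m1.cumsum().tolist())  # [1, 1, 1, 2, 3, 3]
```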

How can I get the expected output shown below?

   person_id    obs_date
0          1  12/31/2007
1          1  12/31/2007
2          3  11/25/2009
3          3  11/25/2009
4          5  10/15/2019
5          5  10/15/2019

Any elegant and efficient solution is helpful, as I am dealing with more than a million records.

Upvotes: 3

Views: 127

Answers (3)

E. Sommer

Reputation: 750

df_new = df_new.join(df_new.groupby('person_id')['obs_date'].min(),
                     on='person_id',
                     rsuffix='_clean')

Output:

   person_id           obs_date obs_date_clean
0          1         12/31/2007     12/31/2007
1          1  NA-NA-NA NA:NA:NA     12/31/2007
2          3  NA-NA-NA NA:NA:NA     11/25/2009
3          3         11/25/2009     11/25/2009
4          5         10/15/2019     10/15/2019
5          5  NA-NA-NA NA:NA:NA     10/15/2019
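One caveat (my note, not from the answer): min() here compares raw strings, which only yields the date because 'N' happens to sort after every digit. Parsing to datetime before taking the group minimum is a safer variant:

```python
import pandas as pd

df_new = pd.DataFrame({
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA',
                 '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})

# Parse first so min() compares real dates, not strings; the NA rows
# become NaT, which min() skips.
dates = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new = df_new.assign(obs_date=dates).join(
    dates.groupby(df_new['person_id']).min().rename('obs_date_clean'),
    on='person_id')
```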

Upvotes: 1

jezrael

Reputation: 862571

First use to_datetime with errors='coerce' to convert the non-datetime strings to missing values, then use GroupBy.transform with 'first' to fill the column with the first non-missing value per group:

df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('first')
#alternative - minimal value per group
#df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('min')

print (df_new)
   person_id   obs_date
0          1 2007-12-31
1          1 2007-12-31
2          3 2009-11-25
3          3 2009-11-25
4          5 2019-10-15
5          5 2019-10-15

Another idea is to use DataFrame.sort_values with GroupBy.ffill, so the NaT rows sort to the end of each group and are filled from the date above them:

df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = (df_new.sort_values(['person_id','obs_date'])
                            .groupby('person_id')['obs_date']
                            .ffill())

print (df_new)
   person_id   obs_date
0          1 2007-12-31
1          1 2007-12-31
2          3 2009-11-25
3          3 2009-11-25
4          5 2019-10-15
5          5 2019-10-15
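The property relied on above is that transform('first') returns the first non-missing value per group, not simply the first row; a minimal demonstration:

```python
import pandas as pd

s = pd.Series([pd.NaT, pd.Timestamp('2009-11-25')])
g = pd.Series([3, 3])

# 'first' skips missing values, unlike taking iloc[0] of each group,
# so every row in the group receives the non-NaT date.
out = s.groupby(g).transform('first')
print(out.tolist())
```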

Upvotes: 2

anky

Reputation: 75080

You can use pd.to_datetime(.., errors='coerce') to turn the non-date values into NaT, then ffill and bfill within each group after a groupby:

df_new['obs_date'] = (df_new.assign(obs_date=pd.to_datetime(df_new['obs_date'],
                                                            errors='coerce'))
                            .groupby('person_id')['obs_date']
                            .apply(lambda x: x.ffill().bfill()))
print(df_new)

   person_id   obs_date
0          1 2007-12-31
1          1 2007-12-31
2          3 2009-11-25
3          3 2009-11-25
4          5 2019-10-15
5          5 2019-10-15
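With millions of rows, the per-group Python lambda inside apply can become a bottleneck; a possible alternative (my sketch, not from the answer) is two vectorised grouped passes:

```python
import pandas as pd

df_new = pd.DataFrame({
    'person_id': [1, 1, 3, 3, 5, 5],
    'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA',
                 '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})

df_new['obs_date'] = pd.to_datetime(df_new['obs_date'],
                                    format='%m/%d/%Y', errors='coerce')
# GroupBy.ffill / GroupBy.bfill run in C instead of a per-group lambda:
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].ffill()
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].bfill()
```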

Upvotes: 1
