Reputation: 361
I have a dataframe constructed as so:
df = pd.DataFrame({'id': [1,2,3,4,1,2,3,4],
'birthdate': ['01-01-01','02-02-02','03-03-03','04-04-04',
'','02-02-02','03-04-04','04-03-04']})
df['birthdate'] = pd.to_datetime(df['birthdate'])
I want to do a groupby to change the original data using pandas .transform
The condition is that I want to pick the birthdate
value of the first not null row per id
I know I can do max
if no other option is available to get rid of the not null entries, but if there are inconsistencies, I don't necessarily want the maximum date, just the one that occurs first in the dataframe.
As such:
df['birthdate'] = df.groupby('id')['birthdate'].transform(max)
This is how output looks using max
:
id birthdate
0 1 2001-01-01
1 2 2002-02-02
2 3 2003-03-03
3 4 2004-04-04
4 1 2001-01-01
5 2 2002-02-02
6 3 2004-03-04
7 4 2004-04-04
This is how I actually want it to look:
id birthdate
0 1 2001-01-01
1 2 2002-02-02
2 3 2003-03-03
3 4 2004-04-04
4 1 2001-01-01
5 2 2002-02-02
6 3 2003-03-03
7 4 2004-04-04
I'm pretty sure I have to create a customer lambda
to put inside the .transform
but I am unsure what condition to use.
Upvotes: 3
Views: 533
Reputation: 1811
You can try the following. Your dataframe definition and suggested outputs contain different dates, so I assumed your dataframe definition was correct
df['birthdate'] = df.groupby('id').transform('first')
which outputs.
id birthdate
0 1 2001-01-01
1 2 2002-02-02
2 3 2003-03-03
3 4 2004-04-04
4 1 2001-01-01
5 2 2002-02-02
6 3 2003-03-03
7 4 2004-04-04
Upvotes: 2