Reputation: 507
I would like to find the minimum value of the column date (pd.to_datetime format) other than "1777-07-07" which is basically outlier. The input dataframe is as shown
col2 date
b1a2 1777-07-07
b1a2 2012-09-14
b1a2 1777-07-07
b1a2 1777-07-07
b1a2 2017-09-14
b1a2 2019-09-24
b1a2 2012-09-14
b1a2 2012-09-14
b1a2 2012-09-28
a1l2 1777-07-07
a1l2 2012-09-24
a1l2 2012-09-24
a1l2 2002-09-28
a1l2 2012-09-24
a1l2 2008-09-14
a1l2 2012-09-24
So when i tred the following
df = df.join(df.groupby(['col2'])['date'].agg({'earliest':'min'}),on=['disability_case_id'])
df = df.join(df.groupby(['col2'])['date'].agg({'latest':'max'}),on=['disability_case_id'])
This one gives me both max and min values as shown
col2 date earliset latest
b1a2 1777-07-07 1777-07-07 2019-09-24
b1a2 2012-09-14 1777-07-07 2019-09-24
b1a2 2017-09-14 1777-07-07 2019-09-24
b1a2 2019-09-24 1777-07-07 2019-09-24
b1a2 2012-09-14 1777-07-07 2019-09-24
b1a2 2012-09-14 1777-07-07 2019-09-24
b1a2 2012-09-28 1777-07-07 2019-09-24
a1l2 1777-07-07 1777-07-07 2012-09-28
a1l2 2012-09-24 1777-07-07 2012-09-28
a1l2 2012-09-28 1777-07-07 2012-09-28
a1l2 2002-09-28 1777-07-07 2012-09-28
a1l2 2012-09-24 1777-07-07 2012-09-28
a1l2 2008-09-14 1777-07-07 2012-09-28
a1l2 2012-09-24 1777-07-07 2012-09-28
But I want to avoid the outlier, My expected output is
b1a2 1777-07-07 2012-09-14 2019-09-24
b1a2 2012-09-14 2012-09-14 2019-09-24
b1a2 2017-09-14 2012-09-14 2019-09-24
b1a2 2019-09-24 2012-09-14 2019-09-24
b1a2 2012-09-14 2012-09-14 2019-09-24
b1a2 2012-09-14 2012-09-14 2019-09-24
b1a2 2012-09-28 2012-09-14 2019-09-24
a1l2 1777-07-07 2002-09-28 2012-09-28
a1l2 2012-09-24 2002-09-28 2012-09-28
a1l2 2012-09-28 2002-09-28 2012-09-28
a1l2 2002-09-28 2002-09-28 2012-09-28
a1l2 2012-09-24 2002-09-28 2012-09-28
a1l2 2008-09-14 2002-09-28 2012-09-28
a1l2 2012-09-24 2002-09-28 2012-09-28
Upvotes: 1
Views: 40
Reputation: 59529
With a constant outlier, mask before the groupby. Use transform
to broadcast back to the original DataFrame.
df['date'] = pd.to_datetime(df.date)
s = df.date.where(df.date.ne('1777-07-07')).groupby(df.col2)
df['earliest'] = s.transform('min')
df['latest'] = s.transform('max')
col2 date earliest latest
0 b1a2 1777-07-07 2012-09-14 2019-09-24
1 b1a2 2012-09-14 2012-09-14 2019-09-24
2 b1a2 1777-07-07 2012-09-14 2019-09-24
3 b1a2 1777-07-07 2012-09-14 2019-09-24
4 b1a2 2017-09-14 2012-09-14 2019-09-24
5 b1a2 2019-09-24 2012-09-14 2019-09-24
6 b1a2 2012-09-14 2012-09-14 2019-09-24
7 b1a2 2012-09-14 2012-09-14 2019-09-24
8 b1a2 2012-09-28 2012-09-14 2019-09-24
9 a1l2 1777-07-07 2002-09-28 2012-09-24
10 a1l2 2012-09-24 2002-09-28 2012-09-24
11 a1l2 2012-09-24 2002-09-28 2012-09-24
12 a1l2 2002-09-28 2002-09-28 2012-09-24
13 a1l2 2012-09-24 2002-09-28 2012-09-24
14 a1l2 2008-09-14 2002-09-28 2012-09-24
15 a1l2 2012-09-24 2002-09-28 2012-09-24
Upvotes: 2
Reputation: 51175
Mask invalid values and continue as you were before.
u = df['date'].mask(df['date'].eq('1777-07-07')).groupby(df['col2']).agg(['min', 'max'])
df.merge(u, left_on='col2', right_index=True)
col2 date min max
0 b1a2 1777-07-07 2012-09-14 2019-09-24
1 b1a2 2012-09-14 2012-09-14 2019-09-24
2 b1a2 1777-07-07 2012-09-14 2019-09-24
3 b1a2 1777-07-07 2012-09-14 2019-09-24
4 b1a2 2017-09-14 2012-09-14 2019-09-24
5 b1a2 2019-09-24 2012-09-14 2019-09-24
6 b1a2 2012-09-14 2012-09-14 2019-09-24
7 b1a2 2012-09-14 2012-09-14 2019-09-24
8 b1a2 2012-09-28 2012-09-14 2019-09-24
9 a1l2 1777-07-07 2002-09-28 2012-09-24
10 a1l2 2012-09-24 2002-09-28 2012-09-24
11 a1l2 2012-09-24 2002-09-28 2012-09-24
12 a1l2 2002-09-28 2002-09-28 2012-09-24
13 a1l2 2012-09-24 2002-09-28 2012-09-24
14 a1l2 2008-09-14 2002-09-28 2012-09-24
15 a1l2 2012-09-24 2002-09-28 2012-09-24
Upvotes: 1