diogenes

Reputation: 2149

Pandas Groupby with Agg Min/Max date

I am trying to get the min/max dates from columns that might not have a date.

I have tried things like '', "NaN", True, False, 0, etc...

I first set the 'p_date' and 's_date' from the actual date based on the action of '+' or '-'.

 df_td['p_date'] = np.where(df_td.action == '+', df_td.date, ???)
 df_td['s_date'] = np.where(df_td.action == '-', df_td.date, ???)

So I need the min of the p_date & max of the s_date here.

  issue      p_date      s_date
0  issue  2012-11-01            
1  issue  2013-12-09            
2  issue  2014-12-08            
3  issue              2016-01-13
4  issue  2012-11-01            
5  issue              2014-03-26
6  issue              2015-05-29
7  issue  2013-12-18            
8  issue              2016-01-13

Then I do a groupby:

g = df_td.groupby(['name', 'type'], as_index=False).agg(
     {  ...
        'p_date': 'min',
        's_date': 'max'
      })

This gives me the following error.

'<=' not supported between instances of 'datetime.date' and 'str'

If I make up a date for the blank spaces it does work, but that is not a great option.

What can I fill the date with to get it to work?

Thanks.

Upvotes: 1

Views: 7752

Answers (2)

DSM

Reputation: 353589

In pandas, NaN is used as the missing value and is ignored by most operations, so it's the right one to use. If you're still getting an error, it's probably because your columns hold datetime.date objects (well, you definitely have those there; I mean they're probably what's causing the problem).

For example, if your missing values are "" and your column dtypes are object with internal types of datetime.date, I get:

In [496]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
[...]
TypeError: '<=' not supported between instances of 'datetime.date' and 'str'

but if I switch to pandas-native time objects and NaNs, it works:

In [500]: df["p_date"] = pd.to_datetime(df["p_date"])

In [501]: df["s_date"] = pd.to_datetime(df["s_date"])

In [502]: df
Out[502]: 
   issue     p_date     s_date
0  issue 2012-11-01        NaT
1  issue 2013-12-09        NaT
2  issue 2014-12-08        NaT
3  issue        NaT 2016-01-13
4  issue 2012-11-01        NaT
5  issue        NaT 2014-03-26
6  issue        NaT 2015-05-29
7  issue 2013-12-18        NaT
8  issue        NaT 2016-01-13

In [503]: df.groupby("issue").agg({"p_date": "min", "s_date": "max"})
Out[503]: 
          p_date     s_date
issue                      
issue 2012-11-01 2016-01-13

Upvotes: 2

diogenes

Reputation: 2149

I am not happy with my answer, but it seems to work.

I created floor and ceiling dates.

import datetime
import numpy as np

floor = datetime.date(1900, 1, 1)
ceil = datetime.date(2100, 1, 1)

df_td['p_date'] = np.where(df_td.action == '+', df_td.date, ceil)
df_td['s_date'] = np.where(df_td.action == '-', df_td.date, floor)

That way they don't affect the min/max aggregate functions, which need comparable date objects to work.

I would still like a more appropriate suggestion. Thanks.
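One caveat with sentinel dates worth noting: a group with no real values on one side reports the sentinel itself as its min/max, so it can leak into results. A sketch of that failure mode, assuming a hypothetical `df_td` where every action is '-':

```python
import datetime
import numpy as np
import pandas as pd

floor = datetime.date(1900, 1, 1)
ceil = datetime.date(2100, 1, 1)

# Hypothetical frame: no '+' rows at all for this group
df_td = pd.DataFrame({
    "name":   ["a", "a", "a"],
    "type":   ["t", "t", "t"],
    "action": ["-", "-", "-"],
    "date": [datetime.date(2016, 1, 13),
             datetime.date(2014, 3, 26),
             datetime.date(2015, 5, 29)],
})

df_td["p_date"] = np.where(df_td.action == "+", df_td.date, ceil)
df_td["s_date"] = np.where(df_td.action == "-", df_td.date, floor)

g = df_td.groupby(["name", "type"], as_index=False).agg(
    {"p_date": "min", "s_date": "max"})

# With no '+' rows, p_date comes back as the year-2100 sentinel,
# which downstream code would have to recognize and strip out.
print(g)
```

This is the main argument for NaT over sentinels: NaT drops out of the aggregation on its own, while sentinels need an extra cleanup pass.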

Upvotes: 0
