Reputation: 21274
I'm having trouble extracting the .minute property of a Pandas datetime object, in the context of a groupby aggregation.
This post appears to touch on the same root issue, but the accepted answer only explained why the problem was happening (which is fair, as the OP only asked to understand the problem). I'm posting now because I'm hoping to find a solution that doesn't rely on explicitly changing the type of the data I'm aggregating.
Here's some example code:
import pandas as pd
ids = ['a','a','b','b']
dates = ['2017-01-01 01:01:00','2017-01-01 01:02:00',
         '2017-03-03 01:03:00','2017-03-03 01:04:00']
dates = pd.to_datetime(pd.Series(dates))
df = pd.DataFrame({'id':ids, 'datetime':dates})
id datetime
0 a 2017-01-01 01:01:00
1 a 2017-01-01 01:02:00
2 b 2017-03-03 01:03:00
3 b 2017-03-03 01:04:00
My goal is to group by id, and then extract the minute, as an integer, of the earliest timestamp in each datetime group.
For example, to do this across all datetime values, this works:
df.datetime.min().minute # returns 1
I want to mimic that same functionality in a groupby() setting.
When I combine min() and .minute in a UDF, however, the minute value gets tacked onto the end of a datetime that marks the start of the Unix epoch:
def get_earliest_minute(tstamps):
return tstamps.min().minute
df.groupby('id').agg({'datetime':get_earliest_minute})
datetime
id
a 1970-01-01 00:00:00.000000001
b 1970-01-01 00:00:00.000000003
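As far as I can tell, those odd timestamps are just the integer minute values being reinterpreted as nanosecond offsets from the Unix epoch; a quick sketch that reproduces the values shown above:

```python
import pandas as pd

# The integers 1 and 3, read as nanoseconds since the epoch, give exactly
# the timestamps that show up in the aggregation result
print(pd.to_datetime(1))  # 1970-01-01 00:00:00.000000001
print(pd.to_datetime(3))  # 1970-01-01 00:00:00.000000003
```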
The type returned from get_earliest_minute() is an integer:
def get_earliest_minute(tstamps):
return type(tstamps.min().minute)
df.groupby('id').agg({'datetime':get_earliest_minute})
datetime
id
a <type 'int'>
b <type 'int'>
But the dtype of datetime, post-aggregation, is <M8[ns]:
df.groupby('id').agg({'datetime':get_earliest_minute}).datetime.dtype # dtype('<M8[ns]')
This answer to the post linked above states that this happens because of a purposeful type coercion, which tries to maintain the dtype of the original Series being aggregated. I've looked around a bit but couldn't find any solutions, other than one comment that suggested changing the type of the field to object before performing groupby(), e.g.,
df.datetime = df.datetime.astype(object)
df.groupby('id').agg({'datetime':get_earliest_minute})
and another comment which proposed converting the output of the function to float before returning, e.g.,
def get_earliest_minute(tstamps):
return float(tstamps.min().minute)
These workarounds do the job (although for some reason wrapping the result in int() does not escape the type coercion the way float() does), but is there a way to do these groupby manipulations on datetime objects without inserting explicit type conversions (i.e., either converting <M8[ns] -> object on the way in, or int -> float on the way out)? In particular, when multiple agg() functions are applied to datetime, some relying on datetime attributes and some not, a pre-groupby conversion wouldn't work.
Also, is there a reason why float() conversion overrides the built-in type coercion, but int() does not?
Thanks in advance!
Upvotes: 1
Views: 1438
Reputation: 294468
I'm going to stick with @Jeff on this one. agg is doing what we all want. It is attempting to preserve the dtype because it is intended to aggregate the values of a particular dtype. And when I aggregate data of a particular dtype, I expect that same dtype back...
...That said, you can very easily work around this with apply.
your problem
def get_earliest_minute(tstamps):
return tstamps.min().minute
df.groupby('id').agg({'datetime':get_earliest_minute})
datetime
id
a 1970-01-01 00:00:00.000000001
b 1970-01-01 00:00:00.000000003
workaround
def get_earliest_minute(tstamps):
return tstamps.min().minute
df.groupby('id').datetime.apply(get_earliest_minute)
id
a 1
b 3
Name: datetime, dtype: int64
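To address the multi-function case from the question, one option is to extend the apply idea by returning a Series per group, so aggregations that need datetime attributes and those that return timestamps can live side by side. A sketch (summarize is a made-up helper name):

```python
import pandas as pd

ids = ['a', 'a', 'b', 'b']
dates = pd.to_datetime(pd.Series(
    ['2017-01-01 01:01:00', '2017-01-01 01:02:00',
     '2017-03-03 01:03:00', '2017-03-03 01:04:00']))
df = pd.DataFrame({'id': ids, 'datetime': dates})

def summarize(tstamps):
    # Each result lives in its own labeled slot, so no single output
    # dtype is forced onto all of them
    return pd.Series({'earliest_minute': tstamps.min().minute,
                      'latest': tstamps.max()})

out = df.groupby('id').datetime.apply(summarize)
print(out)
```

The result is a Series with a (id, statistic) MultiIndex; calling .unstack() on it gives one column per statistic.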
Upvotes: 1