andrew_reece
andrew_reece

Reputation: 21274

Pandas groupby datetime function does not preserve dtype

I'm having trouble extracting the .minute property of a Pandas datetime object, in the context of a groupby aggregation.

This post appears to touch on the same root issue, but the accepted answer just explained why the problem was happening (which is fair, as the OP only asked to understand the problem). I'm posting now because I'm hoping to find a solution that doesn't rely on explicitly changing the type of the data I'm aggregating.

Here's some example code:

import pandas as pd
ids = ['a','a','b','b']
dates = ['2017-01-01 01:01:00','2017-01-01 01:02:00',
         '2017-03-03 01:03:00','2017-03-03 01:04:00']
dates = pd.to_datetime(pd.Series(dates))
df = pd.DataFrame({'id':ids, 'datetime':dates})

  id             datetime     
0  a  2017-01-01 01:01:00
1  a  2017-01-01 01:02:00
2  b  2017-03-03 01:03:00
3  b  2017-03-03 01:04:00

My goal is to group by id, and then extract the minute, as an integer value, of the earliest timestamp in each datetime group.

For example, to do this across all datetime values, this works:

df.datetime.min().minute  # returns 1

I want to mimic that same functionality in a groupby() setting.
Combining min() and .minute in a UDF, however, the minute value gets tacked onto the end of a datetime that marks the start of the Unix epoch:

def get_earliest_minute(tstamps):
    return tstamps.min().minute

df.groupby('id').agg({'datetime':get_earliest_minute})

                         datetime
id  
 a  1970-01-01 00:00:00.000000001
 b  1970-01-01 00:00:00.000000003

The type returned from get_earliest_minute() is an integer:

def get_earliest_minute(tstamps):
    return type(tstamps.min().minute)

df.groupby('id').agg({'datetime':get_earliest_minute})

        datetime
id  
 a  <type 'int'>
 b  <type 'int'>

But the type of datetime, post-aggregation, is <M8[ns]:

df.groupby('id').agg({'datetime':get_earliest_minute}).datetime.dtype # dtype('<M8[ns]')

This answer to the post linked above states that this happens because of a purposeful type coercion, which tries to maintain the type of the original Series object that underwent aggregation. I've looked around a bit but couldn't find any solutions, other than one comment that suggested changing the type of the field to object before performing groupby(), e.g.,

df.datetime = df.datetime.astype(object)
df.groupby('id').agg({'datetime':get_earliest_minute})

and another comment which proposed to convert the output of the function to float before returning, e.g.,

def get_earliest_minute(tstamps):
    return float(tstamps.min().minute)

These workarounds do the job (although for some reason declaring int() does not escape type coercion like float() does), but is there a way to do these groupby manipulations on datetime objects without inserting explicit type conversions (i.e., either generalizing <M8[ns]->object or converting int->float)? In particular, in a case in where multiple agg() functions are applied to datetime, with some functions which rely on datetime attributes and some which don't, this wouldn't succeed with a pre-groupby conversion.

Also, is there a reason why float() type conversion overrides the built-in type coercion, but int() does not?
Thanks in advance!

Upvotes: 1

Views: 1438

Answers (1)

piRSquared
piRSquared

Reputation: 294468

I'm going to stick with @Jeff on this one. agg is doing what we all want. It is attempting to preserve the dtype because it is intended to aggregate the values of a particular dtype. And when I aggregate data of a particular dtype I expect that same dtype back...

...That said, you can very easily work around this with apply

your problem

def get_earliest_minute(tstamps):
    return tstamps.min().minute

df.groupby('id').agg({'datetime':get_earliest_minute})

                        datetime
id                              
a  1970-01-01 00:00:00.000000001
b  1970-01-01 00:00:00.000000003

workaround

def get_earliest_minute(tstamps):
    return tstamps.min().minute

df.groupby('id').datetime.apply(get_earliest_minute)

id
a    1
b    3
Name: datetime, dtype: int64

Upvotes: 1

Related Questions