StormsEdge
StormsEdge

Reputation: 885

Issue with Groupby function in Pandas

I am trying to take a dataframe, which has timestamps and various other fields, and group by rounded timestamps (to the nearest minute), and take the average of another field. I'm getting the error no numeric value to aggregate

I am rounding the timestamp column like such:

df['time'] = df['time'].dt.round('1min')

The aggregation column is of the form: 0 days 00:00:00.054000

df3 = (
    df
    .groupby([df['time']])['total_diff'].mean()
    .unstack(fill_value=0)
    .reset_index()
)

I realize the total_diff column is a time delta field, but I would have thought this would still be considered numerical?

My ideal output would have the following columns: Rounded timestamp, number of records that are grouped in that timestamp, average total_diff by rounded timestamp. How can I achieve this?

EDIT Example row:

[index, id, time, total_diff]
[400, 5fdfe9242c2fb0da04928d55, 2020-12-21 00:16:00, 0 days 00:00:00.055000]
[401, 5fdfe9242c2fb0da04928d56, 2020-12-21 00:16:00, 0 days 00:00:00.01000]
[402, 5fdfe9242c2fb0da04928d57, 2020-12-21 00:15:00, 0 days 00:00:00.05000]

The time column is not unique. I want to group by the time column, count the number of rows that are grouped into each time bucket, and produce an average of the total_diff for each time bucket.

Desired outcome:

[time, count, avg_total_diff]
[2020-12-21 00:16:00, 2, .0325]

Upvotes: 0

Views: 6901

Answers (1)

ALollz
ALollz

Reputation: 59579

By default DataFrame.groupby.mean has numeric_only=True, and numeric only considers int, bool and float. To also work with timedelta64[ns] you must set this to False .

Sample Data

import pandas as pd

df = pd.DataFrame(pd.date_range('2010-01-01', freq='2T', periods=6))
df[1] = df[0].diff().bfill()
#                    0               1
#0 2010-01-01 00:00:00 0 days 00:02:00
#1 2010-01-01 00:02:00 0 days 00:02:00
#2 2010-01-01 00:04:00 0 days 00:02:00
#3 2010-01-01 00:06:00 0 days 00:02:00
#4 2010-01-01 00:08:00 0 days 00:02:00
#5 2010-01-01 00:10:00 0 days 00:02:00

df.dtypes
#0     datetime64[ns]
#1    timedelta64[ns]
#dtype: object

Code

df.groupby(df[0].round('5T'))[1].mean()
#DataError: No numeric types to aggregate

df.groupby(df[0].round('5T'))[1].mean(numeric_only=False)
#0
#2010-01-01 00:00:00   0 days 00:02:00
#2010-01-01 00:05:00   0 days 00:02:00
#2010-01-01 00:10:00   0 days 00:02:00
#Name: 1, dtype: timedelta64[ns]

Upvotes: 3

Related Questions