fede72bari
fede72bari

Reputation: 543

Pandas how to group by a column and minutes ranges excluding date

I have to elaborate simple statistics grouping a dataframe by one column, for instance day_of_the_week, and minutes ranges, for instance 15 minutess, without keeping the dates. In other words, I need statistics on what is happening in each interval of 15 minutes on all sundays, mondays, etc. not split by date. The starting dataframe is something like this

    datetime        high low Day_of_the_week HL_delta               

2021-08-01 22:00:00 4403.00 4395.25 6.0 7.75

2021-08-01 22:15:00 4404.00 4401.00 6.0 3.00

2021-08-01 22:30:00 4409.00 4403.25 6.0 5.75

2021-08-01 22:45:00 4408.25 4406.25 6.0 2.00

2021-08-01 23:00:00 4408.25 4405.75 6.0 2.5

where datetime is the index of the dataframe and it is a DateTime type.

I need to calculate the mean and max value for each distinguished day of the week of HL_delta grouped by 15 minute ranges over 1 year data.

I have tried something like this

df_statistics['HL_mean'] = df_data_for_statistics.groupby([df_data_for_statistics['day_of_the_week'], pd.Grouper(freq=statistics_period_minutes_formatted,closed='left',label='left')]).agg({ "HL_delta":  "mean"})

df_statistics['HL_max'] = df_data_for_statistics.groupby([df_data_for_statistics['day_of_the_week'], pd.Grouper(freq=statistics_period_minutes_formatted,closed='left',label='left')]).agg({ "HL_delta":  "max"})

but what i get is not an aggregation on all the distinguished weekdays of the year, the aggregation is applied on group of 15 minutes of each date, not each monday, Tuesday, Wednesday,.... The statistics shall answer to the questions: "which is the max values of HL_delta between the time 00:00 and 00:15 of all the Mondays of the year", "which is the max values of HL_delta between the time 00:16 and 00:30 of all the Mondays of the year", ..., "which is the max values of HL_delta between the time 00:00 and 00:15 of all the Fridays of the year", ... etc. Instead what I get by this attempt is this

                    high    low day_of_the_week HL_delta

datetime                

2021-08-01 22:00:00 4403.00 4395.25 6.0 7.75

2021-08-01 22:15:00 4404.00 4401.00 6.0 3.00

2021-08-01 22:30:00 4409.00 4403.25 6.0 5.75

2021-08-01 22:45:00 4408.25 4406.25 6.0 2.00

2021-08-01 23:00:00 4408.25 4405.75 6.0 2.50

... ... ... ... ...

2022-03-21 22:45:00 4453.50 4451.50 0.0 2.00

2022-03-21 23:00:00 4452.25 4449.00 0.0 3.25

2022-03-21 23:15:00 4451.50 4449.25 0.0 2.25

2022-03-21 23:30:00 4451.50 4448.50 0.0 3.00

2022-03-21 23:45:00 4449.75 4445.25 0.0 4.50

Any suggestion?

Upvotes: 1

Views: 197

Answers (1)

Laurent
Laurent

Reputation: 13458

With the following toy dataframe:

import random
import pandas as pd

list_of_dates = pd.date_range(start="2021-1-1", end="2021-12-31", freq="T")
df = pd.DataFrame(
    {
        "datetime": list_of_dates,
        "low": [random.randint(0, 4_999) for _ in range(len(list_of_dates))],
        "high": [random.randint(5_000, 9_999) for _ in range(len(list_of_dates))],
    }
)
df["HL_delta"] = df["high"] - df["low"]

print(df)
# Output
             datetime   low  high  HL_delta
0 2021-01-01 00:00:00  4325  5059       734
1 2021-01-01 00:01:00   917  7224      6307
2 2021-01-01 00:02:00  2956  7804      4848
3 2021-01-01 00:03:00  1329  8056      6727
4 2021-01-01 00:04:00  1721  9144      7423
...

Here is one way to do it:

# Setup
df["weekday"] = df["datetime"].dt.day_name()
df["datetime_minute"] = df["datetime"].dt.minute
intervals = {"0-15": [0, 15], "16-30": [16, 30], "31-45": [31, 45], "46-59": [46, 59]}

# Find intervals
df["interval"] = df.apply(
    lambda x: next(
        filter(
            None,
            [
                key
                if x["datetime_minute"] >= intervals[key][0]
                and x["datetime_minute"] <= intervals[key][1]
                else None
                for key in intervals.keys()
            ],
        )
    ),
    axis=1,
)

# Get stats
new_df = (
    df.drop(columns=["low", "high", "datetime", "datetime_minute"])
    .groupby(["weekday", "interval"], sort=False)
    .agg(["max", "mean"])
)

And so:

print(new_df)
# Output
                   HL_delta
                        max         mean
weekday   interval
Friday    0-15         9989  5011.461666
          16-30        9948  5003.452724
          31-45        9902  4969.810577
          46-59        9926  5007.599073
Saturday  0-15         9950  5004.103966
          16-30        9961  4984.479327
          31-45        9954  5005.854647
          46-59        9973  5011.797447
Sunday    0-15         9979  4994.012270
          16-30        9950  4981.940438
          31-45        9877  5009.572276
          46-59        9930  5020.719609
Monday    0-15         9974  4963.538812
          16-30        9918  4977.481090
          31-45        9971  4977.858173
          46-59        9958  4992.886733
Tuesday   0-15         9924  5014.045623
          16-30        9966  4990.358547
          31-45        9948  4993.595566
          46-59        9948  5000.271120
Wednesday 0-15         9975  4998.320463
          16-30        9976  4981.763889
          31-45        9981  4981.806303
          46-59        9995  5001.579670
Thursday  0-15         9958  5015.276643
          16-30        9900  4996.797489
          31-45        9949  4991.088034
          46-59        9948  4980.678457

Upvotes: 1

Related Questions