trock2000

Reputation: 312

pandas resample uneven hourly data into 1D or 24h bins

I have hourly FX data, week by week, which I need to resample into '1D' or '24h' bins that end at 12:00 Monday through Thursday and at 21:00 on Friday, totaling five bins per week:

Date                     rate
2020-01-02 00:00:00  0.673355
2020-01-02 01:00:00  0.673110
2020-01-02 02:00:00  0.672925
2020-01-02 03:00:00  0.672240
2020-01-02 04:00:00  0.671980
2020-01-02 05:00:00  0.672230
2020-01-02 06:00:00  0.671895
2020-01-02 07:00:00  0.672175
2020-01-02 08:00:00  0.672085
2020-01-02 09:00:00  0.670870
2020-01-02 10:00:00  0.670580
2020-01-02 11:00:00  0.668840
2020-01-02 12:00:00  0.669460
2020-01-02 13:00:00  0.670160
2020-01-02 14:00:00  0.670560
2020-01-02 15:00:00  0.671240
2020-01-02 16:00:00  0.669170
2020-01-02 17:00:00  0.668830
2020-01-02 18:00:00  0.668920
2020-01-02 19:00:00  0.669345
2020-01-02 20:00:00  0.669590
2020-01-02 21:00:00  0.670175
2020-01-02 22:00:00  0.669630
2020-01-02 23:00:00  0.669835
2020-01-03 00:00:00  0.669570

So the number of hours in each bin is uneven, i.e. "Monday" = 00:00 Monday through 12:00 Monday, "Tuesday" = 13:00 Monday through 12:00 Tuesday (and likewise for Wednesday and Thursday), and "Friday" = 13:00 Thursday through 21:00 Friday.
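
For reference, the sample above can be read into a DataFrame along these lines (a minimal sketch; the whitespace-separated date/time/rate layout is assumed from the snippet above, and only a few rows are repeated):

import io
import pandas as pd

raw = """\
2020-01-02 00:00:00 0.673355
2020-01-02 01:00:00 0.673110
2020-01-02 12:00:00 0.669460
2020-01-03 00:00:00 0.669570
"""

# date and time are separate whitespace-delimited fields, so read them
# individually and combine into a DatetimeIndex
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None,
                 names=["date", "time", "rate"])
df.index = pd.to_datetime(df["date"] + " " + df["time"])
df = df[["rate"]].rename_axis("Date")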

In trying to find a solution I see that base is now deprecated, and the offset/origin parameters aren't working as expected, likely due to the uneven number of rows per day:

df.rate.resample('24h', offset=12).ohlc() 

I've spent hours attempting to find a solution.

How can one simply bin all the data rows between each 12:00:00 timestamp into ohlc() columns?

The desired output would look something like this:

Out[69]: 
                                   open      high       low     close
2020-01-02 00:00:00.0000000  0.673355  0.673355  0.673355  0.673355
2020-01-03 00:00:00.0000000  0.673110  0.673110  0.668830  0.669570
2020-01-04 00:00:00.0000000  0.668280  0.668280  0.664950  0.666395
2020-01-05 00:00:00.0000000  0.666425  0.666425  0.666425  0.666425

Upvotes: 1

Views: 1457

Answers (3)

trock2000

Reputation: 312

I ended up using a combination of groupby and day-of-week identification on the DatetimeIndex to arrive at my specific solution:

import numpy as np
import pandas as pd

# get idxs of time to rebal (12:00:00) -------------------------------------
df['idx'] = range(len(df))        # positional row index
df['day'] = df.index.dayofweek    # day of week per row (Mon=0 ... Sun=6)

dtChgIdx = []                                    # stores positions of the "12:00:00" rows
justDates = list(dict.fromkeys(df.index.date))   # unique dates, order preserved
grouped_dates = df.groupby(df.index.date)        # group entire df by date

for d in justDates:
    tempDf = grouped_dates.get_group(d).copy()   # look at each date's group
    if tempDf['day'].iloc[0] == 6:
        continue                                 # skip Sundays
    tempDf['time'] = tempDf.index.strftime('%H:%M:%S')                 # time portion of index
    tempDf['dayCls'] = np.where(tempDf['time'] == '12:00:00', 1, 0)    # flag the "12:00:00" row
    dtChgIdx.append(tempDf.loc[tempDf['dayCls'] == 1, 'idx'].iloc[0])  # store its position
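
The snippet above only collects the positions of the 12:00:00 boundary rows; the final OHLC aggregation is not shown. A minimal sketch of that last step (not part of the original snippet, and assuming each bin runs from just after one boundary through the next boundary) could look like this:

# sketch: aggregate the rows between consecutive 12:00 boundaries into OHLC,
# labelling each bin by the timestamp of its closing boundary
rows, labels = [], []
for start, end in zip(dtChgIdx[:-1], dtChgIdx[1:]):
    chunk = df['rate'].iloc[start + 1:end + 1]   # rows after one 12:00 up to and including the next
    rows.append({'open': chunk.iloc[0], 'high': chunk.max(),
                 'low': chunk.min(), 'close': chunk.iloc[-1]})
    labels.append(df.index[end])
ohlc = pd.DataFrame(rows, index=labels)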

Upvotes: 0

Bill Huang

Reputation: 4648

Since the period lengths are unequal, IMO it is necessary to craft the mapping yourself. Put precisely, the 1.5-day length of the Monday period makes it impossible for freq='D' to do the mapping correctly in one pass.

The hand-crafted code also guards against records that fall outside the well-defined periods.

Data

Slightly different timestamps are used here to demonstrate the correctness of the code; the days run from Monday to Friday.

import io

import pandas as pd
from pandas import Timestamp, Timedelta

df = pd.read_csv(io.StringIO("""
                         rate
Date                         
2020-01-06 00:00:00  0.673355
2020-01-06 23:00:00  0.673110
2020-01-07 00:00:00  0.672925
2020-01-07 12:00:00  0.672240
2020-01-07 13:00:00  0.671980
2020-01-07 23:00:00  0.672230
2020-01-08 00:00:00  0.671895
2020-01-08 12:00:00  0.672175
2020-01-08 23:00:00  0.672085
2020-01-09 00:00:00  0.670870
2020-01-09 12:00:00  0.670580
2020-01-09 23:00:00  0.668840
2020-01-10 00:00:00  0.669460
2020-01-10 12:00:00  0.670160
2020-01-10 21:00:00  0.670560
2020-01-10 22:00:00  0.671240
2020-01-10 23:00:00  0.669170
"""), sep=r"\s{2,}", engine="python")

df.set_index(pd.to_datetime(df.index), inplace=True)

Code

def find_day(ts: Timestamp):
    """Find the trading day with irregular length"""

    wd = ts.isoweekday()
    if wd == 1:
        return ts.date()
    elif wd in (2, 3, 4):
        return ts.date() - Timedelta("1D") if ts.hour <= 12 else ts.date()
    elif wd == 5:
        if ts.hour <= 12:
            return ts.date() - Timedelta("1D")
        elif 13 <= ts.hour <= 21:
            return ts.date()

    # out of range or nulls
    return None

# map the timestamps, and set as new index
df.set_index(pd.DatetimeIndex(df.index.map(find_day)), inplace=True)

# drop invalid values and collect ohlc
ans = df["rate"][df.index.notnull()].resample("D").ohlc()

Result

print(ans)

                open      high       low     close
Date                                              
2020-01-06  0.673355  0.673355  0.672240  0.672240
2020-01-07  0.671980  0.672230  0.671895  0.672175
2020-01-08  0.672085  0.672085  0.670580  0.670580
2020-01-09  0.668840  0.670160  0.668840  0.670160
2020-01-10  0.670560  0.670560  0.670560  0.670560

Upvotes: 1

Sander van den Oord

Reputation: 12818

Is this what you are looking for, using both origin and offset as parameters? With origin='start_day' and offset='13h', each 24-hour bin starts at 13:00 and therefore closes on the 12:00 row of the following day:

df.resample('24h', origin='start_day', offset='13h').ohlc()

For your example, this gives me:

                         open      high      low    close
datetime
2020-01-01 13:00:00  0.673355  0.673355  0.66884  0.66946
2020-01-02 13:00:00  0.670160  0.671240  0.66883  0.66957
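
One caveat relative to the question's Friday cutoff: a fixed 24h grid also sweeps the 22:00 and 23:00 Friday rows into that day's bin. If they should be excluded, one option (a sketch, not part of the original answer) is to drop them before resampling:

# drop rows after 21:00 on Fridays (dayofweek: Monday=0 ... Friday=4), then resample as above
mask = (df.index.dayofweek == 4) & (df.index.hour > 21)
df[~mask].resample('24h', origin='start_day', offset='13h').ohlc()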

Upvotes: 1
