PR102012

Reputation: 878

PANDAs: create 'filled' column with incremental datetime values, in between 2 datetimes (range) columns

I have a pandas dataframe with two datetime columns, 'startdt' and 'enddt'. I'd like to create a new column, grouped by each combination of 'startdt' and 'enddt', filled down the rows with values in 10-minute increments from 'startdt' up to and including 'enddt'.

Here is an example with two groupings of 'startdt' and 'enddt'. Note that the last step in a group will usually be shorter than 10 minutes, since the sequence must end exactly at 'enddt' to capture the full range between 'startdt' and 'enddt'.

startdt             endt                newcol
2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:40:00
2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:50:00
2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:55:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:04:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:14:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:24:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:33:00

'newcol' will obviously duplicate the other two columns of the df across rows, but each row of 'newcol' will be unique, stepping in 10-minute (or shorter, for the last row of a group) increments between 'startdt' and 'enddt'.
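For reference, a minimal input frame matching the two ranges above (one row per range; column names taken from the example table) might look like:

```python
import pandas as pd

# hypothetical reconstruction of the input: one row per (startdt, endt) range
df = pd.DataFrame({
    'startdt': pd.to_datetime(['2017-05-27 11:30:00', '2017-05-27 14:54:00']),
    'endt': pd.to_datetime(['2017-05-27 11:55:00', '2017-05-27 15:33:00']),
})
print(df)
```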

Upvotes: 0

Views: 512

Answers (1)

Maarten Fabré

Reputation: 7058

Define a custom (generic) transform function

import pandas as pd

def transform_func(row, freq, include_last):
    start = row['startdt'].min()
    end = row['endt'].max()
    # pd.date_range replaces the old DatetimeIndex(start=..., end=..., freq=...)
    # constructor, which has been removed from recent pandas versions
    idx = pd.date_range(start=start, end=end, freq=freq)
    if include_last and idx[-1] != end:
        # append the end point when it does not fall on the frequency grid
        idx = idx.append(pd.DatetimeIndex([end]))
    return pd.DataFrame(data={'newcol': idx})

This takes the start and end and builds a DatetimeIndex from them. You can specify the frequency and whether the interval should always include the end point.
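To see why the explicit append is needed: `pd.date_range` only includes the end when it falls exactly on the frequency grid. For the first range above, 11:55 is not a multiple of 10 minutes past 11:30, so the raw range stops at 11:50:

```python
import pandas as pd

# the raw 10-minute grid between 11:30 and 11:55
idx = pd.date_range(start='2017-05-27 11:30:00',
                    end='2017-05-27 11:55:00', freq='10min')
# last grid point is 11:50; 11:55 is not on the 10-minute grid,
# which is exactly the case the include_last branch handles
print(idx)
```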

Create a specific transform function

You can do this by making the original transform_func less generic and versatile, by using functools.partial, or with a lambda. I chose the lambda:

transform_func10 = lambda x: transform_func(x, freq='10min', include_last=True)

Do the aggregation

Apply this specific function to each group. (On current pandas, GroupBy.apply, rather than agg, is the method that accepts a function returning a DataFrame.)

d = df.groupby(['startdt', 'endt']).apply(transform_func10)

result

                                                    newcol
startdt                 endt         
2017-05-27 11:30:00     2017-05-27 11:55:00   0     2017-05-27 11:30:00
                                              1     2017-05-27 11:40:00
                                              2     2017-05-27 11:50:00
                                              3     2017-05-27 11:55:00
2017-05-27 14:54:00     2017-05-27 15:33:00   0     2017-05-27 14:54:00
                                              1     2017-05-27 15:04:00
                                              2     2017-05-27 15:14:00
                                              3     2017-05-27 15:24:00
                                              4     2017-05-27 15:33:00

Reformat

d.reset_index().drop('level_2', axis=1) yields (the column is already named 'newcol', so no rename is needed):

    startdt                 endt                    newcol
0   2017-05-27 11:30:00     2017-05-27 11:55:00     2017-05-27 11:30:00
1   2017-05-27 11:30:00     2017-05-27 11:55:00     2017-05-27 11:40:00
2   2017-05-27 11:30:00     2017-05-27 11:55:00     2017-05-27 11:50:00
3   2017-05-27 11:30:00     2017-05-27 11:55:00     2017-05-27 11:55:00
4   2017-05-27 14:54:00     2017-05-27 15:33:00     2017-05-27 14:54:00
5   2017-05-27 14:54:00     2017-05-27 15:33:00     2017-05-27 15:04:00
6   2017-05-27 14:54:00     2017-05-27 15:33:00     2017-05-27 15:14:00
7   2017-05-27 14:54:00     2017-05-27 15:33:00     2017-05-27 15:24:00
8   2017-05-27 14:54:00     2017-05-27 15:33:00     2017-05-27 15:33:00
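On pandas 0.25 or later, a sketch of an alternative that skips the MultiIndex cleanup entirely is to build the timestamps per row and use DataFrame.explode (the helper name `make_range` is an assumption, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'startdt': pd.to_datetime(['2017-05-27 11:30:00']),
    'endt': pd.to_datetime(['2017-05-27 11:55:00']),
})

def make_range(row):
    # 10-minute grid from startdt to endt, with endt appended if off-grid
    idx = pd.date_range(row['startdt'], row['endt'], freq='10min')
    if idx[-1] != row['endt']:
        idx = idx.append(pd.DatetimeIndex([row['endt']]))
    return list(idx)

# one list of timestamps per row, then one output row per timestamp
out = df.assign(newcol=df.apply(make_range, axis=1)).explode('newcol')
print(out)
```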

Upvotes: 2
