Reputation: 878
I have a dataframe with two columns, 'startdt' and 'enddt'. They are datetime objects in a pandas DataFrame. I'd like to create a new column, grouped by each combination of 'startdt' and 'enddt', filled down the rows with 10-minute increments from 'startdt' up to and including 'enddt'.
Here is an example with two groupings of 'startdt' and 'enddt'. Note that the step to the last row in a grouping will almost always be less than 10 minutes, since it must be bounded by the value of 'enddt' so that the full range between 'startdt' and 'enddt' is captured.
startdt endt newcol
2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:40:00
2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:50:00
2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:55:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:04:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:14:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:24:00
2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:33:00
'newcol' will obviously create duplicate rows of the other two columns in the df, but each row will be unique, with 10-minute (or smaller, for the last row of a group) increments between the 'startdt' and 'enddt' values.
Upvotes: 0
Views: 512
Reputation: 7058
def transform_func(group, freq, include_last):
    # receives one (startdt, endt) group at a time
    start = group['startdt'].min()
    end = group['endt'].max()
    # pd.DatetimeIndex(start=..., end=...) was removed in pandas 1.0;
    # pd.date_range is the equivalent constructor
    idx = pd.date_range(start=start, end=end, freq=freq)
    if include_last and idx[-1] != end:
        # close the interval on the right by appending the exact end
        idx = idx.append(pd.DatetimeIndex([end]))
    return pd.DataFrame(data={'newcol': idx})
This takes the start and end of each group and makes a DatetimeIndex based on them. You can specify the frequency and, via include_last, whether the interval should be closed on the right.
You can fix these extra arguments by either making the original transform_func less generic and versatile, by using functools.partial, or by using a lambda. I chose the lambda:
transform_func10 = lambda x: transform_func(x, freq='10Min', include_last=True)
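The functools.partial route mentioned above would look like the sketch below. The function body is repeated so the snippet is self-contained, using pd.date_range in place of the pd.DatetimeIndex(start=...) constructor, which was removed in pandas 1.0:

```python
import pandas as pd
from functools import partial

def transform_func(group, freq, include_last):
    # build the time grid for one (startdt, endt) group
    start = group['startdt'].min()
    end = group['endt'].max()
    idx = pd.date_range(start=start, end=end, freq=freq)
    if include_last and idx[-1] != end:
        # close the interval on the right with the exact end value
        idx = idx.append(pd.DatetimeIndex([end]))
    return pd.DataFrame(data={'newcol': idx})

# partial pins freq and include_last, just like the lambda does
transform_func10 = partial(transform_func, freq='10min', include_last=True)
```

Both spellings are equivalent here; partial has the minor advantage of being picklable, which matters if the function is ever shipped to worker processes.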
Aggregate using this specific function:
d = df.groupby(['startdt', 'endt']).apply(transform_func10)
The result:
newcol
startdt endt
2017-05-27 11:30:00 2017-05-27 11:55:00 0 2017-05-27 11:30:00
1 2017-05-27 11:40:00
2 2017-05-27 11:50:00
3 2017-05-27 11:55:00
2017-05-27 14:54:00 2017-05-27 15:33:00 0 2017-05-27 14:54:00
1 2017-05-27 15:04:00
2 2017-05-27 15:14:00
3 2017-05-27 15:24:00
4 2017-05-27 15:33:00
d.reset_index().drop('level_2', axis=1).rename(columns={0: 'newcol'})
yields:
startdt endt newcol
0 2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:30:00
1 2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:40:00
2 2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:50:00
3 2017-05-27 11:30:00 2017-05-27 11:55:00 2017-05-27 11:55:00
4 2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 14:54:00
5 2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:04:00
6 2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:14:00
7 2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:24:00
8 2017-05-27 14:54:00 2017-05-27 15:33:00 2017-05-27 15:33:00
Upvotes: 2