Reputation: 8101
I have a pandas dataframe that consists of the following columns
col1, col2, _time
The _time column is a datetime object indicating when each row occurred.
I want to resample my dataframe into 10-minute periods, group by both columns, and count the number of rows for each group within every 10-minute period. I want the produced dataframe to have the following columns
col1 col2 since until count
where since is the beginning of each 10-minute period, until is the end of that period, and count is the number of rows from the initial dataframe that fall within it, something like
col1 col2 since until count
1 1 08/12/2017 12:00 08/12/2017 12:10 10
1 2 08/12/2017 12:00 08/12/2017 12:10 5
1 1 08/12/2017 12:10 08/12/2017 12:20 3
Is this possible with the resample method of dataframes?
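For reference, a made-up frame with the layout described above (the values are only illustrative) could look like this:
import pandas as pd
# illustrative data only; the real values come from my application
df = pd.DataFrame({
    'col1': [1, 1, 1, 2],
    'col2': [1, 2, 1, 1],
    '_time': pd.to_datetime(['2017-12-08 12:01', '2017-12-08 12:04',
                             '2017-12-08 12:07', '2017-12-08 12:12']),
})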
Upvotes: 1
Views: 611
Reputation: 18916
This example may help if you are still looking for an answer.
import pandas as pd
import numpy as np
import datetime
# create some random data
df = pd.DataFrame(columns=["col1","col2","timestamp"])
df.col1 = np.random.randint(100, size = 10)
df.col2 = np.random.randint(100, size = 10)
df.timestamp = [datetime.datetime(2000, 1, 1) + datetime.timedelta(hours=int(i))
                for i in np.random.randint(100, size=10)]
# sort data by timestamp and reset index
df = df.sort_values(by="timestamp").reset_index(drop=True)
# create the bins from the first and last timestamps with a 6h frequency
bins = pd.date_range(start=df.timestamp.values[0],end=df.timestamp.values[-1], freq="6h") # change to reasonable freq (d, h, m, s)
# zip them to pairs
startend = list(zip(bins, bins.shift(1)))
# define a function that finds bin index
def time_in_range(x):
    """Return the index of the bin whose [start, end] range contains x."""
    for ind, (start, end) in enumerate(startend):
        if start <= x <= end:
            return ind
# Add bin index to column named index
df['index'] = df.timestamp.apply(time_in_range)
# groupby index to find sum and count
df = df.groupby('index')[["col1", "col2"]].agg(['sum', 'count']).reset_index()
# Create output df2 (with bins)
df2 = pd.DataFrame(startend, columns=["start","end"]).reset_index()
# Join the two dataframes with column index
df3 = pd.merge(df2, df, how='outer', on='index').fillna(0)
# Final adjustments
df3.columns = ["index","start","end","col1","delete","col2","count"]
df3.drop(['delete','index'], axis=1, inplace=True)
Outputs:
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>start</th> <th>end</th> <th>col1</th> <th>col2</th> <th>count</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>2000-01-01 21:00:00</td> <td>2000-01-02 03:00:00</td> <td>89.0</td> <td>136.0</td> <td>2.0</td> </tr> <tr> <th>1</th> <td>2000-01-02 03:00:00</td> <td>2000-01-02 09:00:00</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <th>2</th> <td>2000-01-02 09:00:00</td> <td>2000-01-02 15:00:00</td> <td>69.0</td> <td>27.0</td> <td>1.0</td> </tr> <tr> <th>3</th> <td>2000-01-02 15:00:00</td> <td>2000-01-02 21:00:00</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <th>4</th> <td>2000-01-02 21:00:00</td> <td>2000-01-03 03:00:00</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <th>5</th> <td>2000-01-03 03:00:00</td> <td>2000-01-03 09:00:00</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <th>6</th> <td>2000-01-03 09:00:00</td> <td>2000-01-03 15:00:00</td> <td>108.0</td> <td>57.0</td> <td>2.0</td> </tr> <tr> <th>7</th> <td>2000-01-03 15:00:00</td> <td>2000-01-03 21:00:00</td> <td>35.0</td> <td>85.0</td> <td>2.0</td> </tr> <tr> <th>8</th> <td>2000-01-03 21:00:00</td> <td>2000-01-04 03:00:00</td> <td>102.0</td> <td>92.0</td> <td>2.0</td> </tr> <tr> <th>9</th> <td>2000-01-04 03:00:00</td> <td>2000-01-04 09:00:00</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <th>10</th> <td>2000-01-04 09:00:00</td> <td>2000-01-04 15:00:00</td> <td>0.0</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <th>11</th> <td>2000-01-04 15:00:00</td> <td>2000-01-04 21:00:00</td> <td>91.0</td> <td>3.0</td> <td>1.0</td> </tr> </tbody></table>
Upvotes: 0
Reputation: 1848
I too have previously been looking at resample for this, to no avail. Luckily, I found a solution using pd.Series.dt.floor! Use .dt.floor to align your timestamps to 10-minute intervals and pd.to_timedelta to calculate the until column from your since column. For instance,
import pandas as pd
interval = '10min' # 10-minute intervals, please
# Dummy data with 3-minute intervals
data = pd.DataFrame({
'col1': [0, 0, 1, 0, 0, 0, 1, 0, 1, 1],
'col2': [4, 4, 4, 3, 4, 4, 3, 3, 4, 4],
'_time': pd.date_range(start='2010-01-01 00:01:00', freq='3min', periods=10),
})
# Floor the timestamps to your desired interval
since = data['_time'].dt.floor(interval).rename('since')
# Get the size of each group - groups are in the index of `agg`
agg = data.groupby(['col1', 'col2', since]).size()
agg = agg.rename('count')
# Back to dataframe
agg = agg.reset_index()
# Simply add your interval to `since`
agg['until'] = agg['since'] + pd.to_timedelta(interval)
print(agg)
col1 col2 since count until
0 0 3 2010-01-01 00:10:00 1 2010-01-01 00:20:00
1 0 3 2010-01-01 00:20:00 1 2010-01-01 00:30:00
2 0 4 2010-01-01 00:00:00 2 2010-01-01 00:10:00
3 0 4 2010-01-01 00:10:00 2 2010-01-01 00:20:00
4 1 3 2010-01-01 00:10:00 1 2010-01-01 00:20:00
5 1 4 2010-01-01 00:00:00 1 2010-01-01 00:10:00
6 1 4 2010-01-01 00:20:00 2 2010-01-01 00:30:00
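To connect this back to the question about resample: roughly the same table can be produced with pd.Grouper, which lets groupby do the 10-minute binning itself. A minimal sketch, assuming the same data and interval as above (groups with zero rows are dropped, just as with the .dt.floor approach):
# sketch: equivalent grouping using pd.Grouper instead of .dt.floor
agg2 = (
    data.groupby(['col1', 'col2', pd.Grouper(key='_time', freq=interval)])
        .size()
        .rename('count')
        .reset_index()
        .rename(columns={'_time': 'since'})
)
agg2['until'] = agg2['since'] + pd.to_timedelta(interval)
print(agg2)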
Upvotes: 2