Apostolos
Apostolos

Reputation: 8101

Bin in 10min samples

I have a pandas dataframe that consists of the following columns

col1, col2, _time

_time column is datetime object of when the row occured in time.

I want to resample my dataframe in 10min periods groupby both columns and aggregate the number of rows for each group that occured for every 10min period of time. I want the produced dataframe to have the following columns

col1 col2 since until count

Where since is the begining of each 10min period until the finish of each 10min period of time and count the number of rows that where found on the initial dataframe something like

col1  col2          since                  until         count
1      1       08/12/2017 12:00      08/12/2017 12:10       10
1      2       08/12/2017 12:00      08/12/2017 12:10        5
1      1       08/12/2017 12:10      08/12/2017 12:20        3

is this possible with the resample method of dataframes?

Upvotes: 1

Views: 611

Answers (2)

Anton vBR
Anton vBR

Reputation: 18916

This example can possibly help you in some way if you are still looking for an answer.

import pandas as pd
import numpy as np
import datetime

# create some random data
df = pd.DataFrame(columns=["col1","col2","timestamp"])
df.col1 = np.random.randint(100, size = 10)
df.col2 = np.random.randint(100, size = 10)
df.timestamp = [datetime.datetime(2000,1,1) + \
            datetime.timedelta(hours=int(i)) for i in np.random.randint(100, size = 10)]

# sort data by timestamp and reset index
df = df.sort_values(by="timestamp").reset_index(drop=True)

# create the bins by taking last first time and last time with freq 6h
bins = pd.date_range(start=df.timestamp.values[0],end=df.timestamp.values[-1], freq="6h") # change to reasonable freq (d, h, m, s) 
# zip them to pairs
startend =  list(zip(bins, bins.shift(1)))

# define a function that finds bin index
def time_in_range(x):
    """Return true if x is in the range [start, end]"""
    for ind,(start,end) in enumerate(startend):
        if start <= x <= end:
            return ind


# Add bin index to column named index
df['index'] = df.timestamp.apply(time_in_range)
# groupby index to find sum and count
df = df.groupby('index')["col1","col2"].agg(['sum','count']).reset_index()


# Create output df2 (with bins)        
df2 = pd.DataFrame(startend, columns=["start","end"]).reset_index()

# Join the two dataframes with column index
df3 =pd.merge(df2, df, how='outer', on='index').fillna(0)

# Final adjustments
df3.columns = ["index","start","end","col1","delete","col2","count"]
df3.drop(['delete','index'], axis=1, inplace=True)

Outputs:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>start</th>      <th>end</th>      <th>col1</th>      <th>col2</th>      <th>count</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>2000-01-01 21:00:00</td>      <td>2000-01-02 03:00:00</td>      <td>89.0</td>      <td>136.0</td>      <td>2.0</td>    </tr>    <tr>      <th>1</th>      <td>2000-01-02 03:00:00</td>      <td>2000-01-02 09:00:00</td>      <td>0.0</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>2</th>      <td>2000-01-02 09:00:00</td>      <td>2000-01-02 15:00:00</td>      <td>69.0</td>      <td>27.0</td>      <td>1.0</td>    </tr>    <tr>      <th>3</th>      <td>2000-01-02 15:00:00</td>      <td>2000-01-02 21:00:00</td>      <td>0.0</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>4</th>      <td>2000-01-02 21:00:00</td>      <td>2000-01-03 03:00:00</td>      <td>0.0</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>5</th>      <td>2000-01-03 03:00:00</td>      <td>2000-01-03 09:00:00</td>      <td>0.0</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>6</th>      <td>2000-01-03 09:00:00</td>      <td>2000-01-03 15:00:00</td>      <td>108.0</td>      <td>57.0</td>      <td>2.0</td>    </tr>    <tr>      <th>7</th>      <td>2000-01-03 15:00:00</td>      <td>2000-01-03 21:00:00</td>      <td>35.0</td>      <td>85.0</td>      <td>2.0</td>    </tr>    <tr>      <th>8</th>      <td>2000-01-03 21:00:00</td>      <td>2000-01-04 03:00:00</td>      <td>102.0</td>      <td>92.0</td>      <td>2.0</td>    </tr>    <tr>      <th>9</th>      <td>2000-01-04 03:00:00</td>      <td>2000-01-04 09:00:00</td>      <td>0.0</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>10</th>      <td>2000-01-04 09:00:00</td>      <td>2000-01-04 15:00:00</td>      <td>0.0</td>      <td>0.0</td>      <td>0.0</td>    </tr>    <tr>      <th>11</th>      <td>2000-01-04 15:00:00</td>      <td>2000-01-04 21:00:00</td>      <td>91.0</td>      <td>3.0</td>      <td>1.0</td>    </tr>  </tbody></table>

Upvotes: 0

thorbjornwolf
thorbjornwolf

Reputation: 1848

I too have previously been looking at resample for this, to no avail. Luckily, I found a solution using pd.Series.dt.floor!

  1. Use .dt.floor to align your timestamps to 10-minute intervals,
  2. Use the resulting object in a groupby (or, optionally, assign it to a column in your source data, and use the column)
  3. Use pd.to_timedelta to calculate the until column from your since column

For instance,

import pandas as pd

interval = '10min'  # 10 minutes intervals, please

# Dummy data with 3-minute intervals
data = pd.DataFrame({
    'col1': [0, 0, 1, 0, 0, 0, 1, 0, 1, 1], 
    'col2': [4, 4, 4, 3, 4, 4, 3, 3, 4, 4], 
    '_time': pd.date_range(start='2010-01-01 00:01:00', freq='3min', periods=10),
})

# Floor the timestamps to your desired interval
since = data['_time'].dt.floor(interval).rename('since')

# Get the size of each group - groups are in the index of `agg`
agg = data.groupby(['col1', 'col2', since]).size()
agg = agg.rename('count')

# Back to dataframe
agg = agg.reset_index()

# Simply add your interval to `since`
agg['until'] = agg['since'] + pd.to_timedelta(interval)

print(agg)

   col1  col2               since  count               until
0     0     3 2010-01-01 00:10:00      1 2010-01-01 00:20:00
1     0     3 2010-01-01 00:20:00      1 2010-01-01 00:30:00
2     0     4 2010-01-01 00:00:00      2 2010-01-01 00:10:00
3     0     4 2010-01-01 00:10:00      2 2010-01-01 00:20:00
4     1     3 2010-01-01 00:10:00      1 2010-01-01 00:20:00
5     1     4 2010-01-01 00:00:00      1 2010-01-01 00:10:00
6     1     4 2010-01-01 00:20:00      2 2010-01-01 00:30:00

Upvotes: 2

Related Questions