Rafael Barbosa
Rafael Barbosa

Reputation: 1190

Resample a 'tidy' dataframe with pandas

I have timestamped data with two columns of interest: a 'label' and a count. I would like to create a time series with the sums per label per, say, day. Can I use resample to achieve this?

Concrete example:

import pandas as pd
import numpy as np
from itertools import cycle

idx = pd.date_range('2016-01-01', '2016-01-07', freq='H')
n = np.random.randint(10, size=24*6+1)
lst = [(l,c) for l,c in zip(cycle(['foo', 'bar']), n)]
df = pd.DataFrame(lst, index=idx, columns=['label', 'n'])

df.resample(???).sum()

For this example, the target data frame should contain a time index and two columns (foo and bar) containing the total counts per interval.

Upvotes: 2

Views: 284

Answers (1)

jezrael
jezrael

Reputation: 862406

I think you need groupby with DataFrameGroupBy.resample:

print (df.groupby('label')
         .resample('1D')
         .sum()
         .reset_index()
         .rename(columns={'level_1':'date'}))

   label       date   n
0    bar 2016-01-01  44
1    bar 2016-01-02  60
2    bar 2016-01-03  65
3    bar 2016-01-04  51
4    bar 2016-01-05  37
5    bar 2016-01-06  59
6    foo 2016-01-01  40
7    foo 2016-01-02  69
8    foo 2016-01-03  58
9    foo 2016-01-04  55
10   foo 2016-01-05  67
11   foo 2016-01-06  59
12   foo 2016-01-07   5

Another solution with stack and unstack for working with datetimeindex:

print (df.set_index('label', append=True)
         .unstack(1)
         .resample('1D')
         .sum()
         .stack()
         .reset_index()
         .rename(columns={'level_0':'date'}))

         date label     n
0  2016-01-01   bar  44.0
1  2016-01-01   foo  40.0
2  2016-01-02   bar  60.0
3  2016-01-02   foo  69.0
4  2016-01-03   bar  65.0
5  2016-01-03   foo  58.0
6  2016-01-04   bar  51.0
7  2016-01-04   foo  55.0
8  2016-01-05   bar  37.0
9  2016-01-05   foo  67.0
10 2016-01-06   bar  59.0
11 2016-01-06   foo  59.0
12 2016-01-07   foo   5.0

If need two columns:

df1 = df.set_index('label', append=True).unstack(1).resample('1D').sum()
df1.columns = df1.columns.droplevel(0)
print (df1)
label        bar   foo
2016-01-01  61.0  65.0
2016-01-02  54.0  56.0
2016-01-03  70.0  53.0
2016-01-04  46.0  49.0
2016-01-05  61.0  49.0
2016-01-06  50.0  55.0
2016-01-07   NaN   6.0

Upvotes: 2

Related Questions