Baron Yugovich

Reputation: 4307

Pandas - accumulate values within certain time interval

Suppose I have

df

                          ts  v
0 2014-11-03 03:39:42.200914  1
1 2014-11-03 03:39:42.500914  2
2 2014-11-03 03:39:43.600914  3
3 2014-11-03 03:39:43.620914  4

I want to append a column s, such that in every row, it contains the sum of vs within a 1-second lookback time interval, such as

desired_df
   s                         ts  v
0  1 2014-11-03 03:39:42.200914  1
1  3 2014-11-03 03:39:42.500914  2
2  3 2014-11-03 03:39:43.600914  3
3  7 2014-11-03 03:39:43.620914  4

So, how do I generate this additional column s?

Note that the intervals may overlap, but each must have its right endpoint at a row (datapoint) of the dataframe, i.e. every row in my dataframe must be the right endpoint of an interval of that size.

EDIT: THE ANSWER BELOW IS NOT RIGHT/COMPLETE

EDIT: I would like the solution to work for a generic time interval, such as 14ms, not necessarily just for 1 second.

How about

df['s'] = df.groupby(pd.TimeGrouper(freq='400mS')).transform(numpy.cumsum)

I got

                          ts  v  s
0 2014-11-03 03:39:42.200914  1  1
1 2014-11-03 03:39:42.500914  2  2
2 2014-11-03 03:39:43.600914  3  3
3 2014-11-03 03:39:43.620914  4  7

Isn't the row indexed 1 wrong? s at 03:39:42.500914 should be 2+1=3, not 2, because the first two rows are within 400ms of each other, so they should be added together. Why is it not doing that?

EDIT: When I try

df['s'] = df.groupby(pd.TimeGrouper(freq='340mS')).transform(numpy.cumsum)

It actually performs that grouping (the first two rows are summed together), even though the time interval is smaller:

                            v  s
ts                              
2014-11-03 03:39:42.200914  1  1
2014-11-03 03:39:42.500914  2  3
2014-11-03 03:39:43.600914  3  3
2014-11-03 03:39:43.620914  4  7

So, where are the delimiters (separators) that TimeGrouper places? I want the right endpoint of the interval to coincide with the row I am looking at (the row that s corresponds to).
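(The bin edges can be inspected by iterating the groupby. A sketch using pd.Grouper, the later name for TimeGrouper: the group keys are the left edges of fixed, non-overlapping bins laid on a regular 340ms grid, so a bin's right edge is anchored to the grid, not to any particular row.)

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=pd.to_datetime([
    '2014-11-03 03:39:42.200914',
    '2014-11-03 03:39:42.500914',
    '2014-11-03 03:39:43.600914',
    '2014-11-03 03:39:43.620914',
]))

# Each group key is the left edge of a fixed 340ms bin; printing them
# shows that the edges fall on a grid, not on the row timestamps.
edges = []
for edge, grp in s.groupby(pd.Grouper(freq='340ms')):
    edges.append(edge)
    print(edge, grp.tolist())
```

This is why no choice of frequency can make every row a right endpoint: the bins tile the time axis instead of sliding with the rows.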

Upvotes: 0

Views: 2395

Answers (1)

Anzel

Reputation: 20553

Set ts as the index, group by the second, transform with cumsum() into a new column s, then reset_index, like this:

df
                          ts  v
0 2014-11-03 03:39:42.200914  1
1 2014-11-03 03:39:42.500914  2
2 2014-11-03 03:39:43.600914  3
3 2014-11-03 03:39:43.620914  4

df = df.set_index('ts')

df['s'] = df.groupby(lambda x: x.second)['v'].transform('cumsum')
df = df.reset_index()
df

                          ts  v  s
0 2014-11-03 03:39:42.200914  1  1
1 2014-11-03 03:39:42.500914  2  3
2 2014-11-03 03:39:43.600914  3  3
3 2014-11-03 03:39:43.620914  4  7

You may want to reorder the columns:

df = df[['s','ts','v']]
df
   s                         ts  v
0  1 2014-11-03 03:39:42.200914  1
1  3 2014-11-03 03:39:42.500914  2
2  3 2014-11-03 03:39:43.600914  3
3  7 2014-11-03 03:39:43.620914  4
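(A caveat, not part of the original answer: grouping on x.second keeps only the seconds component, so timestamps a minute or an hour apart share a group key and would wrongly be summed together on longer data. A quick illustration:)

```python
import pandas as pd

# These two timestamps are a full minute apart, yet .second is 42 for
# both, so groupby(lambda x: x.second) would put them in the same group.
idx = pd.to_datetime(['2014-11-03 03:39:42', '2014-11-03 03:40:42'])
seconds = [t.second for t in idx]
print(seconds)
```

It happens to work here only because the sample data spans less than a minute.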

Updated

As OP has updated that a generic method is needed, pd.TimeGrouper can be used:

Another update (full steps provided)

df = pd.DataFrame([['2014-11-03 03:39:42.200914', 1],
                   ['2014-11-03 03:39:42.500914', 2],
                   ['2014-11-03 03:39:43.600914', 3],
                   ['2014-11-03 03:39:43.620914', 4]],
                  columns=['ts', 'v'], dtype=object)

# you will get a TypeError if you haven't converted the strings to datetime
df['ts'] = pd.to_datetime(df['ts'])

df = df.set_index('ts')

See this line

# from the doc we need to add closed='left' to include the first bin's count
df['s'] = df.groupby(pd.TimeGrouper(freq='340mS', closed='left'))['v'].transform('cumsum')

# reset the index
df = df.reset_index()

# reorder the columns
df = df[['s', 'ts', 'v']]

df
   s                         ts  v
0  1 2014-11-03 03:39:42.200914  1
1  3 2014-11-03 03:39:42.500914  2
2  3 2014-11-03 03:39:43.600914  3
3  7 2014-11-03 03:39:43.620914  4

However, on '400mS' I agree we are still not getting the desired result.
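(Fixed, grid-anchored bins can never put a right endpoint at every row, whatever the frequency. A brute-force sketch that does exactly what the question asks, O(n²) but independent of the pandas version:)

```python
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2014-11-03 03:39:42.200914',
        '2014-11-03 03:39:42.500914',
        '2014-11-03 03:39:43.600914',
        '2014-11-03 03:39:43.620914',
    ]),
    'v': [1, 2, 3, 4],
})

window = pd.Timedelta('400ms')
# For each row, sum v over the half-open lookback interval (ts - window, ts],
# so the right endpoint of every window is the row's own timestamp.
df['s'] = [
    df.loc[(df['ts'] > t - window) & (df['ts'] <= t), 'v'].sum()
    for t in df['ts']
]
print(df[['s', 'ts', 'v']])
```

Changing `window` to any pd.Timedelta (e.g. '14ms') generalizes the interval.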

Upvotes: 3
