Reputation: 4307
Suppose I have
df
ts v
0 2014-11-03 03:39:42.200914 1
1 2014-11-03 03:39:42.500914 2
2 2014-11-03 03:39:43.600914 3
3 2014-11-03 03:39:43.620914 4
I want to append a column s such that, in every row, it contains the sum of the v's within a 1-second lookback time interval, such as
desired_df
s ts v
0 1 2014-11-03 03:39:42.200914 1
1 3 2014-11-03 03:39:42.500914 2
2 3 2014-11-03 03:39:43.600914 3
3 7 2014-11-03 03:39:43.620914 4
So, how do I generate this additional column s?
Note that the intervals should overlap, but each must have its right endpoint at a row (datapoint) of the dataframe, i.e. every row (datapoint) in my dataframe must be the right endpoint of an interval of that size.
EDIT: THE ANSWER BELOW IS NOT RIGHT/COMPLETE
EDIT: I would like the solution to work for a generic time interval, such as 14ms, not necessarily just for 1 second.
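(For reference, pandas 0.19 and later can express exactly this lookback sum with a time-based rolling window, for any offset such as '1s' or '14ms'; a minimal sketch, assuming ts is a datetime column sorted ascending:
import pandas as pd
df = pd.DataFrame({'ts': pd.to_datetime(['2014-11-03 03:39:42.200914',
                                         '2014-11-03 03:39:42.500914',
                                         '2014-11-03 03:39:43.600914',
                                         '2014-11-03 03:39:43.620914']),
                   'v': [1, 2, 3, 4]})
# with an offset window, each row is the right endpoint of its own interval
df['s'] = df.set_index('ts')['v'].rolling('1s').sum().values
# df['s'] is now 1.0, 3.0, 3.0, 7.0; use e.g. '14ms' for other window sizes
)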
How about
df['s'] = df.groupby(pd.TimeGrouper(freq='400mS')).transform(numpy.cumsum)
I got
ts v s
0 2014-11-03 03:39:42.200914 1 1
1 2014-11-03 03:39:42.500914 2 2
2 2014-11-03 03:39:43.600914 3 3
3 2014-11-03 03:39:43.620914 4 7
Isn't the row indexed 1 wrong? s at 03:39:42.500914 should be 2+1=3, not 2, because the first two rows are within 400ms of each other, so it should add them together. Why is it not doing that?
EDIT: When I try
df['s'] = df.groupby(pd.TimeGrouper(freq='340mS')).transform(numpy.cumsum)
it actually performs that grouping, even though the time interval is smaller:
v s
ts
2014-11-03 03:39:42.200914 1 1
2014-11-03 03:39:42.500914 2 3
2014-11-03 03:39:43.600914 3 3
2014-11-03 03:39:43.620914 4 7
So, where are the delimiters (separators) that TimeGrouper places? I want the right endpoint of the interval to coincide with the row I am looking at (the row that s corresponds to).
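One way to make the bin edges visible (a sketch, assuming df still has ts as its index, as in the output above) is to iterate over the groupby, which yields each bin's label together with its rows:
# each printed key is the bin's label (its left edge by default);
# TimeGrouper partitions the rows into fixed, non-overlapping bins,
# it does not build a per-row lookback window ending at each row
for bin_edge, grp in df.groupby(pd.TimeGrouper(freq='400mS')):
    print(bin_edge, grp['v'].tolist())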
Upvotes: 0
Views: 2395
Reputation: 20553
Set ts as the index, then groupby the second, transform with cumsum() into a new column s, and finally reset_index, like this:
df
ts v
0 2014-11-03 03:39:42.200914 1
1 2014-11-03 03:39:42.500914 2
2 2014-11-03 03:39:43.600914 3
3 2014-11-03 03:39:43.620914 4
df = df.set_index('ts')
df['s'] = df.groupby(lambda x: x.second)['v'].transform('cumsum')
df = df.reset_index()
df
ts v s
0 2014-11-03 03:39:42.200914 1 1
1 2014-11-03 03:39:42.500914 2 3
2 2014-11-03 03:39:43.600914 3 3
3 2014-11-03 03:39:43.620914 4 7
You may want to reorder the columns:
df = df[['s','ts','v']]
df
s ts v
0 1 2014-11-03 03:39:42.200914 1
1 3 2014-11-03 03:39:42.500914 2
2 3 2014-11-03 03:39:43.600914 3
3 7 2014-11-03 03:39:43.620914 4
As the OP has updated that a generic method is needed, pd.TimeGrouper can be used instead. (Grouping by the second only matched the desired output above because the rows falling in each lookback window happen to share the same second.)
Another update, with full steps provided:
import pandas as pd

df = pd.DataFrame([['2014-11-03 03:39:42.200914', 1],
                   ['2014-11-03 03:39:42.500914', 2],
                   ['2014-11-03 03:39:43.600914', 3],
                   ['2014-11-03 03:39:43.620914', 4]],
                  columns=['ts', 'v'], dtype=object)
# you will get a type error if you haven't converted the strings to datetime
df['ts'] = pd.to_datetime(df['ts'])
df = df.set_index('ts')
# from the docs, we need closed='left' to include the first bin's count
df['s'] = df.groupby(pd.TimeGrouper(freq='340mS', closed='left'))['v'].transform('cumsum')
# reset the index
df = df.reset_index()
# reorder the columns
df = df[['s', 'ts', 'v']]
df
s ts v
0 1 2014-11-03 03:39:42.200914 1
1 3 2014-11-03 03:39:42.500914 2
2 3 2014-11-03 03:39:43.600914 3
3 7 2014-11-03 03:39:43.620914 4
However, with '400mS' I agree we are still not getting the desired result.
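A generic per-row lookback that does produce the desired result, independent of the grouper's bins, can be built with numpy.searchsorted plus a prefix sum; a sketch, assuming df has columns ts (datetime64, sorted ascending) and v, and treating each window as half-open (t - window, t]:
import numpy as np
window = np.timedelta64(1, 's')          # any lookback, e.g. np.timedelta64(14, 'ms')
ts = df['ts'].values                     # datetime64[ns], sorted ascending
# index of the first row inside each row's window (t - window, t]
start = np.searchsorted(ts, ts - window, side='right')
# prefix sums turn every window sum into an O(1) difference
csum = np.concatenate(([0], np.cumsum(df['v'].values)))
df['s'] = csum[1:] - csum[start]
For the sample data this reproduces the desired_df above in the 1-second case, and the same code works unchanged for any interval.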
Upvotes: 3