Reputation: 415
I have two pandas time-series DataFrames, and I want to aggregate the values of one based on the intervals defined by the other. Let me show an example. The first time series is:
date value
0 2016-03-21 10
1 2016-03-25 10
2 2016-04-10 10
3 2016-05-05 10
The second one is a date range with 10-calendar-day intervals derived from the series above. I have already written the code that produces it:
date
0 2016-03-21
1 2016-03-31
2 2016-04-10
3 2016-04-20
4 2016-04-30
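For reference, a grid like this can be built from the first frame with pd.date_range, e.g. (a sketch, essentially what my extraction code does):

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-21', '2016-03-25', '2016-04-10', '2016-05-05']),
    'value': [10, 10, 10, 10],
})

# 10-calendar-day grid from the first date up to (at most) the last date
dates = pd.date_range(df1.date.min(), df1.date.max(), freq='10D')
# -> 2016-03-21, 2016-03-31, 2016-04-10, 2016-04-20, 2016-04-30
```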
I want to write code that produces this resulting DataFrame:
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
Could you please suggest a way to do this in Python, preferably without loops?
Upvotes: 4
Views: 337
Reputation: 323226
Just had time to add my solution, using numpy broadcasting:
s1 = df1.date.values
s2 = df2.date.values
# day offset of each df1 date from each df2 grid date; keep the
# half-open window [grid date, grid date + 10 days)
days = (s1 - s2[:, None]) / np.timedelta64(1, 'D')
a = ((days >= 0) & (days < 10)).dot(df1.value.values)
a
Out[183]: array([20,  0, 10,  0, 10], dtype=int64)
#df2['value']=a
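To make the broadcast concrete, here is a self-contained sketch (df1/df2 rebuilt inline from the question) that materializes the intermediate day-offset matrix; each row corresponds to one df2 grid date and each column to one df1 date:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-21', '2016-03-25', '2016-04-10', '2016-05-05']),
    'value': [10, 10, 10, 10],
})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-21', '2016-03-31', '2016-04-10',
                            '2016-04-20', '2016-04-30']),
})

s1 = df1.date.values
s2 = df2.date.values

# shape (len(df2), len(df1)): entry (i, j) is how many days
# df1.date[j] falls after df2.date[i]
days = (s1 - s2[:, None]) / np.timedelta64(1, 'D')
mask = (days >= 0) & (days < 10)   # df1 row belongs to this 10-day bin
out = mask.dot(df1.value.values)   # boolean matrix product sums values per bin
print(out)                         # -> [20  0 10  0 10]
```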
Upvotes: 3
Reputation: 294258
searchsorted
This is the first thing I thought of but it wasn't trivial to iron out. @Vaishali's answer is in spirit very similar to this and simpler. But I'm like a dog with a bone and I can't let it go until I figure it out.
To explain a little: searchsorted will go through an array, in this case the equally spaced dates, and find where in another array each element would be placed in order to maintain sortedness. This sounds complicated, but if we visualize it, we can see what is going on. I'll use letters to demonstrate, choosing letters that correspond with the dates.
x = np.array([*'abdg'])
y = np.array([*'acdef'])
Notice that for each letter in x, I found the index of its backstop in y:
# i -> 0 0 2 4
# x -> a b d g
# y -> a c d e f
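The letter example above runs as-is; `side='right'` minus one gives, for each element of x, the index of the rightmost y element that is less than or equal to it:

```python
import numpy as np

x = np.array([*'abdg'])
y = np.array([*'acdef'])

# index in y of the "backstop" (rightmost element <= each x element)
i = y.searchsorted(x, side='right') - 1
print(i)   # -> [0 0 2 4]
```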
This works out to what I do below.
df = pd.DataFrame(dict(
date=pd.to_datetime(['2016-03-21', '2016-03-25', '2016-04-10', '2016-05-05']),
value=[10, 10, 10, 10]
))
dates = pd.date_range(df.date.min(), df.date.max(), freq='10D')
d = df.date.values
v = df.value.values
i = dates.searchsorted(d, side='right') - 1
a = np.zeros(len(dates), dtype=v.dtype)
np.add.at(a, i, v)
pd.DataFrame(dict(
date=dates, value=a
))
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
You'll notice I used np.add.at in order to sum v at just the right spots. I could also have done this with np.bincount. I prefer the approach above because np.bincount casts to float even when v is of integer type.
d = df.date.values
v = df.value.values
i = dates.searchsorted(d, side='right') - 1
pd.DataFrame(dict(
date=dates, value=np.bincount(i, v).astype(v.dtype)
))
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
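The float cast mentioned above is easy to verify in isolation; with weights, np.bincount always returns float64, which is why the .astype(v.dtype) is needed:

```python
import numpy as np

v = np.array([10, 10, 10, 10], dtype=np.int64)
i = np.array([0, 0, 2, 4])   # bin index for each value, as from searchsorted

counts = np.bincount(i, weights=v)
print(counts.dtype)   # -> float64, even though v is int64
print(counts)         # -> [20.  0. 10.  0. 10.]
```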
Upvotes: 4
Reputation: 38415
You can bin the data in df1 using the dates in df2 as bin edges:
bins = pd.date_range(df2.date.min(), df2.date.max() + pd.DateOffset(10), freq = '10D')
labels = df2.date
df1.groupby(pd.cut(df1.date, bins = bins, right = False, labels = labels)).value.sum().reset_index()
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
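For copy-paste testing, here is the same pd.cut approach with the two frames (df1/df2, assumed from the question) built inline; observed=False is added so empty bins are kept on newer pandas versions:

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-21', '2016-03-25', '2016-04-10', '2016-05-05']),
    'value': [10, 10, 10, 10],
})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-21', '2016-03-31', '2016-04-10',
                            '2016-04-20', '2016-04-30']),
})

# one extra edge past the last grid date so the final interval exists;
# right=False makes each bin the half-open window [edge, edge + 10 days)
bins = pd.date_range(df2.date.min(), df2.date.max() + pd.DateOffset(10), freq='10D')
res = (df1.groupby(pd.cut(df1.date, bins=bins, right=False, labels=df2.date),
                   observed=False)
          .value.sum()
          .reset_index())
print(res)
```

which reproduces the table above.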
Upvotes: 6