Reputation: 1290
I am trying to use the pandas.DataFrame.rolling.apply() rolling function on multiple columns.
Python version is 3.7, pandas is 1.0.2.
import pandas as pd

# function to calculate
def masscenter(x):
    print(x)  # for debug purposes
    return 0

# simple DF creation routine
df = pd.DataFrame([['02:59:47.000282', 87.60, 739],
                   ['03:00:01.042391', 87.51, 10],
                   ['03:00:01.630182', 87.51, 10],
                   ['03:00:01.635150', 88.00, 792],
                   ['03:00:01.914104', 88.00, 10]],
                  columns=['stamp', 'price', 'nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)
'stamp' is monotonic and unique, 'price' is double and contains no NaNs, 'nQty' is integer and also contains no NaNs.
So, I need to calculate a rolling 'center of mass', i.e. sum(price*nQty)/sum(nQty).
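To make the formula concrete, here is a minimal sketch that evaluates it by hand for a single window made of the first two sample rows. The variable names are only illustrative.

```python
import numpy as np

# First two rows of the sample data above.
price = np.array([87.60, 87.51])
qty = np.array([739, 10])

# Quantity-weighted average price over this one window:
# sum(price * nQty) / sum(nQty)
center = np.sum(price * qty) / np.sum(qty)
print(round(center, 6))
```

The large quantity on the first row pulls the result very close to 87.60, which is the behaviour a rolling version should reproduce window by window.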
What I tried so far:
df.apply(masscenter, axis = 1)
masscenter is called 5 times, each time with a single row, and the output looks like
price 87.6
nQty 739.0
Name: 1900-01-01 02:59:47.000282, dtype: float64
This is the desired input to masscenter, because I can easily access price and nQty using x[0], x[1]. However, I am stuck with rolling.apply().
Reading the docs for DataFrame.rolling() and rolling.apply(), I supposed that using 'axis' in rolling() and 'raw' in apply one achieves similar behaviour. A naive approach
rol = df.rolling(window=2)
rol.apply(masscenter)
prints row by row (increasing number of rows up to window size)
stamp
1900-01-01 02:59:47.000282 87.60
1900-01-01 03:00:01.042391 87.51
dtype: float64
then
stamp
1900-01-01 02:59:47.000282 739.0
1900-01-01 03:00:01.042391 10.0
dtype: float64
So, the columns are passed to masscenter separately (expected).
Sadly, in the docs there is barely any info about 'axis'. However the next variant was, obviously
rol = df.rolling(window=2, axis = 1)
rol.apply(masscenter)
This never calls masscenter and raises a ValueError in rol.apply(..):
> Length of passed values is 1, index implies 5
I admit that I'm not sure about the 'axis' parameter and how it works, due to the lack of documentation. This is the first part of the question:
What is going on here? How to use 'axis' properly? What is it designed for?
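As far as I understand the documentation, axis=1 makes the window slide across the columns within each row instead of down the rows, so it does not hand several columns to the function at once. Later pandas releases (2.1 onwards, as far as I know) deprecate the keyword and recommend transposing the frame instead. A minimal sketch of that transposed form, on a made-up three-column frame:

```python
import pandas as pd

# Hypothetical toy frame, not the data from the question.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Roll a 2-wide window across the columns of each row:
# transpose, roll down the rows, transpose back.
across_columns = df.T.rolling(window=2).mean().T
print(across_columns)
```

Each output cell is the mean of that column and the column to its left in the same row, so the first column is NaN. This is a different operation from applying one function to a multi-column window of rows.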
Of course, there were answers previously, namely:
How-to-apply-a-function-to-two-columns-of-pandas-dataframe
It works for the whole DataFrame, not Rolling.
How-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column
The answer suggests writing my own roll function, but the sticking point for me is the same as asked in the comments: what if one needs to use an offset window size (e.g. '1T') for non-uniform timestamps?
I don't like the idea of reinventing the wheel. Also, I'd like to use pandas for everything to prevent inconsistency between sets obtained from pandas and a 'self-made roll'.
There is another answer to that question, suggesting to populate the dataframe separately and calculate whatever I need, but it will not work: the size of the stored data would be enormous.
The same idea presented here:
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments
Another Q & A posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, there is no possibility to use offset window sizes (window = '1T').
Some of the questions were asked before pandas 1.0 came out, and given that the docs could be much better, I hope it is possible to roll over multiple columns simultaneously now.
The second part of the question is: Is there any possibility to roll over multiple columns simultaneously using pandas 1.0.x with offset window size?
Upvotes: 40
Views: 48462
Reputation: 11
(df['price'] * df['nQty']).rolling(2).sum() / df['nQty'].rolling(2).sum()
# output
stamp
1900-01-01 02:59:47.000282 NaN
1900-01-01 03:00:01.042391 87.598798
1900-01-01 03:00:01.630182 87.510000
1900-01-01 03:00:01.635150 87.993890
1900-01-01 03:00:01.914104 88.000000
dtype: float64
You can take a rolling sum of the price*nQty part and of the nQty part, then divide one by the other to get the weighted mean. The same solution can be used with an offset window size.
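A minimal sketch of the offset-window variant, assuming the sample data from the question and a window of '1s' (picked only because it suits the sample timestamps):

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [87.60, 87.51, 87.51, 88.00, 88.00],
     "nQty": [739, 10, 10, 792, 10]},
    index=pd.to_datetime(
        ["02:59:47.000282", "03:00:01.042391", "03:00:01.630182",
         "03:00:01.635150", "03:00:01.914104"],
        format="%H:%M:%S.%f",
    ),
)

window = "1s"  # assumed offset; any offset alias should work the same way

# Two independent rolling sums over the same time-based window,
# then an element-wise division.
weighted = (df["price"] * df["nQty"]).rolling(window).sum()
total = df["nQty"].rolling(window).sum()
center = weighted / total
print(center)
```

With an offset window the first row is no longer NaN, because time-based windows default to min_periods=1. This only works for functions that can be decomposed into per-column rolling aggregates.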
Upvotes: 1
Reputation: 2071
How about this:
import pandas as pd
def masscenter(ser: pd.Series, df: pd.DataFrame):
    df_roll = df.loc[ser.index]
    return your_actual_masscenter(df_roll)
masscenter_output = df['price'].rolling(window=3).apply(masscenter, args=(df,))
It uses the rolling logic to get subsets via an arbitrary column. The arbitrary column itself is not used, only the rolling index is used. This relies on the default of raw=False, which provides the index values for those subsets. The applied function uses those index values to get multi-column slices from the original dataframe.
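Here is a sketch of the same idea filled in for the question's data, with the placeholder replaced by an actual weighted mean and an assumed offset window of '1s'. It assumes the index is unique; with duplicate labels, df.loc[ser.index] would return extra rows.

```python
import numpy as np
import pandas as pd

def masscenter(ser: pd.Series, df: pd.DataFrame) -> float:
    # ser only supplies the window's index labels; its values are ignored.
    window = df.loc[ser.index]
    return np.sum(window["price"] * window["nQty"]) / np.sum(window["nQty"])

df = pd.DataFrame(
    {"price": [87.60, 87.51, 87.51, 88.00, 88.00],
     "nQty": [739, 10, 10, 792, 10]},
    index=pd.to_datetime(
        ["02:59:47.000282", "03:00:01.042391", "03:00:01.630182",
         "03:00:01.635150", "03:00:01.914104"],
        format="%H:%M:%S.%f",
    ),
)

# raw=False (the default) is what passes a Series with its index.
result = df["price"].rolling("1s").apply(masscenter, args=(df,), raw=False)
print(result)
```

The label lookup runs once per window in Python, so this is flexible but likely slow on large frames.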
Upvotes: 46
Reputation: 868
To perform a rolling window operation with access to all columns of a dataframe, you can pass method='table' to rolling(). Example:
import pandas as pd
import numpy as np
from numba import jit
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 3, 5, 7, 9, 11]})
@jit
def f(w):
    # we have access to both columns of the dataframe here
    return np.max(w), np.min(w)
df.rolling(3, method='table').apply(f, raw=True, engine='numba')
It should be noted that method='table' requires the numba engine (pip install numba). The @jit part in the example is not mandatory but helps with performance. The result of the above example code will be:
a | b |
---|---|
NaN | NaN |
NaN | NaN |
5.0 | 1.0 |
7.0 | 2.0 |
9.0 | 3.0 |
11.0 | 4.0 |
Upvotes: 6
Reputation: 351
How about this?
import numpy as np
import pandas as pd

ggg = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [7, 6, 5, 4, 3, 2, 1]})

def my_rolling_apply2(df, fun, window):
    # pad the positions that do not yet have a full window
    prepend = [None] * (window - 1)
    end = len(df) - window
    mid = map(lambda start: fun(df[start:start + window]), np.arange(0, end))
    last = fun(df[end:])
    return [*prepend, *mid, last]

my_rolling_apply2(ggg, lambda df: (df["a"].max(), df["b"].min()), 3)
And result is:
[None, None, (3, 5), (4, 4), (5, 3), (6, 2), (7, 1)]
Upvotes: 0
Reputation: 80192
With reference to the excellent answer from @saninstein.
Install numpy_ext from: https://pypi.org/project/numpy-ext/
import numpy as np
import pandas as pd
from numpy_ext import rolling_apply as rolling_apply_ext
def box_sum(a, b):
    return np.sum(a) + np.sum(b)
df = pd.DataFrame({"x": [1,2,3,4], "y": [1,2,3,4]})
window = 2
df["sum"] = rolling_apply_ext(box_sum, window , df.x.values, df.y.values)
Output:
print(df.to_string(index=False))
x y sum
1 1 NaN
2 2 6.0
3 3 10.0
4 4 14.0
Notes
The import aliases rolling_apply as rolling_apply_ext so it cannot possibly interfere with any existing calls to Pandas rolling_apply (thanks to comment by @LudoSchmidt).
As a side note, I gave up trying to use Pandas. It's fundamentally broken: it handles single-column aggregate and apply with few problems, but it's an overly complex Rube Goldberg machine when trying to get it to work with two or more columns.
Upvotes: 6
Reputation: 1290
So I found no way to roll over two columns with built-in pandas functions, and wrote a workaround without them. The code is listed below.
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# function to find an index corresponding
# to current value minus offset value
def prevInd(series, offset, date):
    offset = to_offset(offset)
    end_date = date - offset
    end = series.index.searchsorted(end_date, side="left")
    return end

# function to find an index corresponding
# to the first value greater than current
# it is useful when one has timeseries with non-unique
# but monotonically increasing values
def nextInd(series, date):
    end = series.index.searchsorted(date, side="right")
    return end

def twoColumnsRoll(dFrame, offset, usecols, fn, columnName='twoColRol'):
    # find all unique indices
    uniqueIndices = dFrame.index.unique()
    numOfPoints = len(uniqueIndices)
    # prepare an output array
    moving = np.zeros(numOfPoints)
    # nameholders
    price = dFrame[usecols[0]]
    qty = dFrame[usecols[1]]

    # iterate over unique indices
    for ii in range(numOfPoints):
        # nameholder
        pp = uniqueIndices[ii]
        # right index - value greater than current
        rInd = nextInd(dFrame, pp)
        # left index - the least value that
        # is bigger or equal than (pp - offset)
        lInd = prevInd(dFrame, offset, pp)
        # call the actual calculating function over two arrays
        moving[ii] = fn(price.iloc[lInd:rInd], qty.iloc[lInd:rInd])

    # construct and return DataFrame
    return pd.DataFrame(data=moving, index=uniqueIndices, columns=[columnName])
This code works, but it is relatively slow and inefficient. I suppose one can use numpy.lib.stride_tricks from How to invoke pandas.rolling.apply with parameters from multiple column? to speed things up.
However, go big or go home - I ended up writing a function in C++ and a wrapper for it.
I'd rather not post this as an answer, since it is a workaround and I have answered neither part of my question, but it is too long for a comment.
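For reference, a rough sketch of what the stride_tricks route could look like for the center-of-mass case. It assumes NumPy 1.20 or newer (for sliding_window_view) and a fixed window counted in rows, so it does not cover offset windows. The function name rolling_masscenter is made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_masscenter(price, qty, window):
    """Fixed-size rolling sum(price * qty) / sum(qty); NaN until the window fills."""
    price = np.asarray(price, dtype=float)
    qty = np.asarray(qty, dtype=float)
    out = np.full(len(price), np.nan)
    if len(price) < window:
        return out  # not enough rows for even one full window
    # Each row of these 2-D views is one window over the original array.
    p = sliding_window_view(price, window)
    q = sliding_window_view(qty, window)
    out[window - 1:] = (p * q).sum(axis=1) / q.sum(axis=1)
    return out

result = rolling_masscenter([87.60, 87.51, 87.51, 88.00, 88.00],
                            [739, 10, 10, 792, 10], window=2)
print(result)
```

Both columns are windowed with the same shape, so the function sees them side by side without a Python-level loop.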
Upvotes: 1
Reputation: 301
You can use the rolling_apply function from the numpy_ext module:
import numpy as np
import pandas as pd
from numpy_ext import rolling_apply
def masscenter(price, nQty):
    return np.sum(price * nQty) / np.sum(nQty)
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
['03:00:01.042391', 87.51, 10],
['03:00:01.630182', 87.51, 10],
['03:00:01.635150', 88.00, 792],
['03:00:01.914104', 88.00, 10]],
columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)
window = 2
df['y'] = rolling_apply(masscenter, window, df.price.values, df.nQty.values)
print(df)
price nQty y
stamp
1900-01-01 02:59:47.000282 87.60 739 NaN
1900-01-01 03:00:01.042391 87.51 10 87.598798
1900-01-01 03:00:01.630182 87.51 10 87.510000
1900-01-01 03:00:01.635150 88.00 792 87.993890
1900-01-01 03:00:01.914104 88.00 10 88.000000
Upvotes: 20