Reputation: 352
I'd like some feedback on how to optimize the following pandas computation.
We have a fixed index set I and a lookback window length lookback. In addition, we have a pd.Series index, its rolling median over the lookback window, index_MEDIAN, and a large list of pandas DataFrames. All Series/DataFrames have I as their index, and each DataFrame has a column value. Let D be one such DataFrame.
For every row of D we take the corresponding value m in index_MEDIAN and sum all value entries present in the lookback window, subject to the condition that the running value in the index series is greater than m. In other words, whenever the index value is greater than the median over the lookback window, we sum the corresponding value row of D.
To shed more light, here is a sketch of the implementation described above:

def sumvals(x):
    # compare each value in the window against the median at the window's last timestamp
    S = (D['value'].loc[x.index] >= self.index_median.loc[x.index[-1]])
    # sum (value - median) over the rows that pass the comparison
    return sum(S * (x - self.index_median.loc[x.index[-1]]))

D['value'].rolling(lookback).apply(sumvals)
The list of DataFrames is quite large, and I've noticed that this way of computing the quantity takes an excessive amount of time. I suspect the issue is that this implementation uses .loc a lot. Hence: is there another way to express this computation without referencing an external Series so much?
Either way, any kind of optimization suggestion is welcome.
Edit: here is an example dataset with the respective computation.

lookback = 3
Index = pd.Series([1, -2, 8, -10, 3, 4, 5, 10, -20, 3])
Index_median = Index.rolling(lookback).median()
Values = pd.Series([1, 2, 2, 3, 0, 9, 10, 8, 20, 9])
The resulting computation on Values should yield
0 NaN
1 NaN
2 2.0
3 13.0
4 0.0
5 6.0
6 11.0
7 12.0
8 23.0
9 28.0
For example, the value in row 5 is 6. Why? The Index_median value in row 5 is 3, and the 3-lookback window of Values ending at row 5 is the sequence 3, 0, 9. The values >= 3 are 3 and 9, so the sum for row 5 is (3-3) + (9-3) = 6. Similarly, for the last row the index median is 3; the last three entries of Values (8, 20, 9) are all greater than 3 and sum to 37, giving 37 - 3*3 = 28.
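To make the rule explicit, here is a small brute-force loop that reproduces the expected output above for this example (only a correctness reference, deliberately not fast; the loop is my own restatement of the rule, not the original implementation):

import numpy as np
import pandas as pd

lookback = 3
Index = pd.Series([1, -2, 8, -10, 3, 4, 5, 10, -20, 3])
Index_median = Index.rolling(lookback).median()
Values = pd.Series([1, 2, 2, 3, 0, 9, 10, 8, 20, 9])

# for each row, compare the lookback window of Values against that row's
# rolling median of Index and sum the excesses over the median
expected = pd.Series(np.nan, index=Values.index)
for i in range(lookback - 1, len(Values)):
    m = Index_median.iloc[i]
    window = Values.iloc[i - lookback + 1 : i + 1]
    expected.iloc[i] = (window[window >= m] - m).sum()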
Upvotes: 3
Views: 495
Reputation: 1007
Starting with your sample data:

import numpy as np
import pandas as pd

lookback = 3
df = pd.DataFrame()
df['I'] = pd.Series([1, -2, 8, -10, 3, 4, 5, 10, -20, 3])
df['I_median'] = df['I'].rolling(lookback).median()
df['Values'] = pd.Series([1, 2, 2, 3, 0, 9, 10, 8, 20, 9])
Now add shifted columns based on the 'Values' column:

# add one helper column for every position in the lookback window
for colno in range(lookback):
    # shift the column down by colno rows and subtract the median
    df['n' + str(colno)] = df['Values'].shift(colno) - df['I_median']
    # remove all negative numbers (where the value is smaller than the median)
    df['n' + str(colno)] = df['n' + str(colno)].where(df['n' + str(colno)] > 0, 0)

# sum up across the new columns
df['result'] = df[df.columns[-lookback:]].sum(axis=1)
df.result contains your result and equals
0 0.0
1 0.0
2 2.0
3 13.0
4 0.0
5 6.0
6 11.0
7 12.0
8 23.0
9 28.0
Name: result, dtype: float64
If you would rather not create lookback extra columns, the same sum can be accumulated in place with a single temporary column:

df['result'] = 0
for colno in range(lookback):
    # shift the column down by colno rows and subtract the median
    df['temp'] = df['Values'].shift(colno) - df['I_median']
    # remove all negative numbers (where the value is smaller than the median)
    df['temp'] = df['temp'].where(df['temp'] > 0, 0)
    # add this shift's contribution to the running total
    df['result'] = df['result'] + df['temp']
For scale, the same approach on a larger random dataset

lookback = 1000
df = pd.DataFrame()
df['I'] = pd.Series(np.random.randint(0, 10, size=1000000))
df['I_median'] = df['I'].rolling(lookback).median()
df['Values'] = pd.Series(np.random.randint(0, 10, size=1000000))
runs in about 14s.
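Since the question mentions a large list of DataFrames, the loop above can also be wrapped in a small helper and applied per DataFrame. This is only a sketch: the function name, the value column and the list_of_dataframes variable are assumptions for illustration, not part of the answer.

def lookback_sum(values, index_median, lookback):
    # values: the 'value' column of one DataFrame
    # index_median: rolling median of the Index series, on the same index
    result = pd.Series(0.0, index=values.index)
    for colno in range(lookback):
        shifted = values.shift(colno) - index_median
        # keep only the positive differences, as in the column-wise loop above
        result += shifted.where(shifted > 0, 0)
    # like the output above, the first lookback-1 rows come out as 0.0, not NaN
    return result

# hypothetical usage over the question's list of DataFrames
results = [lookback_sum(d['value'], df['I_median'], lookback) for d in list_of_dataframes]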
Upvotes: 2
Reputation: 1007
.loc is slow and apply is slow. It seems to me that you can achieve what you want using vectorised functions and operations over whole columns, without a row-by-row apply and .loc lookups.
It is hard to tell without a real data example, as @Manakin suggested, but I tried to re-create your question with an example and solve it according to your description.
# lookback window
lookback = 3
# Fixed Index
I = [5, 2, 1, 4, 2, 4, 1, 2, 1, 10]
# Dataframe with value column, Index added as column for convenience
df = pd.DataFrame({'I': I,
                   'value': [6, 5, 4, 3, 2, 1, 2, 3, 4, 5]},
                  index=I)
# Median over lookback window
df['I_median'] = df.I.rolling(lookback).median()
yields
|    | I  | value | I_median |
|----|-------|----------|----------|
| 5 | 5 | 6 | NaN |
| 2 | 2 | 5 | NaN |
| 1 | 1 | 4 | 2.0 |
| 4 | 4 | 3 | 2.0 |
| 2 | 2 | 2 | 2.0 |
| 4 | 4 | 1 | 4.0 |
| 1 | 1 | 2 | 2.0 |
| 2 | 2 | 3 | 2.0 |
| 1 | 1 | 4 | 1.0 |
| 10 | 10 | 5 | 2.0 |
# Check if Index is greater than median
df['I_gt'] = df.I > df.I_median
# set all rows to 0 where median is greater than index
df['filtered_val'] = df.value.where(df.I_gt, 0)
| | I | value | I_median | I_gt | filtered_val |
|----|----|-------|----------|-------|--------------|
| 5 | 5 | 6 | NaN | False | 0 |
| 2 | 2 | 5 | NaN | False | 0 |
| 1 | 1 | 4 | 2.0 | False | 0 |
| 4 | 4 | 3 | 2.0 | True | 3 |
| 2 | 2 | 2 | 2.0 | False | 0 |
| 4 | 4 | 1 | 4.0 | False | 0 |
| 1 | 1 | 2 | 2.0 | False | 0 |
| 2 | 2 | 3 | 2.0 | False | 0 |
| 1 | 1 | 4 | 1.0 | False | 0 |
| 10 | 10 | 5 | 2.0 | True | 5 |
Then simply do a rolling sum over the filtered column.
df.filtered_val.rolling(lookback).sum()
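If it helps readability, those three steps can also be chained into a single expression (the same logic as above, just condensed into my own one-liner restatement):

df.value.where(df.I > df.I.rolling(lookback).median(), 0).rolling(lookback).sum()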
Upvotes: 2
Reputation: 21
def sumvals(x):
    m = self.index_median.loc[x.index[-1]]
    condition = (x.index >= m)
    return sum(x[condition] - m)

D['value'].rolling(lookback).apply(sumvals)
When we are taking the sum of all value entries present in the lookback window, there is no need to compare them with self.index. Also, from your description, if you are taking the value row in D, then you can
return sum(x[condition])
directly instead.
Another solution is to convert the whole operation to NumPy to speed up the rolling computation. Check out the numpy_ext package for this.
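As a rough illustration of the NumPy route, here is a sketch using numpy.lib.stride_tricks.sliding_window_view (NumPy >= 1.20) rather than numpy_ext, applied to the example data from the question; this is my own restatement of the idea, not code from the answer:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

lookback = 3
idx = np.array([1, -2, 8, -10, 3, 4, 5, 10, -20, 3], dtype=float)
vals = np.array([1, 2, 2, 3, 0, 9, 10, 8, 20, 9], dtype=float)

# all rolling windows at once, shape (n - lookback + 1, lookback)
idx_win = sliding_window_view(idx, lookback)
val_win = sliding_window_view(vals, lookback)

# rolling median of the Index series, one value per window
med = np.median(idx_win, axis=1)

# sum of (value - median) over the window entries that are >= the median
diff = val_win - med[:, None]
result = np.where(diff >= 0, diff, 0).sum(axis=1)

# prepend NaNs so the result lines up with the pandas rolling output
result = np.concatenate([np.full(lookback - 1, np.nan), result])

This reproduces the expected numbers from the question's edit (NaN, NaN, 2, 13, 0, 6, 11, 12, 23, 28).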
Upvotes: 2