Reputation: 352
I'd like some feedback on how to optimize the following pandas computation.
We have a fixed index set I and a lookback window length lookback. In addition, we have a pd.Series index, its rolling median over the lookback window, index_MEDIAN, and a large list of pandas DataFrames. All Series/DataFrames have I as their index, and each DataFrame has a column value. Let D be one such DataFrame.
For every row of D we take the corresponding value m in index_MEDIAN and sum all value entries present in the lookback window, subject to the condition that the running value in the index series is greater than m. In other words, whenever the index value is greater than the median over the lookback window, we sum the corresponding value row of D.
To shed more light, here is a sketch of the implementation described above:

def sumvals(x):
    # compare each value in the window against the median at the window's last timestamp
    S = (D['value'].loc[x.index] >= self.index_median.loc[x.index[-1]])
    # sum (value - median) over the rows that pass the comparison
    return sum(S * (x - self.index_median.loc[x.index[-1]]))

D['value'].rolling(lookback).apply(sumvals)
The list of DataFrames is quite large, and I've noticed that this way of computing the quantity takes an excessive amount of time. I suspect the issue is that this implementation uses .loc a lot. Hence: is there another way to express this computation without referencing an external Series so much?
Either way, any kind of optimization suggestion is welcome.
Edit: here is an example dataset with the respective computation.

lookback = 3
Index = pd.Series([1, -2, 8, -10, 3, 4, 5, 10, -20, 3])
Index_median = Index.rolling(lookback).median()
Values = pd.Series([1, 2, 2, 3, 0, 9, 10, 8, 20, 9])
The resulting computation on Values should yield
0 NaN
1 NaN
2 2.0
3 13.0
4 0.0
5 6.0
6 11.0
7 12.0
8 23.0
9 28.0
For example, the value in row 5 is 6. Why? The Index_median value in row 5 is 3, and the 3-lookback window of Values ending at row 5 is the sequence 3, 0, 9. The values >= 3 are 3 and 9, so the sum for row 5 is (3-3) + (9-3) = 6. Similarly, for the last row the index median is 3; the last three entries of Values (8, 20, 9) are all greater than 3 and sum to 37, giving 37 - 3*3 = 28.
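To make the rule explicit, here is a small brute-force loop that reproduces the expected output above for this example (only a correctness reference, deliberately not fast; the loop is my own restatement of the rule, not the original implementation):

import numpy as np
import pandas as pd

lookback = 3
Index = pd.Series([1, -2, 8, -10, 3, 4, 5, 10, -20, 3])
Index_median = Index.rolling(lookback).median()
Values = pd.Series([1, 2, 2, 3, 0, 9, 10, 8, 20, 9])

# for each row, compare the lookback window of Values against that row's
# rolling median of Index and sum the excesses over the median
expected = pd.Series(np.nan, index=Values.index)
for i in range(lookback - 1, len(Values)):
    m = Index_median.iloc[i]
    window = Values.iloc[i - lookback + 1 : i + 1]
    expected.iloc[i] = (window[window >= m] - m).sum()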
Upvotes: 3
Views: 495
Reputation: 1007
Starting with your sample data:

import numpy as np
import pandas as pd

lookback = 3
df = pd.DataFrame()
df['I'] = pd.Series([1, -2, 8, -10, 3, 4, 5, 10, -20, 3])
df['I_median'] = df['I'].rolling(lookback).median()
df['Values'] = pd.Series([1, 2, 2, 3, 0, 9, 10, 8, 20, 9])
Now add shifted columns based on the 'Values' column:

# add one helper column for every position in the lookback window
for colno in range(lookback):
    # shift the column down by colno rows and subtract the median
    df['n' + str(colno)] = df['Values'].shift(colno) - df['I_median']
    # remove all negative numbers (where the value is smaller than the median)
    df['n' + str(colno)] = df['n' + str(colno)].where(df['n' + str(colno)] > 0, 0)

# sum up across the new columns
df['result'] = df[df.columns[-lookback:]].sum(axis=1)
df.result contains your result and equals
0 0.0
1 0.0
2 2.0
3 13.0
4 0.0
5 6.0
6 11.0
7 12.0
8 23.0
9 28.0
Name: result, dtype: float64
If you would rather not create lookback extra columns, the same sum can be accumulated in place with a single temporary column:

df['result'] = 0
for colno in range(lookback):
    # shift the column down by colno rows and subtract the median
    df['temp'] = df['Values'].shift(colno) - df['I_median']
    # remove all negative numbers (where the value is smaller than the median)
    df['temp'] = df['temp'].where(df['temp'] > 0, 0)
    # add this shift's contribution to the running total
    df['result'] = df['result'] + df['temp']
For scale, the same approach on a larger random dataset

lookback = 1000
df = pd.DataFrame()
df['I'] = pd.Series(np.random.randint(0, 10, size=1000000))
df['I_median'] = df['I'].rolling(lookback).median()
df['Values'] = pd.Series(np.random.randint(0, 10, size=1000000))
runs in about 14s.
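Since the question mentions a large list of DataFrames, the loop above can also be wrapped in a small helper and applied per DataFrame. This is only a sketch: the function name, the value column and the list_of_dataframes variable are assumptions for illustration, not part of the answer.

def lookback_sum(values, index_median, lookback):
    # values: the 'value' column of one DataFrame
    # index_median: rolling median of the Index series, on the same index
    result = pd.Series(0.0, index=values.index)
    for colno in range(lookback):
        shifted = values.shift(colno) - index_median
        # keep only the positive differences, as in the column-wise loop above
        result += shifted.where(shifted > 0, 0)
    # like the output above, the first lookback-1 rows come out as 0.0, not NaN
    return result

# hypothetical usage over the question's list of DataFrames
results = [lookback_sum(d['value'], df['I_median'], lookback) for d in list_of_dataframes]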
Upvotes: 2
Reputation: 1007
.loc is slow and apply is slow. It seems to me that you can achieve what you want using vectorised functions and operations over whole columns, without a row-by-row apply and .loc lookups.
It is hard to tell without a real data example, as @Manakin suggested, but I tried to re-create your question with an example and solve it according to your description.
# lookback window
lookback = 3
# Fixed Index
I = [5, 2, 1, 4, 2, 4, 1, 2, 1, 10]
# Dataframe with value column, Index added as column for convenience
df = pd.DataFrame({'I': I,
                   'value': [6, 5, 4, 3, 2, 1, 2, 3, 4, 5]},
                  index=I)
# Median over lookback window
df['I_median'] = df.I.rolling(lookback).median()
yields
|    | I  | value | I_median |
|----|-------|----------|----------|
| 5 | 5 | 6 | NaN |
| 2 | 2 | 5 | NaN |
| 1 | 1 | 4 | 2.0 |
| 4 | 4 | 3 | 2.0 |
| 2 | 2 | 2 | 2.0 |
| 4 | 4 | 1 | 4.0 |
| 1 | 1 | 2 | 2.0 |
| 2 | 2 | 3 | 2.0 |
| 1 | 1 | 4 | 1.0 |
| 10 | 10 | 5 | 2.0 |
# Check if Index is greater than median
df['I_gt'] = df.I > df.I_median
# set all rows to 0 where median is greater than index
df['filtered_val'] = df.value.where(df.I_gt, 0)
| | I | value | I_median | I_gt | filtered_val |
|----|----|-------|----------|-------|--------------|
| 5 | 5 | 6 | NaN | False | 0 |
| 2 | 2 | 5 | NaN | False | 0 |
| 1 | 1 | 4 | 2.0 | False | 0 |
| 4 | 4 | 3 | 2.0 | True | 3 |
| 2 | 2 | 2 | 2.0 | False | 0 |
| 4 | 4 | 1 | 4.0 | False | 0 |
| 1 | 1 | 2 | 2.0 | False | 0 |
| 2 | 2 | 3 | 2.0 | False | 0 |
| 1 | 1 | 4 | 1.0 | False | 0 |
| 10 | 10 | 5 | 2.0 | True | 5 |
Then simply do a rolling sum over the filtered column.
df.filtered_val.rolling(lookback).sum()
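If it helps readability, those three steps can also be chained into a single expression (the same logic as above, just condensed into my own one-liner restatement):

df.value.where(df.I > df.I.rolling(lookback).median(), 0).rolling(lookback).sum()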
Upvotes: 2
Reputation: 21
def sumvals(x):
    m = self.index_median.loc[x.index[-1]]
    condition = (x.index >= m)
    return sum(x[condition] - m)

D['value'].rolling(lookback).apply(sumvals)
When we are taking the sum of all value entries present in the lookback window, there is no need to compare them with self.index. Also, from your description, if you are taking the value row in D, then you can
return sum(x[condition])
directly instead.
Another solution is to convert the whole operation to NumPy to speed up the rolling computation. Check out the numpy_ext package for this.
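As a rough illustration of the NumPy route, here is a sketch using numpy.lib.stride_tricks.sliding_window_view (NumPy >= 1.20) rather than numpy_ext, applied to the example data from the question; this is my own restatement of the idea, not code from the answer:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

lookback = 3
idx = np.array([1, -2, 8, -10, 3, 4, 5, 10, -20, 3], dtype=float)
vals = np.array([1, 2, 2, 3, 0, 9, 10, 8, 20, 9], dtype=float)

# all rolling windows at once, shape (n - lookback + 1, lookback)
idx_win = sliding_window_view(idx, lookback)
val_win = sliding_window_view(vals, lookback)

# rolling median of the Index series, one value per window
med = np.median(idx_win, axis=1)

# sum of (value - median) over the window entries that are >= the median
diff = val_win - med[:, None]
result = np.where(diff >= 0, diff, 0).sum(axis=1)

# prepend NaNs so the result lines up with the pandas rolling output
result = np.concatenate([np.full(lookback - 1, np.nan), result])

This reproduces the expected numbers from the question's edit (NaN, NaN, 2, 13, 0, 6, 11, 12, 23, 28).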
Upvotes: 2