Reputation: 359
I am trying to create a program that will select rows in a large time series dataframe and return only the rows where the slope (row2-row1)/(Time2-Time1) is >= the slope of the previous two rows (row1-row0)/(Time1-Time0). I have gone through some very heavy loop operations only to meet the extent for loops.
Time A B C
1.731806 NA NA 6.715383
1.794554 NA 6.899219 6.952729
1.79434 6.689599 6.903747 6.860664
1.928432 6.962717 7.126891 7.115988
2.000463 7.226573 7.330078 7.362645
2.090469 7.444833 7.632159 7.575328
2.175347 7.709757 7.836567 7.663173
2.765631 8.904291 8.970686 8.894602
2.905104 9.018817 8.997333 8.97221
2.980718 8.796641 8.793764 8.637462
3.058576 9.050172 9.118444 9.174402
I am hoping there is a simple vectorized approach to this but I would also like to take this a step further, where I would like to return values where the slope (row3-row2)/(Time3-Time2) >= the average of the two previous slopes between rows ([(row2-row1)/(Time2-Time1)]+[row1-row0)/(Time1-Time0)])/2 OR just (row2-row0)/(Time2-Time0).
The data frame is also very large so I am hoping to iterate this function across many columns (A, B, C...Z), which is why I'm hoping for a vectorized approach, I don't care all that much about the speed of the whole operation.
I would really appreciate help to any part of the question. It is to be used to to omit data for microbial growth that is entering the stationary phase (when the slope dies down) in very large data sets. FYI data points that are NA exist because of a row conditional I created to filter out the lag phase of microbial growth.
Thank you very very much!!!
Upvotes: 2
Views: 2166
Reputation: 294228
Let's start by calculating the dataframe of slopes:
slopes = df.iloc[:, 1:].diff().div(df.Time.diff(), axis=0)
This uses diff
or difference of every row with the one prior and divides each column that isn't the first by the first.
And we might as well calculate the three period slopes while we're at it.
three_period_slopes = df.iloc[:, 1:].diff(2).div(df.Time.diff(2), axis=0)
Now, you mentioned doing this for each column and that you wanted it vectorized. So, were going to calculate the dataframe of truth values that you can use later to slice.
gt_prior_slope = slopes >= slopes.shift()
gt_prior_2_slope_avg = slopes >= slopes.rolling(2).mean().shift()
gt_prior_3_period_slope = slopes >= three_period_slopes.shift()
We use them like this:
df.C[gt_prior_slope.C]
2 6.860664
4 7.362645
7 8.894602
10 9.174402
Name: C, dtype: float64
Upvotes: 2