Liam Hanninen

Reputation: 1573

Pandas Apply and Loc - efficiency and indexing

I want to find the first value after each row that meets a certain criterion. For example, I want to find the first rate/value (not necessarily in the immediately following row) after the current row that increased by 5%. The added columns would be 'first5percentIncreaseValue' and 'first5percentIncreaseIndex': the value and index of the first row (after the current row) that had a 5% increase. Notice that each first5percentIncreaseIndex is always greater than the current row's index.

      amount      date                 rate  total       type  first5percentIncreaseValue  first5percentIncreaseIndex
9248  0.05745868  2018-01-22 06:11:36  10    0.00099984  buy   10.5                        9341
9249  1.14869147  2018-01-22 06:08:38  20    0.01998989  buy   21                          9421
9250  0.16498080  2018-01-22 06:02:59  15    0.00286241  sell  15.75                       9266
9251  0.02881844  2018-01-22 06:01:54  2     0.00049999  sell  2.1                         10911

I tried using loc inside apply() for each row. The output takes at least 10 seconds for only about 9k rows. This does the job (I get a list of all values 5% higher than the given row), but is there a more efficient way to do this? Also, I'd like to get only the first such value, but when I do this I think the search starts from the first row. Is there a way to start .loc's search from the current row, so that I could just take the first value?

# Slow attempt: a nested apply scans the whole column for every row
coin_trade_history_df['rate'].apply(
    lambda y: coin_trade_history_df['rate'].loc[
        coin_trade_history_df['rate'].apply(
            lambda x: x >= y + (y * .05))])

0    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
1    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
2    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
3    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
4    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
Name: rate, dtype: object
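
For reference, here's a tiny made-up stand-in for coin_trade_history_df (same columns as the table above, values invented) that's enough to run the snippet:

import pandas as pd

coin_trade_history_df = pd.DataFrame({
    'rate': [10, 20, 15, 2],
    'date': pd.to_datetime(['2018-01-22 06:11:36',
                            '2018-01-22 06:08:38',
                            '2018-01-22 06:02:59',
                            '2018-01-22 06:01:54']),
})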

Further clarification: Peter Leimbigler said it better than I did:

Oh, I think I get it now! "For each row, scan downward and get the first row you encounter that shows an increase of at least 5%," right? I'll edit my answer :) – Peter Leimbigler

Upvotes: 2

Views: 3692

Answers (2)

Liam Hanninen

Reputation: 1573

Peter's answer was much faster, but it only looked at the immediate next row. I wanted to perform the check against every later row. Below is what I ended up with: not very fast, but it goes through each row and returns the first value (the last value positionally, since my time series is in descending date order) that satisfied my criterion (a 5% increase).

def test_rows(x):
    # Rows whose rate is at least 5% above this row's rate, restricted
    # to trades with a later timestamp
    mask = ((trade_history_df['rate'] >= x['rate'] + (x['rate'] * .05))
            & (trade_history_df['date'] > x['date']))
    # The frame is sorted newest-first, so last_valid_index() picks the
    # matching trade closest in time after the current row
    return trade_history_df['rate'].loc[mask].last_valid_index()

test1 = trade_history_df[['rate','date']].apply(test_rows, axis=1)
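
As a quick sanity check on a tiny made-up frame (newest trade first, as in my data):

import pandas as pd

trade_history_df = pd.DataFrame({
    'rate': [10, 20, 15, 2],
    'date': pd.to_datetime(['2018-01-22 06:11:36',
                            '2018-01-22 06:08:38',
                            '2018-01-22 06:02:59',
                            '2018-01-22 06:01:54']),
})
test1 = trade_history_df[['rate','date']].apply(test_rows, axis=1)
# test1[3] is 2: row 2 (rate 15) is the earliest later trade at or
# above 2 * 1.05 = 2.1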

Upvotes: 1

Peter Leimbigler

Reputation: 11105

Here's an approach to the specific task of labeling each row with the index of the next row that shows an increase of at least 5% over the row immediately before it.

# Example data
import pandas as pd

df = pd.DataFrame({'rate': [100, 105, 99, 110, 130, 120, 98]})

# Series.shift(n) moves elements n places forward = down. We use
# it here in the denominator in order to compare each change with
# the initial value, rather than the final value.
mask = df.rate.diff()/df.rate.shift() >= 0.05

# Mark each qualifying row with its own index, back-fill so every row
# carries the index of the next qualifying row at or after it, then
# shift(-1) so each row points to the first qualifying row strictly
# after it.
df.loc[mask, 'next_big_change_idx'] = df[mask].index
df.next_big_change_idx = df.next_big_change_idx.bfill().shift(-1)

# output
df
   rate  next_big_change_idx
0   100                  1.0
1   105                  3.0
2    99                  3.0
3   110                  4.0
4   130                  NaN
5   120                  NaN
6    98                  NaN
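
If instead you want the first later row that's at least 5% above the current row (as clarified in the comments), a rough NumPy sketch could look like this (function name made up, and it's still a Python-level loop, just a cheaper one):

import numpy as np
import pandas as pd

def first_big_increase(rates, pct=0.05):
    # For each position i, record the positional index of the first
    # later element that is at least pct above rates[i]; NaN if none
    arr = np.asarray(rates, dtype=float)
    out = np.full(len(arr), np.nan)
    for i in range(len(arr) - 1):
        hits = np.nonzero(arr[i + 1:] >= arr[i] * (1 + pct))[0]
        if hits.size:
            out[i] = i + 1 + hits[0]
    return pd.Series(out, index=rates.index)

df['first_big_increase_idx'] = first_big_increase(df.rate)
# Row 1 now maps to 4.0 (130 is the first value >= 105 * 1.05 = 110.25),
# whereas the diff-based mask above maps it to 3.0, since that mask
# measures consecutive changes instead.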

Upvotes: 1
