Reputation: 3892
In a time series (ordered tuples), what's the most efficient way to find the first time a criterion is met?
In particular, what's the most efficient way to determine when a column's value first goes over 100 in a pandas DataFrame?
I was hoping for a clever vectorized solution rather than having to use df.iterrows().
For example, for price or count data, when a value exceeds 100, i.e. df['col'] > 100.
price
date
2005-01-01 98
2005-01-02 99
2005-01-03 100
2005-01-04 99
2005-01-05 98
2005-01-06 100
2005-01-07 100
2005-01-08 98
but for potentially very large series. Is it better to iterate (slow) or is there a vectorized solution?
A df.iterrows() solution could be:
def first_breakpoint(df, value_to_check):
    # iterrows() yields (index, row) pairs, not (row, index)
    for ind, row in df.iterrows():
        if row['col'] > value_to_check:
            return ind  # index label of the first row meeting the criterion
    return None
But my question is more about efficiency (potentially, a vectorized solution that will scale well).
Upvotes: 17
Views: 53133
Reputation: 154
NumPy's argmax returns the position of the first True value in a boolean array. NumPy is a pandas dependency, so it is available to you directly :).
import numpy as np

df.iloc[np.argmax(df['col'] > 100)]
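A caveat worth noting, not stated in the original answer: if no value satisfies the condition, argmax of an all-False array returns 0, which silently points at the first row. A minimal sketch of a guard, assuming the question's column name and example values:
import numpy as np
import pandas as pd

# Toy frame mirroring the question's example values (assumed for illustration)
df = pd.DataFrame({'col': [98, 99, 100, 99, 98, 100, 100, 98]})

mask = (df['col'] > 100).to_numpy()  # all False here: no value exceeds 100
# Check for a match before trusting argmax, since argmax of all-False is 0
first_row = df.iloc[np.argmax(mask)] if mask.any() else None
print(first_row)  # None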
Upvotes: 2
Reputation: 121
This will return the index of the first occurrence of the value 100 in the series:
index_value = (df['col'] - 100).apply(abs).idxmin()
If there is no value exactly equal to 100, it returns the index of the first value closest to 100 (idxmin takes the first occurrence of the minimum).
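For concreteness, a small usage sketch of this approach, assuming the question's example values live in a column named col:
import pandas as pd

df = pd.DataFrame({'col': [98, 99, 100, 99, 98, 100, 100, 98]})

# idxmin returns the first index label of the minimum, so this picks the
# first value whose absolute distance from 100 is smallest
index_value = (df['col'] - 100).apply(abs).idxmin()
print(index_value)  # 2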
Upvotes: 5
Reputation: 25669
Try this to get the first index row where price is greater than 99:
df[df['price'].gt(99)].index[0]
returns 2, the label of the third row (the first row where the price exceeds 99).
All row indexes where price is greater than 99:
df[df['price'].gt(99)].index
Int64Index([2, 5, 6], dtype='int64')
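One note on the no-match case: .index[0] raises an IndexError when the filter comes back empty, so a defensive sketch (assuming the question's example frame) might look like this:
import pandas as pd

df = pd.DataFrame({'price': [98, 99, 100, 99, 98, 100, 100, 98]})

hits = df[df['price'].gt(99)].index
# Guard against an empty result before taking the first label
first_hit = hits[0] if len(hits) else None
print(first_hit)  # 2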
Upvotes: 29