Reputation: 21
I have just started learning python and struggling with this code. I have a dataframe which looks like shown in graph.
I want to find the first occurrence of the window in the dataframe which has certain number of values greater than threshold.
For Exmaple:
Let's say dataframe is 1000000 values in dimension. I want divide this in sliding window of 1000 and need to know whether this 1000 values alteast have 10 values which are greater than certain threshold. If first window (point 0-999) does not have the atleast 10 values greater than certain threshold, window will slide and consider values 1-1000. I have to find the index of the first occurrence of window which has atleast 10 values greater than threshold.
Also as I am dealing here streaming data, I need to stop the search when such window in dataframe occurs.
I tried this code but getting key error and not able to solve the problem.
for i in np.arange(0,len(data)-999):
for j in np.arange(0,1000):
if data[i+j]>threshold:
var_count=var_count+1
if var_count>10:
print("Anomaly has occurred")
Sample data looks like this which has around 1.8 million rows.
Sample data could looke like this
data_sample=[1,1,0,0,0,2,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,1,0,0,2,2,2,2,1,1,1]
data_sample=pd.DataFrame(data_sample)
threshold=1
window=5
in that I need at least 2 values which are greater than 1
which would return the index 18 as at that index my window of length 5 has at least 2 values which are greater than 1.
Upvotes: 2
Views: 760
Reputation: 45752
You can do it with convolution:
threshold = 10
window_size = 5
count_threshold = 3
kernel = np.ones(window_size)
over_threshold = (data['relevant_column'] > threshold).values
running_count = np.convolve(kernel, over_threshold)
np.nonzero(running_count >= count_threshold)[0]
Or a similar idea using pandas rolling:
np.where(((data['relevant_column'] > threshold).rolling(window_size).sum() >= count_threshold))
Upvotes: 3