Bhakti
Bhakti

Reputation: 21

How to find index of the window which has certain number of values greater than threshold?

enter image description here

I have just started learning python and struggling with this code. I have a dataframe which looks like shown in graph.

I want to find the first occurrence of the window in the dataframe which has certain number of values greater than threshold.

For Exmaple:

Let's say dataframe is 1000000 values in dimension. I want divide this in sliding window of 1000 and need to know whether this 1000 values alteast have 10 values which are greater than certain threshold. If first window (point 0-999) does not have the atleast 10 values greater than certain threshold, window will slide and consider values 1-1000. I have to find the index of the first occurrence of window which has atleast 10 values greater than threshold.

Also as I am dealing here streaming data, I need to stop the search when such window in dataframe occurs.

I tried this code but getting key error and not able to solve the problem.

for i in np.arange(0,len(data)-999):
    for j in np.arange(0,1000):
        if data[i+j]>threshold:
            var_count=var_count+1
        if var_count>10:
            print("Anomaly has occurred")

Sample data looks like this which has around 1.8 million rows.

enter image description here

Sample data could looke like this

data_sample=[1,1,0,0,0,2,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,1,0,0,2,2,2,2,1,1,1]            
data_sample=pd.DataFrame(data_sample)

threshold=1
window=5

in that I need at least 2 values which are greater than 1 which would return the index 18 as at that index my window of length 5 has at least 2 values which are greater than 1.

Upvotes: 2

Views: 760

Answers (1)

Dan
Dan

Reputation: 45752

You can do it with convolution:

threshold = 10
window_size = 5
count_threshold = 3

kernel = np.ones(window_size)
over_threshold = (data['relevant_column'] > threshold).values
running_count = np.convolve(kernel, over_threshold)
np.nonzero(running_count >= count_threshold)[0]

Or a similar idea using pandas rolling:

np.where(((data['relevant_column'] > threshold).rolling(window_size).sum() >= count_threshold))

Upvotes: 3

Related Questions