Skywalker
Skywalker

Reputation: 25

Efficiency: Check if value in Pandas DataFrame has changed with an specific threshold outside an interval of measurement

I have a column which a sensor recorded.
This Data has some noise on it, so the values are not exactly the same for each point of time while nothing was detected.
I want to split that recorded DataFrame into new DataFrames only containing the "interesting" Data (with values bigger than a certain threshold of column 'B', in this example bigger than 5).
In this example 'A' represents a timestamp, and 'B' represents the sensor data, with noise.
The desired outcome of this example would be two DataFrames. One with the rows from 5 to 6, the other one with the rows from 10 to 15.

A normal loop over the DataFrame is very time consuming, as the DataFrame has ~24mio rows. Is there a efficient way to deal with such an issue in pandas or similar?

Example:


# |  A  |  B  
--+-----+-----
1 |  1  |  0.10
2 |  2  |  0.11
3 |  3  |  0.09
4 |  4  |  0.12
5 |  5  |  5.24
6 |  6  |  6.33
7 |  7  |  0.08
8 |  8  |  0.09
9 |  9  |  0.10
10|  10 |  7.54
11|  11 |  8.33
12|  12 |  9.03
13|  13 |  1.43
14|  14 |  9.64
15|  15 |  9.03
16|  16 |  0.43
17|  17 |  0.53
18|  18 |  0.62
19|  19 |  0.73
20|  20 |  0.51

It can occur, that in between the "interesting interval" a value below the threshold occurs. A indicator of an ended interval would be that 1000 values in a row are below the threshold.

Thank you!

Upvotes: 0

Views: 601

Answers (1)

Erfan
Erfan

Reputation: 42886

Here's a solution which is generalisable and tries to catch edge cases:

# all rows where B > 5
mask1 = df['B'].gt(5)

# all rows where Bt-1 > 5 & Bt+1 > 5
mask2 = df['B'].shift().gt(5) & df['B'].shift(-1).gt(5)

# all rows where mask1 OR mask2 is True
mask3 = (mask1 | mask2)

# turn rows where mask 3 is False to NaN
mask4 = mask3.astype(int).diff().eq(1).cumsum().where(mask3)

# put each group of turned on sensor into a different dataframe
dfs = [dfg.reset_index(drop=True) for _, dfg in df.groupby(mask4)]

Output

for d in dfs:
    print(d, '\n')

   A    B
4  5 5.24
5  6 6.33 

     A    B
9   10 7.54
10  11 8.33
11  12 9.03
12  13 1.43
13  14 9.64
14  15 9.03 

Or in a function:

def split_turn_on_off(dataframe):
    mask1 = dataframe['B'].gt(5)
    mask2 = dataframe['B'].shift().gt(5) & dataframe['B'].shift(-1).gt(5)
    mask3 = (mask1 | mask2)
    mask4 = mask3.astype(int).diff().eq(1).cumsum().where(mask3)

    # put each group of turned on sensor into a different dataframe
    dataframes = [dataframeg.reset_index(drop=True) for _, dataframeg in dataframe.groupby(mask4)]

    return dataframes

Upvotes: 1

Related Questions