Reputation: 25
I have a column which a sensor recorded.
This Data has some noise on it, so the values are not exactly the same for each point of time while nothing was detected.
I want to split that recorded DataFrame into new DataFrames only containing the "interesting" Data (with values bigger than a certain threshold of column 'B', in this example bigger than 5).
In this example 'A' represents a timestamp, and 'B' represents the sensor data, with noise.
The desired outcome of this example would be two DataFrames. One with the rows from 5 to 6, the other one with the rows from 10 to 15.
A normal loop over the DataFrame is very time consuming, as the DataFrame has ~24mio rows. Is there a efficient way to deal with such an issue in pandas or similar?
Example:
# | A | B
--+-----+-----
1 | 1 | 0.10
2 | 2 | 0.11
3 | 3 | 0.09
4 | 4 | 0.12
5 | 5 | 5.24
6 | 6 | 6.33
7 | 7 | 0.08
8 | 8 | 0.09
9 | 9 | 0.10
10| 10 | 7.54
11| 11 | 8.33
12| 12 | 9.03
13| 13 | 1.43
14| 14 | 9.64
15| 15 | 9.03
16| 16 | 0.43
17| 17 | 0.53
18| 18 | 0.62
19| 19 | 0.73
20| 20 | 0.51
It can occur, that in between the "interesting interval" a value below the threshold occurs. A indicator of an ended interval would be that 1000 values in a row are below the threshold.
Thank you!
Upvotes: 0
Views: 601
Reputation: 42886
Here's a solution which is generalisable and tries to catch edge cases:
# all rows where B > 5
mask1 = df['B'].gt(5)
# all rows where Bt-1 > 5 & Bt+1 > 5
mask2 = df['B'].shift().gt(5) & df['B'].shift(-1).gt(5)
# all rows where mask1 OR mask2 is True
mask3 = (mask1 | mask2)
# turn rows where mask 3 is False to NaN
mask4 = mask3.astype(int).diff().eq(1).cumsum().where(mask3)
# put each group of turned on sensor into a different dataframe
dfs = [dfg.reset_index(drop=True) for _, dfg in df.groupby(mask4)]
Output
for d in dfs:
print(d, '\n')
A B
4 5 5.24
5 6 6.33
A B
9 10 7.54
10 11 8.33
11 12 9.03
12 13 1.43
13 14 9.64
14 15 9.03
Or in a function:
def split_turn_on_off(dataframe):
mask1 = dataframe['B'].gt(5)
mask2 = dataframe['B'].shift().gt(5) & dataframe['B'].shift(-1).gt(5)
mask3 = (mask1 | mask2)
mask4 = mask3.astype(int).diff().eq(1).cumsum().where(mask3)
# put each group of turned on sensor into a different dataframe
dataframes = [dataframeg.reset_index(drop=True) for _, dataframeg in dataframe.groupby(mask4)]
return dataframes
Upvotes: 1