Reputation: 11
I am trying to analyze some data in a dataframe and identify when idle times (showing up as flat spots) appear in the data. These flat spots are when the disp values go from > 520 to ~ 225. I would like to create a new column that acts as a label for these times.
This is a small, condensed excerpt of some of my data. The full dataset has about 270,000 rows and can be noisy. The start of the idle period is row 150 and the end of the idle period is row 157.
disp temp
148 528.253551 0.908375
149 537.832220 0.944138
150 225.073475 0.890493
151 225.247861 0.892878
152 225.487333 0.895262
153 225.515279 0.926256
154 225.515003 0.909567
155 225.518680 0.901222
156 225.537893 0.927448
157 225.068878 0.865460
158 540.460613 0.873804
159 531.048884 0.865460
The output I would like to generate is
disp temp state
148 528.253551 0.908375 None
149 537.832220 0.944138 None
150 225.073475 0.890493 idl_strt
151 225.247861 0.892878 idle
152 225.487333 0.895262 idle
153 225.515279 0.926256 idle
154 225.515003 0.909567 idle
155 225.518680 0.901222 idle
156 225.537893 0.927448 idle
157 225.068878 0.865460 idl_ends
158 540.460613 0.873804 None
159 531.048884 0.865460 None
Is there a vectorized way to do this? Can functions called via the pd.apply method look at previous rows without first creating additional shifted columns?
I have already been able to populate the proper rows of the state column with "idl_start" and "idl_ends" using the following code
def _idlefinder(self, row):
"""Logic for finding idles start and end"""
if (row["disp"] + 250.0) < row["nextdisp"]: # find large drops in disp value
return "idle_strt"
elif row["disp"] + 250 < row["lastdisp"]: # find large increases in disp value
return "idle_end"
df["lastdisp"] = df["disp"].shift(1)
df["nextdisp"] = df["disp"].shift(-1)
df["status"] = df.apply(_idlefinder, axis=1) # label idle_start and idle_end
df.drop(columns=["nextdisp", "lastdisp"], inplace=True)
But this is both an inelegant and incomplete solution. The crucial part of what I am asking is how to populate the rows between "idl_strt" and "idl_ends" because remaining sections of the data (upswings and pauses with noisy readings) will have to be identified and labeled as well. My preference is not to go through the data row by row since I am using a data frame.
Previous versions of this program did parse this data row by row in lists, but became a bit verbose and involved to tweak or modify.
Upvotes: 1
Views: 43
Reputation: 262214
You have multiple ways to achieve this vectorially. Here is one example that identifies the flat regions based on the values close to 225 (with a tolerance of ±2), and highlights the start/end if the previous/next value is >520. It uses shift
and numpy.select
:
def idle_finder(col, flat_value=225, flat_tol=2, high_thresh=520):
m1 = col.gt(high_thresh)
m2 = col.sub(flat_value).abs().lt(flat_tol)
return pd.Series(np.select([m1.shift()&m2, m1.shift(-1)&m2, m2],
['idl_strt', 'idl_ends', 'idle'],
''),
index=col.index
).replace('', None)
df['status'] = idle_finder(df['disp'])
Output:
disp temp status
148 528.253551 0.908375 None
149 537.832220 0.944138 None
150 225.073475 0.890493 idl_strt
151 225.247861 0.892878 idle
152 225.487333 0.895262 idle
153 225.515279 0.926256 idle
154 225.515003 0.909567 idle
155 225.518680 0.901222 idle
156 225.537893 0.927448 idle
157 225.068878 0.865460 idl_ends
158 540.460613 0.873804 None
159 531.048884 0.865460 None
If you want to ensure that a idle period starts and ends after/before a value greater than 520, you can use an additional mask:
def idle_finder(col, flat_value=225, flat_tol=2, high_thresh=520):
m1 = col.gt(high_thresh)
m2 = col.sub(flat_value).abs().lt(flat_tol)
m3 = m1.mask(m2)
m4 = m3.ffill() & m3.bfill() & m2
return pd.Series(np.select([m1.shift()&m4, m1.shift(-1)&m4, m4],
['idl_strt', 'idl_ends', 'idle'],
''),
index=col.index
).replace('', None)
df['status'] = idle_finder(df['disp'])
Output:
disp temp status
148 528.253551 0.908375 None
149 337.832220 0.944138 None
150 225.073475 0.890493 None
151 225.247861 0.892878 None
152 225.487333 0.895262 None
153 225.515279 0.926256 None
154 525.515003 0.909567 None
155 225.518680 0.901222 idl_strt
156 225.537893 0.927448 idle
157 225.068878 0.865460 idl_ends
158 540.460613 0.873804 None
159 531.048884 0.865460 None
Intermediates:
disp temp status m1 m2 m1.shift()&m2 m1.shift(-1)&m2 m3 m4 m1.shift()&m4 m1.shift(-1)&m4
148 528.253551 0.908375 None True False False False True False False False
149 337.832220 0.944138 None False False False False False False False False
150 225.073475 0.890493 None False True False False NaN False False False
151 225.247861 0.892878 None False True False False NaN False False False
152 225.487333 0.895262 None False True False False NaN False False False
153 225.515279 0.926256 None False True False True NaN False False False
154 525.515003 0.909567 None True False False False True False False False
155 225.518680 0.901222 idl_strt False True True False NaN True True False
156 225.537893 0.927448 idle False True False False NaN True False False
157 225.068878 0.865460 idl_ends False True False True NaN True False True
158 540.460613 0.873804 None True False False False True False False False
159 531.048884 0.865460 None True False False False True False False False
Upvotes: 0