Riftwave
Riftwave

Reputation: 11

Pandas/Python creating new column with rows based based on condition of old column and previous row

I am trying to analyze some data in a dataframe and identify when idle times (showing up as flat spots) appear in the data. These flat spots are when the disp values go from > 520 to ~ 225. I would like to create a new column that acts as a label for these times.

This is a small, condensed excerpt of some of my data. The full dataset has about 270,000 rows and can be noisy. The start of the idle period is row 150 and the end of the idle period is row 157.

           disp      temp  
148  528.253551  0.908375
149  537.832220  0.944138
150  225.073475  0.890493
151  225.247861  0.892878
152  225.487333  0.895262
153  225.515279  0.926256
154  225.515003  0.909567
155  225.518680  0.901222
156  225.537893  0.927448
157  225.068878  0.865460
158  540.460613  0.873804
159  531.048884  0.865460

The output I would like to generate is

           disp      temp      state
148  528.253551  0.908375       None
149  537.832220  0.944138       None
150  225.073475  0.890493   idl_strt
151  225.247861  0.892878       idle
152  225.487333  0.895262       idle
153  225.515279  0.926256       idle
154  225.515003  0.909567       idle
155  225.518680  0.901222       idle
156  225.537893  0.927448       idle
157  225.068878  0.865460   idl_ends
158  540.460613  0.873804       None
159  531.048884  0.865460       None

Is there a vectorized way to do this? Can functions called via the pd.apply method look at previous rows without first creating additional shifted columns?

I have already been able to populate the proper rows of the state column with "idl_start" and "idl_ends" using the following code

def _idlefinder(self, row):
    """Logic for finding idles start and end"""
    if (row["disp"] + 250.0) < row["nextdisp"]:  # find large drops in disp value
        return "idle_strt"
    elif row["disp"] + 250 < row["lastdisp"]:    # find large increases in disp value
        return "idle_end"

df["lastdisp"] = df["disp"].shift(1)
df["nextdisp"] = df["disp"].shift(-1)
df["status"] = df.apply(_idlefinder, axis=1)  # label idle_start and idle_end
df.drop(columns=["nextdisp", "lastdisp"], inplace=True)

But this is both an inelegant and incomplete solution. The crucial part of what I am asking is how to populate the rows between "idl_strt" and "idl_ends" because remaining sections of the data (upswings and pauses with noisy readings) will have to be identified and labeled as well. My preference is not to go through the data row by row since I am using a data frame.

Previous versions of this program did parse this data row by row in lists, but became a bit verbose and involved to tweak or modify.

Upvotes: 1

Views: 43

Answers (1)

mozway
mozway

Reputation: 262214

You have multiple ways to achieve this vectorially. Here is one example that identifies the flat regions based on the values close to 225 (with a tolerance of ±2), and highlights the start/end if the previous/next value is >520. It uses shift and numpy.select:

def idle_finder(col, flat_value=225, flat_tol=2, high_thresh=520):
    m1 = col.gt(high_thresh)
    m2 = col.sub(flat_value).abs().lt(flat_tol)
    return pd.Series(np.select([m1.shift()&m2, m1.shift(-1)&m2, m2],
                               ['idl_strt', 'idl_ends', 'idle'],
                               ''),
                     index=col.index
                    ).replace('', None)
            
df['status'] = idle_finder(df['disp'])

Output:

           disp      temp    status
148  528.253551  0.908375      None
149  537.832220  0.944138      None
150  225.073475  0.890493  idl_strt
151  225.247861  0.892878      idle
152  225.487333  0.895262      idle
153  225.515279  0.926256      idle
154  225.515003  0.909567      idle
155  225.518680  0.901222      idle
156  225.537893  0.927448      idle
157  225.068878  0.865460  idl_ends
158  540.460613  0.873804      None
159  531.048884  0.865460      None

If you want to ensure that a idle period starts and ends after/before a value greater than 520, you can use an additional mask:

def idle_finder(col, flat_value=225, flat_tol=2, high_thresh=520):
    m1 = col.gt(high_thresh)
    m2 = col.sub(flat_value).abs().lt(flat_tol)
    m3 = m1.mask(m2)
    m4 = m3.ffill() & m3.bfill() & m2
    return pd.Series(np.select([m1.shift()&m4, m1.shift(-1)&m4, m4],
                               ['idl_strt', 'idl_ends', 'idle'],
                               ''),
                     index=col.index
                    ).replace('', None)
            
df['status'] = idle_finder(df['disp'])

Output:

           disp      temp    status
148  528.253551  0.908375      None
149  337.832220  0.944138      None
150  225.073475  0.890493      None
151  225.247861  0.892878      None
152  225.487333  0.895262      None
153  225.515279  0.926256      None
154  525.515003  0.909567      None
155  225.518680  0.901222  idl_strt
156  225.537893  0.927448      idle
157  225.068878  0.865460  idl_ends
158  540.460613  0.873804      None
159  531.048884  0.865460      None

Intermediates:

           disp      temp    status     m1     m2  m1.shift()&m2  m1.shift(-1)&m2     m3     m4  m1.shift()&m4  m1.shift(-1)&m4
148  528.253551  0.908375      None   True  False          False            False   True  False          False            False
149  337.832220  0.944138      None  False  False          False            False  False  False          False            False
150  225.073475  0.890493      None  False   True          False            False    NaN  False          False            False
151  225.247861  0.892878      None  False   True          False            False    NaN  False          False            False
152  225.487333  0.895262      None  False   True          False            False    NaN  False          False            False
153  225.515279  0.926256      None  False   True          False             True    NaN  False          False            False
154  525.515003  0.909567      None   True  False          False            False   True  False          False            False
155  225.518680  0.901222  idl_strt  False   True           True            False    NaN   True           True            False
156  225.537893  0.927448      idle  False   True          False            False    NaN   True          False            False
157  225.068878  0.865460  idl_ends  False   True          False             True    NaN   True          False             True
158  540.460613  0.873804      None   True  False          False            False   True  False          False            False
159  531.048884  0.865460      None   True  False          False            False   True  False          False            False

Upvotes: 0

Related Questions