TomCho
TomCho

Reputation: 3507

Change value if consecutive number of certain condition is achieved in Pandas

I would to change the value of certain DataFrame values only if a certain condition is met an n number of consecutive times.

Example:

df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12,0]=-40
df.iloc[10:12,1]=-40

Which gives me this DF:

            0          1         2
0    1.238892   0.802318 -0.013856
1   -1.136326  -0.527263 -0.260975
2    1.118771   0.031517  0.527350
3    1.629482  -0.158941 -1.045209
4   40.000000   0.598657 -1.268399
5   40.000000   0.442297 -0.016363
6   40.000000  -0.316817  1.744822
7   40.000000   0.193083  0.914172
8    0.322756  -0.680682  0.888702
9   -1.204531  -0.240042  1.416020
10  -1.337494 -40.000000 -1.195780
11  -0.703669 -40.000000  0.657519
12 -40.000000  -0.288235 -0.840145
13  -1.084869  -0.298030 -1.592004
14  -0.617568  -1.046210 -0.531523

Now, if I do

a=df.copy()
a[ abs(a) > abs(a.std()) ] = float('nan')

I get

           0         1         2
0   1.238892  0.802318 -0.013856
1  -1.136326 -0.527263 -0.260975
2   1.118771  0.031517  0.527350
3   1.629482 -0.158941       NaN
4        NaN  0.598657       NaN
5        NaN  0.442297 -0.016363
6        NaN -0.316817       NaN
7        NaN  0.193083  0.914172
8   0.322756 -0.680682  0.888702
9  -1.204531 -0.240042       NaN
10 -1.337494       NaN       NaN
11 -0.703669       NaN  0.657519
12       NaN -0.288235 -0.840145
13 -1.084869 -0.298030       NaN
14 -0.617568 -1.046210 -0.531523

which is fair. However, I would like only to replace the values with NaN if these conditions were met by a maximum of 2 consecutive entries (so I can interpolate later). For example, I wanted the result to be

            0          1         2
0    1.238892   0.802318 -0.013856
1   -1.136326  -0.527263 -0.260975
2    1.118771   0.031517  0.527350
3    1.629482  -0.158941       NaN
4   40.000000   0.598657       NaN
5   40.000000   0.442297 -0.016363
6   40.000000  -0.316817       NaN
7   40.000000   0.193083  0.914172
8    0.322756  -0.680682  0.888702
9   -1.204531  -0.240042       NaN
10  -1.337494        NaN       NaN
11  -0.703669        NaN  0.657519
12        NaN  -0.288235 -0.840145
13  -1.084869  -0.298030       NaN
14  -0.617568  -1.046210 -0.531523

Apparently there's no ready-to-use method to do this. The solution I found that closest resembles my problem was this one, but I couldn't make it work for me.

Any ideas?

Upvotes: 2

Views: 1631

Answers (1)

chrisb
chrisb

Reputation: 52236

See below - the tricky part is (cond[c] != cond[c].shift(1)).cumsum() which breaks the data into contiguous runs of the same value.

In [23]: cond = abs(df) > abs(df.std())

In [24]: for c in df.columns:
    ...:     grouper = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
    ...:     fill = (df.groupby(grouper)[c].transform('size') <= 2)
    ...:     df.loc[fill, c] = np.nan

In [25]: df
Out[25]: 
            0         1         2
0    1.238892  0.802318 -0.013856
1   -1.136326 -0.527263 -0.260975
2    1.118771  0.031517  0.527350
3    1.629482 -0.158941       NaN
4   40.000000  0.598657       NaN
5   40.000000  0.442297 -0.016363
6   40.000000 -0.316817       NaN
7   40.000000  0.193083  0.914172
8    0.322756 -0.680682  0.888702
9   -1.204531 -0.240042       NaN
10  -1.337494       NaN       NaN
11  -0.703669       NaN  0.657519
12        NaN -0.288235 -0.840145
13  -1.084869 -0.298030       NaN
14  -0.617568 -1.046210 -0.531523

To explain a bit more, cond[c] is a boolean series indicating whether your condition is true or not.

The cond[c] != cond[c].shift(1) compares the current row's condition to the next row's. This has the effecting of 'marking' where a run of values begins with the value True.

The .cumsum() converts the bools to integers and takes the cumulative sum. It may not be immediately intuitive, but this 'numbers' the groups of contiguous values. Finally the * cond[c] reassigns all groups that didn't meet the criteria to 0 (using False == 0)

So now you have groups of contiguous numbers that meet your condition, the next step performs a groupby to count how many values are in each group (transform('size').

Finally a new bool condition is used to assign missing values to those groups with 2 or less values meeting the condition.

Upvotes: 3

Related Questions