RTM

Reputation: 789

How to filter pandas series values based on a condition

I have a pandas Series, pd.Series([-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1]). How can I convert it into pd.Series([-1, 0, 0, 0, -5, -5, 0, 0, 0, -1])?

The filter condition is: if a streak contains three or more consecutive -1s, keep the first occurrence and discard the rest.

Since the first streak of -1s has length 3, we keep the first -1 and discard the rest. After the first 3 values, the streak breaks (since the value is now 0). Similarly, the last streak of -1s has length 4, so we keep the first -1 and discard the rest.

The filter applies only to -1; the -5 values should be left as is.

Thanks

PS: I thought about groupby, but I don't think it honors streaks in the way I described above.

Upvotes: 2

Views: 1706

Answers (4)

Andy L.

Reputation: 25269

Create a boolean mask m that marks the positions where the value changes. Group s by m.cumsum() and use transform to flag the groups that contain fewer than three -1s; assign that to mask m1. Taking the cumsum of m | m1 gives every row its own number, except for the rows of a streak of three or more -1s, which all share one number. Finally, use duplicated to keep only the first row for each number.

m = s.diff().ne(0)
m1 = s.groupby(m.cumsum()).transform(lambda x: x.eq(-1).sum() < 3)
m2 = ~((m | m1).cumsum().duplicated())
s[m2]

Step by step:
I modified your sample to include a run of only two consecutive -1s, which should be kept as is.

s
Out[148]:
0    -1
1    -1
2    -1
3     0
4    -1
5    -1
6     0
7     0
8    -5
9    -5
10    0
11    0
12    0
13   -1
14   -1
15   -1
16   -1
dtype: int64

m = s.diff().ne(0)

Out[150]:
0      True
1     False
2     False
3      True
4      True
5     False
6      True
7     False
8      True
9     False
10     True
11    False
12    False
13     True
14    False
15    False
16    False
dtype: bool

m1 = s.groupby(m.cumsum()).transform(lambda x: x.eq(-1).sum() < 3)

Out[152]:
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13    False
14    False
15    False
16    False
dtype: bool

m2 = ~((m | m1).cumsum().duplicated())

Out[159]:
0      True
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14    False
15    False
16    False
dtype: bool

In [168]: s[m2]
Out[168]:
0    -1
3     0
4    -1
5    -1
6     0
7     0
8    -5
9    -5
10    0
11    0
12    0
13   -1
dtype: int64
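
If it helps, the steps above can be wrapped into one function. This is only a sketch: the function name and the `value` / `min_len` parameters are my own additions and are not part of the answer above, and the `.astype(bool)` is a precaution in case `transform` hands back a non-boolean dtype on some pandas versions.

```python
import pandas as pd

def drop_long_streak_repeats(s, value=-1, min_len=3):
    # Hypothetical wrapper around the m / m1 / m2 steps shown above.
    # m: True at the first row of every run of equal consecutive values
    m = s.diff().ne(0)
    # m1: True for every row whose run holds fewer than min_len copies of value
    m1 = (
        s.groupby(m.cumsum())
         .transform(lambda x: x.eq(value).sum() < min_len)
         .astype(bool)
    )
    # Rows of a long streak share one cumsum id; keep the first row per id
    m2 = ~(m | m1).cumsum().duplicated()
    return s[m2]

s = pd.Series([-1, -1, -1, 0, -1, -1, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1])
print(drop_long_streak_repeats(s).tolist())
# [-1, 0, -1, -1, 0, 0, -5, -5, 0, 0, 0, -1]
```

The original index labels are preserved, so the result lines up with Out[168] above.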

Upvotes: 0

Divakar

Reputation: 221774

With some SciPy tools -

import numpy as np
from scipy.ndimage import binary_opening, binary_erosion

def keep_first_neg1s(s, W=3):
    k1 = np.ones(W,dtype=bool)
    k2 = np.ones(2,dtype=bool)
    m = s==-1
    return s[~binary_erosion(binary_opening(m,k1),k2) | ~m]

A simpler one and hopefully more performant too -

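def keep_first_neg1s_v2(s, W=3):
    # True for every element of a -1 run that is at least W long
    m1 = binary_opening(s==-1, np.ones(W,dtype=bool))
    # Drop an element only if it and its predecessor are both in such a run
    return s[~(m1 & np.r_[False,m1[:-1]])]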

Runs on given sample s -

# Using .tolist() simply for better visualization
In [47]: s.tolist()
Out[47]: [-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1]

In [48]: keep_first_neg1s(s,W=3).tolist()
Out[48]: [-1, 0, 0, 0, -5, -5, 0, 0, 0, -1]

In [49]: keep_first_neg1s(s,W=4).tolist()
Out[49]: [-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
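
If SciPy is not available, roughly the same idea can be sketched with plain NumPy by locating the start and stop of each run of -1. This is a swapped-in run-length approach, not the morphology method above, and the function name is made up for illustration.

```python
import numpy as np
import pandas as pd

def keep_first_neg1s_np(s, W=3):
    # Hypothetical NumPy-only variant (not the SciPy method above).
    is_neg1 = s.to_numpy() == -1
    # Pad with False so every run of -1 has both a start and a stop edge
    padded = np.r_[False, is_neg1, False]
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[::2], edges[1::2]   # each run covers [start, stop)
    keep = np.ones(len(s), dtype=bool)
    for start, stop in zip(starts, stops):
        if stop - start >= W:
            keep[start + 1:stop] = False      # keep only the first -1 of the run
    return s[keep]

s = pd.Series([-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1])
print(keep_first_neg1s_np(s, W=3).tolist())
# [-1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
print(keep_first_neg1s_np(s, W=4).tolist())
# [-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
```

The loop runs once per streak rather than once per element, so it should stay cheap unless the series has a very large number of separate streaks.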

Upvotes: 2

rafaelc

Reputation: 59304

IIUC, pandas masking and groupby:

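def remove_streaks(T):
    '''Collapse each streak of at least T consecutive -1s in the
    global Series s to its first element. T is the threshold.
    '''
    # Rows in the same run of -1 share a key; every other row gets its own key
    g = s.groupby(s.diff().ne(0).cumsum() + s.ne(-1).cumsum())
    # Rows of a long streak share a mask value; all other rows are unique
    mask = g.transform('size').lt(T).cumsum() + s.diff().ne(0).cumsum()

    return s.groupby(mask).first()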

>>> remove_streaks(4).tolist()
[-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1]

>>> remove_streaks(3).tolist()
[-1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
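
Since the function above reads `s` from the enclosing scope, here is a self-contained sketch of the same grouping idea that takes the series as an argument. The name `remove_streaks_in` and the `s` / `value` parameters are hypothetical; the logic is meant to mirror the two groupby steps above.

```python
import pandas as pd

def remove_streaks_in(s, T, value=-1):
    # Hypothetical parameterized variant of remove_streaks above.
    run_id = s.diff().ne(0).cumsum()
    # Rows in the same run of `value` share a key; other rows get unique keys
    key = run_id + s.ne(value).cumsum()
    # True for rows that are NOT part of a streak of at least T
    not_long = s.groupby(key).transform('size').lt(T)
    # Rows of a long streak collapse onto one mask value
    mask = not_long.cumsum() + run_id
    return s.groupby(mask).first().tolist()

s = pd.Series([-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1])
print(remove_streaks_in(s, 4))
# [-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
print(remove_streaks_in(s, 3))
# [-1, 0, 0, 0, -5, -5, 0, 0, 0, -1]
```

Note that groupby(...).first() replaces the original index with the mask values, which is why the sketch returns a plain list.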

Upvotes: 1

RomanPerekhrest

Reputation: 92904

With conditional mask:

In [43]: s = pd.Series([-1, -1, -1, 0, 0, 0, -5, -5, 0, 0, 0, -1, -1, -1, -1])

In [44]: m = (s.diff() == 0) & (s.eq(-1))

In [45]: s[~m]
Out[45]: 
0    -1
3     0
4     0
5     0
6    -5
7    -5
8     0
9     0
10    0
11   -1
dtype: int64
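
One thing to be aware of: the mask above does not look at streak length, so a run of only two -1s is collapsed as well. If the threshold of 3 from the question matters, a possible extension is sketched below. The run-length condition and the names `run_id`, `run_len` and `m3` are my own assumptions, not part of the answer above.

```python
import pandas as pd

s = pd.Series([0, -1, -1, 0, -1, -1, -1])

# Mask from the answer: collapses every run of -1, whatever its length
m = (s.diff() == 0) & s.eq(-1)
short_too = s[~m].tolist()
print(short_too)    # [0, -1, 0, -1]

# Assumed extension: only collapse runs of at least 3 elements
run_id = s.diff().ne(0).cumsum()                 # label each run of equal values
run_len = s.groupby(run_id).transform('size')    # length of the run each row is in
m3 = m & run_len.ge(3)
long_only = s[~m3].tolist()
print(long_only)    # [0, -1, -1, 0, -1]
```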

Upvotes: 2
