MattM

Reputation: 337

Replacing Values in a Column on the Basis of n Consecutive Entries

I have a column of data with values of 0, 1, and 2. I would like to replace all instances of 1 and 2 with 0 if those occurrences do not persist for n rows. If the values do persist for n rows, I would like to leave them as-is. Example, where n = 4:

import pandas as pd

df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0

The intended functionality is essentially to "filter" the underlying data for brief changes in values. I'd like to be able to declare the number of consecutive values, n, required to keep the underlying data, including every value in a qualifying run, both before and beyond n. (If n = 4 and there are 6 consecutive values, I want all 6, not just the 2 that persisted beyond the cutoff of 4.) Is there a vectorized way of doing this in Pandas?
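For reference, the intended behavior can be spelled out with a plain loop, as a minimal non-vectorized sketch to pin down the semantics (the function name filter_short_runs is introduced here only for illustration):

import pandas as pd

def filter_short_runs(s, n):
    # zero out every run of identical values shorter than n rows
    out = s.copy()
    start = 0
    for i in range(1, len(s) + 1):
        # a run ends when the value changes or the series ends
        if i == len(s) or s.iloc[i] != s.iloc[start]:
            if i - start < n:
                out.iloc[start:i] = 0
            start = i
    return out

Applied with n = 4 to the data column above, this reproduces the desired column; the answers below achieve the same result without the Python-level loop.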

Upvotes: 4

Views: 60

Answers (3)

Erfan

Reputation: 42906

Take the cumulative sum over the points where the difference is nonzero to label each run of consecutive equal values, then get the size of each group:

n = 4
groups = df['data'].diff().ne(0).cumsum()
df['desired'] = df['data'].where(df.groupby(groups)['data'].transform('size').ge(n), other=0)

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
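To see why this works (shown here on the sample df as an illustration): the cumsum assigns one label per run of identical values, and transform('size') broadcasts each run's length back onto its rows. Note that row 0's NaN diff counts as a change, so the labels start at 1.

groups = df['data'].diff().ne(0).cumsum()
groups.tolist()
# [1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 8, 9, 10, 11, 12, 13, 14]
df.groupby(groups)['data'].transform('size').tolist()
# [1, 1, 1, 1, 3, 3, 3, 1, 1, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1]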

Timings:

# create a sample dataframe of ~1 million rows (1,050,000)
df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
dfbig = pd.concat([df]*50000, ignore_index=True)
dfbig.shape

(1050000, 1)

Erfan

%%timeit
n = 4

groups = dfbig['data'].diff().ne(0).cumsum()
dfbig['data'].where(dfbig.groupby(groups)['data'].transform('size').ge(n), other=0)

268 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Quang Hoang

%%timeit
n=4
s = dfbig['data'].diff().eq(0).rolling(n-1).sum()

# fill
np.where(s.where(s>=n-1).bfill(limit=n-1).notnull(), dfbig['data'], 0)

164 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

piRSquared

%%timeit
a = dfbig.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()

dfbig.data.where(np.bincount(b)[b] >= 4, 0)

62 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion (fastest to slowest):

  1. piRSquared
  2. Quang Hoang
  3. Erfan

Upvotes: 4

piRSquared

Reputation: 294258

bincount

import numpy as np

a = df.data.to_numpy()
# label runs: count up at every change point
b = np.append(0, a[:-1] != a[1:]).cumsum()
# np.bincount(b)[b] maps each row to the length of its run
df.assign(desired=df.data.where(np.bincount(b)[b] >= 4, 0))

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
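The mechanics, spelled out on the sample df (an annotation, not part of the original answer): b labels each run with a 0-based index, np.bincount(b) counts each run's length, and indexing the counts with [b] broadcasts those lengths back to the rows:

import numpy as np

a = df.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()
b.tolist()
# [0, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 7, 7, 7, 8, 9, 10, 11, 12, 13]
np.bincount(b).tolist()      # length of each run, indexed by run label
# [1, 1, 1, 1, 3, 1, 1, 6, 1, 1, 1, 1, 1, 1]
np.bincount(b)[b].tolist()   # each row mapped to its run's length
# [1, 1, 1, 1, 3, 3, 3, 1, 1, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1]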

Upvotes: 3

Quang Hoang

Reputation: 150735

IIUC, you can do:

import numpy as np

n = 4

# find where the value repeats n times
s = df['data'].diff().eq(0).rolling(n-1).sum()

# fill
df['desired'] = np.where(s.where(s >= n-1).bfill(limit=n-1).notnull(), df['data'], 0)

Output:

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
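Step by step (an annotation, not part of the original answer): s counts how many of the preceding n-1 diffs were zero, so s >= n-1 fires on the rows where a run has already reached at least n equal values, and bfill(limit=n-1) stretches that marker back to the start of the run:

n = 4
s = df['data'].diff().eq(0).rolling(n-1).sum()
kept = s.where(s >= n-1).bfill(limit=n-1).notnull()
kept.to_numpy().nonzero()[0]   # row positions that keep their value
# array([ 9, 10, 11, 12, 13, 14])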

Upvotes: 4
