MattM

Reputation: 337

Replacing Values in a Column on the Basis of n Consecutive Entries

I have a column of data with values of 0, 1, and 2. I would like to replace all instances of 1 and 2 with 0 if those occurrences do not persist for n rows. If the values do persist for n rows, I would like to leave them as-is. Example, where n = 4:

import pandas as pd

df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0

The intended functionality is essentially to "filter" the underlying data for brief changes in values. I'd like to be able to declare the number of consecutive values, n, required to keep the underlying data, including every value in a qualifying run, both before and beyond n. (If n = 4 and there are 6 consecutive values, I want all 6, not just the 2 that persisted beyond the cutoff of 4.) Is there a vectorized way of doing this in Pandas?
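For reference, the intended behavior can be spelled out with a plain loop, as a minimal non-vectorized sketch to pin down the semantics (the function name filter_short_runs is introduced here only for illustration):

import pandas as pd

def filter_short_runs(s, n):
    # zero out every run of identical values shorter than n rows
    out = s.copy()
    start = 0
    for i in range(1, len(s) + 1):
        # a run ends when the value changes or the series ends
        if i == len(s) or s.iloc[i] != s.iloc[start]:
            if i - start < n:
                out.iloc[start:i] = 0
            start = i
    return out

Applied with n = 4 to the data column above, this reproduces the desired column; the answers below achieve the same result without the Python-level loop.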

Upvotes: 4

Views: 60

Answers (3)

Erfan

Reputation: 42906

Take the cumulative sum over the points where the difference is nonzero to label each run of consecutive equal values, then get the size of each group:

n = 4
groups = df['data'].diff().ne(0).cumsum()
df['desired'] = df['data'].where(df.groupby(groups)['data'].transform('size').ge(n), other=0)

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
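To see why this works (shown here on the sample df as an illustration): the cumsum assigns one label per run of identical values, and transform('size') broadcasts each run's length back onto its rows. Note that row 0's NaN diff counts as a change, so the labels start at 1.

groups = df['data'].diff().ne(0).cumsum()
groups.tolist()
# [1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 8, 9, 10, 11, 12, 13, 14]
df.groupby(groups)['data'].transform('size').tolist()
# [1, 1, 1, 1, 3, 3, 3, 1, 1, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1]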

Timings:

# create a sample dataframe of ~1 million rows (1,050,000)
df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
dfbig = pd.concat([df]*50000, ignore_index=True)
dfbig.shape

(1050000, 1)

Erfan

%%timeit
n = 4

groups = dfbig['data'].diff().ne(0).cumsum()
dfbig['data'].where(dfbig.groupby(groups)['data'].transform('size').ge(n), other=0)

268 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Quang Hoang

%%timeit
n=4
s = dfbig['data'].diff().eq(0).rolling(n-1).sum()

# fill
np.where(s.where(s>=n-1).bfill(limit=n-1).notnull(), dfbig['data'], 0)

164 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

piRSquared

%%timeit
a = dfbig.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()

dfbig.data.where(np.bincount(b)[b] >= 4, 0)

62 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion (fastest to slowest):

  1. piRSquared
  2. Quang Hoang
  3. Erfan

Upvotes: 4

piRSquared

Reputation: 294258

bincount

import numpy as np

a = df.data.to_numpy()
# label runs: count up at every change point
b = np.append(0, a[:-1] != a[1:]).cumsum()
# np.bincount(b)[b] maps each row to the length of its run
df.assign(desired=df.data.where(np.bincount(b)[b] >= 4, 0))

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
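The mechanics, spelled out on the sample df (an annotation, not part of the original answer): b labels each run with a 0-based index, np.bincount(b) counts each run's length, and indexing the counts with [b] broadcasts those lengths back to the rows:

import numpy as np

a = df.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()
b.tolist()
# [0, 1, 2, 3, 4, 4, 4, 5, 6, 7, 7, 7, 7, 7, 7, 8, 9, 10, 11, 12, 13]
np.bincount(b).tolist()      # length of each run, indexed by run label
# [1, 1, 1, 1, 3, 1, 1, 6, 1, 1, 1, 1, 1, 1]
np.bincount(b)[b].tolist()   # each row mapped to its run's length
# [1, 1, 1, 1, 3, 3, 3, 1, 1, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1]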

Upvotes: 3

Quang Hoang

Reputation: 150735

IIUC, you can do:

import numpy as np

n = 4

# find where the value repeats n times
s = df['data'].diff().eq(0).rolling(n-1).sum()

# fill
df['desired'] = np.where(s.where(s >= n-1).bfill(limit=n-1).notnull(), df['data'], 0)

Output:

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
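Step by step (an annotation, not part of the original answer): s counts how many of the preceding n-1 diffs were zero, so s >= n-1 fires on the rows where a run has already reached at least n equal values, and bfill(limit=n-1) stretches that marker back to the start of the run:

n = 4
s = df['data'].diff().eq(0).rolling(n-1).sum()
kept = s.where(s >= n-1).bfill(limit=n-1).notnull()
kept.to_numpy().nonzero()[0]   # row positions that keep their value
# array([ 9, 10, 11, 12, 13, 14])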

Upvotes: 4
