Replacing more than n consecutive values in Pandas DataFrame column

Supposing I have the following DataFrame df

df = pd.DataFrame({"a" : [1,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], "b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], "c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

And I wish to replace number 4's which repeat more than 10 times in a row, in any column (there could be hundreds of columns), with 10 4's and the remainder 5's.

So for example, 12 consecutive 4's would be replaced with ten 4's and two 5's.

How would I achieve this with Pandas?

I'd like to apply a lambda, but I don't know how to look back enough rows, and it would have to start from the end and move through forward, or it would break the sequence of values. Each lookup would have to look at the preceding 10 rows to see if they are all equal to 4, and if so, set the current value to 5.

No idea how to go about it though!

Upvotes: 4

Answers (3)

Zeugma

Reputation: 32095

Remove all the 4s, fillna back with 4s using limit=10 as argument, and remove the remaining NA with 5s. I find this method more explicit and reflecting more your intention:

df[df!=4].fillna(4, limit=10).fillna(5)

If needed cast the df back to integers with astype(int) at the end, as the intrusion of NAs will cast the dataframe into floats.

Upvotes: 3

jezrael

Reputation: 862611

You can use:

#column a is changed for 2 groups of 4
df = pd.DataFrame({
"a" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,7,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], 
"b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], 
"c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

Solution counts consecutives 4 with reset if NaN created by where and then apply boolean mask to original df for replace 4 to 5 by mask:

a = df == 4
mask = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0) > 10
df1 = df.mask(mask, 5)

print (df1)
    a  b  c
0   4  3  4
1   4  3  4
2   4  3  4
3   4  3  4
4   4  3  4
5   4  3  4
6   4  3  4
7   4  4  4
8   4  4  4
9   4  4  4
10  5  4  5
11  5  5  5
12  5  5  5
13  5  5  5
14  7  5  5
15  4  5  5
16  4  5  5
17  4  5  5
18  4  5  5
19  4  5  5
20  4  5  5
21  4  5  1
22  4  5  2
23  4  5  2
24  4  5  2
25  5  5  2
26  5  5  2
27  5  5  2
28  5  6  2
29  5  6  2
30  5  7  3
31  5  7  3

For better checking values is possible use concat:

print (pd.concat([df, df1], axis=1, keys=['orig','new']))
   orig       new      
      a  b  c   a  b  c
0     4  3  4   4  3  4
1     4  3  4   4  3  4
2     4  3  4   4  3  4
3     4  3  4   4  3  4
4     4  3  4   4  3  4
5     4  3  4   4  3  4
6     4  3  4   4  3  4
7     4  4  4   4  4  4
8     4  4  4   4  4  4
9     4  4  4   4  4  4
10    4  4  4   5  4  5
11    4  5  4   5  5  5
12    4  5  4   5  5  5
13    4  5  4   5  5  5
14    7  5  4   7  5  5
15    4  5  4   4  5  5
16    4  5  4   4  5  5
17    4  5  4   4  5  5
18    4  5  5   4  5  5
19    4  5  5   4  5  5
20    4  5  5   4  5  5
21    4  5  1   4  5  1
22    4  5  2   4  5  2
23    4  5  2   4  5  2
24    4  5  2   4  5  2
25    4  5  2   5  5  2
26    4  5  2   5  5  2
27    4  5  2   5  5  2
28    4  6  2   5  6  2
29    5  6  2   5  6  2
30    5  7  3   5  7  3
31    5  7  3   5  7  3

Upvotes: 4

Alex Fung

Reputation: 2006

This should do the trick:

import pandas as pd

df = pd.DataFrame({"a" : [1,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], "b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], "c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

def replacer(l,target_val=4,replace_val=5,repeat_max=10):
    counter = 0
    new_l = []
    for e in l:
        if e == target_val: counter += 1
        else:
            counter = 0

        if counter > repeat_max:
            new_l.append(replace_val)
        else:
            new_l.append(e)

    return new_l

df1 = df.apply(replacer)

Output:

    a  b  c
0   1  3  4
1   2  3  4
2   2  3  4
3   2  3  4
4   2  3  4
5   2  3  4
6   2  3  4
7   2  4  4
8   2  4  4
9   3  4  4
10  3  4  5
11  4  5  5
12  4  5  5
13  4  5  5
14  4  5  5
15  4  5  5
16  4  5  5
17  4  5  5
18  4  5  5
19  4  5  5
20  4  5  5
21  5  5  1
22  5  5  2
23  5  5  2
24  5  5  2
25  5  5  2
26  5  5  2
27  5  5  2
28  5  6  2
29  5  6  2
30  5  7  3
31  5  7  3

Upvotes: 1

Replacing more than n consecutive values in Pandas DataFrame column

Answers (3)

Related Questions