ezgi uluhan
ezgi uluhan

Reputation: 21

delete values that repeat more than 3 times except for first one in Pandas DataFrame

I have a pandas DataFrame.For example,

        Date   Time    A     B   C 
0   1.1.2015  00:00    2    16  50  
1   1.1.2015  01:00    2     9  50   
2   1.1.2015  02:00    4     6  50   
3   1.1.2015  03:00    3     7  31  
4   1.1.2015  04:00    2     7  42    
5   1.1.2015  05:00    2     7  22    
6   1.1.2015  06:00    2     7  14  
7   1.1.2015  07:00    2    11  50    
8   1.1.2015  08:00    3    11  28   
9   1.1.2015  09:00    2    18  17  

I want to delete numbers that successively repeated more than 3 times but keep the first one. I need to delete:

1- rows 5,6 and 7 because there are four 2s in column A and i dont need the last three.

2- rows 4,5 and 6 because there are four 7s in column B

3- rows 1 and 2 because there are three 50s in column C

So the output i want is like:

           Date   Time    A     B   C 
0      1.1.2015  00:00    2    16  50  
1      1.1.2015  03:00    3     7  31 
2      1.1.2015  08:00    3    11  28   
3      1.1.2015  09:00    2    18  17 

I have searched similar problems and i find this one the most similar: "Removing values that repeat more than 5 times in Pandas DataFrame" and i tried to adjust it to my problem but i could not(i am a beginner at python). Can anyone help me?

Thanks.

Upvotes: 2

Views: 824

Answers (2)

Josh Anish
Josh Anish

Reputation: 109

def remover(x=[]):
    new = []
    cnt = Counter
    op = list(cnt(x).elements())
    op = np.array(op,dtype=np.int32)
    x = pd.Series(op)
    num = x.value_counts()
    num = np.array(num,dtype = np.int32)
    for val in num:
        if val >= 3:
            del(val)
    new.append(val) 
    print(new)

Upvotes: 0

Colonel Beauvel
Colonel Beauvel

Reputation: 31181

You can do, with itertools help:

import itertools
import numpy as np

def f(serie):
    xs = []
    for el, gr in itertools.groupby(serie):
        x = np.repeat(True, len(list(gr)))
        if len(x)>=3:
           x[1:]=False
        xs.append(x)
    return np.concatenate(xs)

df[df[['A','B','C']].apply(f, axis=0).apply(np.all, axis=1)]

#Out[64]:
#       Date   Time  A   B   C
#0  1.1.2015  00:00  2  16  50
#3  1.1.2015  03:00  3   7  31
#8  1.1.2015  08:00  3  11  28
#9  1.1.2015  09:00  2  18  17

The idea is to use the utility function f to count the number of consecutive element in a column and create the associated desired boolean mask - eg you can check the result of f(df['A']). And then aggregate these boolean masks using np.all to filter your original dataframe.

Upvotes: 1

Related Questions