intStdu
intStdu

Reputation: 291

Pandas - Remove row if a specific value if repeated in a column and keep first

Imagine we have a dataframe:

   num  line    
0   1    56
1   1    90  
2   2    66  
3   3    4  
4   3    55  
5   3    104
6   1    23  
7   5    22  
8   3    144 

I want to remove the rows where specifically a 3 is repeated in the num column, and keep the first. So the two rows with repeating 1's in the num column should still be in the resulting DataFrame together with all the other columns.

What I have so far, which removes every double value, not only the 3's:

data.groupby((data['num'] != data['num'].shift()).cumsum().values).first()

Expected result or correct code:

   num  line    
0   1    56
1   1    90  
2   2    66  
3   3    4  
4   1    23  
5   5    22  
6   3    144 

Upvotes: 1

Views: 61

Answers (2)

jezrael
jezrael

Reputation: 862591

Use:

df = data[data['num'].ne(3) | data['num'].ne(data['num'].shift())]
print (df)
   num  line
0    1    56
1    1    90
2    2    66
3    3     4
6    1    23
7    5    22
8    3   144

Detail:

Compare for not equal:

print (data['num'].ne(3))
0     True
1     True
2     True
3    False
4    False
5    False
6     True
7     True
8    False
Name: num, dtype: bool

Compare by shifted values for first consecutive:

print (data['num'].ne(data['num'].shift()))
0     True
1    False
2     True
3     True
4    False
5    False
6     True
7     True
8     True
Name: num, dtype: bool

Chain by | for bitwise OR:

print (data['num'].ne(3) | data['num'].ne(data['num'].shift()))
0     True
1     True
2     True
3     True
4    False
5    False
6     True
7     True
8     True
Name: num, dtype: bool

Upvotes: 3

yatu
yatu

Reputation: 88236

You could use the bellow conditions in order to perform boolean indexation in the dataframe:

# True where num is 3
c1 = df['num'].eq(3)
# True where num is repeated
c2 = df['num'].eq(df['num'].shift(1))
# boolean indexation on df
df[(c1 & ~c2) | ~(c1)]

    num  line
0    1    56
1    1    90
2    2    66
3    3     4
6    1    23
7    5    22
8    3   144

Details

df.assign(is_3=c1, is_repeated=c2, filtered=(c1 & ~c2) | ~(c1))

   num  line   is_3  is_repeated  filtered
0    1    56  False        False      True
1    1    90  False         True      True
2    2    66  False        False      True
3    3     4   True        False      True
4    3    55   True         True     False
5    3   104   True         True     False
6    1    23  False        False      True
7    5    22  False        False      True
8    3   144   True        False      True

Upvotes: 2

Related Questions