Johnny

Reputation: 43

Delete variables based on their duplicate consecutive values in pandas

Suppose I have a DataFrame in Python with index, variable and value columns. I would like to delete the rows in which a variable has the same value as at its previous occurrence. In other words, I would like to get another DataFrame containing only the rows where a variable's value changes. For example:
Original dataframe:

Index  var  val
0      x1   10
1      x2   11
2      x1   10
3      x3   12
4      x1   13
5      x3   14
6      x2   12
7      x1   14

Output table:

Index  var  val
0      x1   10
1      x2   11
2      x3   12
3      x1   13
4      x3   14
5      x2   12
6      x1   14

What would be the most efficient way to do this? Imagine that I have a lot of variables (1000).
Cheers.

Upvotes: 0

Views: 366

Answers (2)

cardamom

Reputation: 7441

I found a step-by-step way: after sorting, it creates a few True/False 'helper' columns, filters on them, then restores the original order and returns just the desired columns.

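# Assumes the DataFrame index is named 'Index', as in the question.
# Stable sort on 'var' only, so each variable's rows stay in time order;
# sorting on 'val' as well would also drop non-consecutive repeats.
df = df.sort_values('var', kind='mergesort')
df = df.reset_index()
df['varHasChanged'] = df['var'] != df['var'].shift(+1)
df['valhasChanged'] = df['val'] != df['val'].shift(+1)
# Keep a row if the variable or its value differs from the row above.
df = df[df.varHasChanged | df.valhasChanged]
df = df.sort_values(['Index'])
df = df.set_index(['Index'])
df = df[['var', 'val']]
df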
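The same sort-and-shift idea can be sketched as a self-contained function. This is only an illustration: the function name `drop_repeated_values` and the demo data are hypothetical, and the sketch assumes the columns are called `var` and `val` as in the question. It sorts on `var` alone with a stable sort, so a value that comes back after a change (10, 13, 10) is kept.

```python
import pandas as pd


def drop_repeated_values(df):
    """Drop rows whose 'val' equals the previous 'val' of the same 'var'."""
    # Stable sort on 'var' only: rows of one variable become adjacent
    # while keeping their original (time) order.
    ordered = df.sort_values("var", kind="mergesort")
    var_changed = ordered["var"] != ordered["var"].shift(1)
    val_changed = ordered["val"] != ordered["val"].shift(1)
    kept = ordered[var_changed | val_changed]
    # Restore the original row order.
    return kept.sort_index()


# Hypothetical data: x1 goes 10 -> 10 -> 13 -> 10.
demo = pd.DataFrame(
    {"var": ["x1", "x2", "x1", "x1", "x1"], "val": [10, 11, 10, 13, 10]}
)
result = drop_repeated_values(demo)
print(result)
```

Only the second consecutive 10 for x1 is dropped here; the final 10 stays because the value changed in between.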

Upvotes: 0

Maarten Fabré

Reputation: 7058

pandas' groupby().transform to the rescue.

A good explanation of transform can be found here

duplicates = df.groupby('var')['val'].transform(lambda x: x.diff()==0)
df[~duplicates].reset_index(drop=True)

duplicates

Index
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False

result

   var  val
0   x1   10
1   x2   11
2   x3   12
3   x1   13
4   x3   14
5   x2   12
6   x1   14
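Since the question asks about efficiency with many variables, one possible variant avoids the Python-level lambda by comparing each value with the previous value of the same variable via `groupby(...).shift()`. This is a sketch rather than a benchmarked recommendation; the data is copied from the question, and the name `previous` is introduced here only for illustration.

```python
import pandas as pd

# Example data copied from the question; the index is named "Index".
df = pd.DataFrame(
    {
        "var": ["x1", "x2", "x1", "x3", "x1", "x3", "x2", "x1"],
        "val": [10, 11, 10, 12, 13, 14, 12, 14],
    },
    index=pd.RangeIndex(8, name="Index"),
)

# Previous value of the same variable (NaN for a variable's first row).
previous = df.groupby("var")["val"].shift()

# A row is a duplicate when its value equals that previous value.
duplicates = df["val"].eq(previous)

result = df[~duplicates].reset_index(drop=True)
print(result)
```

On the question's data this flags only the row at Index 2, the same row as the transform-based mask above.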

Upvotes: 1
