Reputation: 43
Suppose I have a dataframe in Python with index, variable and value columns. I would like to delete the rows in which a variable has the same value as at a previous instant. In other words, I would like to get another dataframe that keeps only the rows where a variable's value changes. Example:
Original dataframe:
Index var val
0 x1 10
1 x2 11
2 x1 10
3 x3 12
4 x1 13
5 x3 14
6 x2 12
7 x1 14
Output table:
Index var val
0 x1 10
1 x2 11
2 x3 12
3 x1 13
4 x3 14
5 x2 12
6 x1 14
What would be the most efficient way to do this, given that I may have a lot of variables (around 1000)?
Cheers.
Upvotes: 0
Views: 366
Reputation: 7441
I found a step-by-step way: after sorting, it creates a couple of True/False 'helper' columns, filters on them, then restores the original order and returns just the desired columns.
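# Sort so that equal (var, val) pairs end up on adjacent rows.
# Note: this drops every repeated (var, val) pair, not only consecutive repeats.
df = df.sort_values(['var', 'val'])
# Move the index (named 'Index', as in the question) into a column
# so the original order can be restored later.
df = df.reset_index()
# Helper columns: does this row differ from the row above it?
df['varHasChanged'] = df['var'] != df['var'].shift(+1)
df['valHasChanged'] = df['val'] != df['val'].shift(+1)
# Keep a row if either var or val changed.
df = df[df['varHasChanged'] | df['valHasChanged']]
df = df.sort_values(['Index'])
df = df.set_index(['Index'])
df = df[['var', 'val']]
df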
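As a possible shortcut: the sort-and-compare steps above keep the first row of each distinct (var, val) pair, which is the behaviour `DataFrame.drop_duplicates` provides directly. The sketch below rebuilds the question's data (the names `df` and `deduped`, and the index being named `Index`, are assumptions for illustration). Note that this treats any repeated pair as a duplicate, even if the variable took other values in between.

```python
import pandas as pd

# Example data from the question; the index name "Index" is an assumption.
df = pd.DataFrame(
    {
        "var": ["x1", "x2", "x1", "x3", "x1", "x3", "x2", "x1"],
        "val": [10, 11, 10, 12, 13, 14, 12, 14],
    },
    index=pd.RangeIndex(8, name="Index"),
)

# Keep the first occurrence of every (var, val) pair, in the original row order.
deduped = df.drop_duplicates(subset=["var", "val"])
print(deduped)
```

The original index labels are kept, so the removed row shows up as a gap; `deduped.reset_index(drop=True)` would renumber the rows from 0.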
Upvotes: 0
Reputation: 7058
pandas groupby transform to the rescue. A good explanation of transform can be found here
duplicates = df.groupby('var')['val'].transform(lambda x: x.diff()==0)
df[~duplicates].reset_index(drop=True)
duplicates
Index
0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
result
var val
0 x1 10
1 x2 11
2 x3 12
3 x1 13
4 x3 14
5 x2 12
6 x1 14
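With many variables (the question mentions about 1000), calling a Python lambda once per group may become the bottleneck. A minimal sketch of the same idea without the lambda, assuming the question's data and using the built-in groupby `diff` (the names `step`, `duplicates` and `result` are only illustrative):

```python
import pandas as pd

# Example data from the question; the index name "Index" is an assumption.
df = pd.DataFrame(
    {
        "var": ["x1", "x2", "x1", "x3", "x1", "x3", "x2", "x1"],
        "val": [10, 11, 10, 12, 13, 14, 12, 14],
    },
    index=pd.RangeIndex(8, name="Index"),
)

# Difference to the previous row of the same variable.
# The first occurrence of each variable has no predecessor and yields NaN.
step = df.groupby("var")["val"].diff()

# A row is a duplicate when the value did not change (NaN == 0 is False,
# so first occurrences are always kept).
duplicates = step == 0

result = df[~duplicates].reset_index(drop=True)
print(result)
```

This only flags a row when it repeats the immediately preceding value of the same variable, so a variable that goes 10, 13, 10 keeps all three rows.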
Upvotes: 1