CSHhhh

Reputation: 85

Pandas: how to filter previous rows based on a later row

I've got a dataframe like this:

Day,Minute,Second,Value
1,1,0,1
1,2,1,2
1,2,6,2
1,3,1,0
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2

Sometimes the sensor records incorrect values, and the readings get added again later with the correct values. For example, here we should delete rows two through four, because row five arrives later in the log but carries a timestamp that is no later than theirs, overriding them. How do I filter out 'bad', superseded rows like those? For the example, the expected output should be:

Day,Minute,Second,Value
1,1,0,1
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2

Here's pseudocode for an iterative solution:

for row in dataframe:
    for previous_row in rows in dataframe before row:
        if previous_row > row:
            delete previous_row

I think there should be a vectorized solution, especially for the inner loop. I also don't want to modify what I'm iterating over, but I'm not sure there's another option besides duplicating the dataframe.
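Made runnable against the example data, that brute-force idea could look like the sketch below. Two deviations from the pseudocode, both assumptions on my part: indices are collected first and dropped afterwards, so the frame isn't modified mid-iteration, and timestamps are compared with `>=` rather than `>` so that a re-record at the exact same timestamp also overrides the earlier row:

```python
import pandas as pd

df = pd.DataFrame({'Day':    [1, 1, 1, 1, 1, 1, 2, 2],
                   'Minute': [1, 2, 2, 3, 2, 2, 0, 0],
                   'Second': [0, 1, 6, 1, 1, 5, 1, 5],
                   'Value':  [1, 2, 2, 0, 1, 1, 1, 2]})

# Collect doomed indices first so the frame is not modified mid-iteration.
to_drop = set()
for i in range(len(df)):
    day = df.at[i, 'Day']
    ts = (df.at[i, 'Minute'], df.at[i, 'Second'])
    for j in range(i):
        # An earlier row on the same day whose timestamp is not earlier
        # than a later row's has been re-recorded, i.e. it is 'bad'.
        if df.at[j, 'Day'] == day and (df.at[j, 'Minute'], df.at[j, 'Second']) >= ts:
            to_drop.add(j)

cleaned = df.drop(index=to_drop).reset_index(drop=True)
```

This is O(n²) in the number of rows, which is exactly why a vectorized formulation is worth looking for on longer logs.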

Here is some starter code to work with the example dataframe

import pandas as pd
data = [{'Day':1, 'Minute':1, 'Second':0, 'Value':1},
{'Day':1, 'Minute':2, 'Second':1, 'Value':2}, 
{'Day':1, 'Minute':2, 'Second':6, 'Value':2}, 
{'Day':1, 'Minute':3, 'Second':1, 'Value':0},
{'Day':1, 'Minute':2, 'Second':1, 'Value':1},
{'Day':1, 'Minute':2, 'Second':5, 'Value':1}, 
{'Day':2, 'Minute':0, 'Second':1, 'Value':1}, 
{'Day':2, 'Minute':0, 'Second':5, 'Value':2}]

df = pd.DataFrame(data)

Upvotes: 0

Views: 149

Answers (1)

panktijk

Reputation: 1614

If you have multiple rows with the same combination of Day, Minute, and Second but different Values, I am assuming you want to retain the last recorded value and discard all the previous ones, considering them "bad".

You can do this simply by using drop_duplicates:

df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last')
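For example, on the question's sample frame (rebuilt below so the snippet runs on its own), this keeps the later of the two readings recorded at Day 1, Minute 2, Second 1:

```python
import pandas as pd

df = pd.DataFrame({'Day':    [1, 1, 1, 1, 1, 1, 2, 2],
                   'Minute': [1, 2, 2, 3, 2, 2, 0, 0],
                   'Second': [0, 1, 6, 1, 1, 5, 1, 5],
                   'Value':  [1, 2, 2, 0, 1, 1, 1, 2]})

# Keep only the last row recorded for each exact (Day, Minute, Second) timestamp.
deduped = df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last')
```

Note that rows whose timestamps were merely overtaken, rather than exactly repeated, survive this step.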

UPDATE v2:

If you need to retain only the last group of ['Minute', 'Second'] combinations for each day, identify monotonically increasing Minute groups (Minute being the larger of the two time units) and, within each Day, select the group with the maximum Group_Id:

res = pd.DataFrame()
for _, g in df.groupby('Day'):
    g = g.copy()  # work on a copy to avoid a SettingWithCopyWarning
    g['Group_Id'] = (g['Minute'].diff() < 0).cumsum()
    res = pd.concat([res, g[g['Group_Id'] == g['Group_Id'].max()]])

OUTPUT:

Day Minute  Second  Value   Group_Id
1   2       1       1       1
1   2       5       1       1
2   0       1       1       0
2   0       5       2       0
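The per-day loop can also be expressed without explicit Python iteration; here is a sketch using groupby with transform that reproduces the same selection (minus the helper Group_Id column):

```python
import pandas as pd

df = pd.DataFrame({'Day':    [1, 1, 1, 1, 1, 1, 2, 2],
                   'Minute': [1, 2, 2, 3, 2, 2, 0, 0],
                   'Second': [0, 1, 6, 1, 1, 5, 1, 5],
                   'Value':  [1, 2, 2, 0, 1, 1, 1, 2]})

# A new group starts whenever Minute decreases within a day ...
gid = df.groupby('Day')['Minute'].diff().lt(0).groupby(df['Day']).cumsum()
# ... and only each day's last (maximum-id) group is kept.
res = df[gid == gid.groupby(df['Day']).transform('max')]
```

This avoids both the Python-level loop and the repeated `pd.concat` calls, which get expensive as the number of days grows.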

Upvotes: 1
