Reputation: 85
I've got a dataframe like this
Day,Minute,Second,Value
1,1,0,1
1,2,1,2
1,2,6,2
1,3,1,0
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Sometimes the sensor records incorrect values and the readings get re-recorded later with the correct values. For example, here we should delete rows two through four, since they are overridden by row five, which was recorded later but carries an earlier (or equal) timestamp. How do I filter out 'bad' rows like those? For the example, the expected output should be:
Day,Minute,Second,Value
1,1,0,1
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Here's pseudocode for an iterative solution:

for row in dataframe:
    for previous_row in rows of dataframe before row:
        if previous_row.timestamp >= row.timestamp:
            delete previous_row
I think there should be a vectorized solution, especially for the inner loop. I also don't want to mutate the dataframe while iterating over it, but I'm not sure there is an option other than working on a copy.
Here is some starter code to set up the example dataframe:
import pandas as pd
data = [{'Day':1, 'Minute':1, 'Second':0, 'Value':1},
{'Day':1, 'Minute':2, 'Second':1, 'Value':2},
{'Day':1, 'Minute':2, 'Second':6, 'Value':2},
{'Day':1, 'Minute':3, 'Second':1, 'Value':0},
{'Day':1, 'Minute':2, 'Second':1, 'Value':1},
{'Day':1, 'Minute':2, 'Second':5, 'Value':1},
{'Day':2, 'Minute':0, 'Second':1, 'Value':1},
{'Day':2, 'Minute':0, 'Second':5, 'Value':2}]
df = pd.DataFrame(data)
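For reference, here's a direct (still unvectorized) translation of my pseudocode, written with a boolean mask so nothing is deleted mid-iteration. It collapses the three time columns into one scalar timestamp, which assumes Minute and Second are always below 60:

```python
import pandas as pd

data = [{'Day': 1, 'Minute': 1, 'Second': 0, 'Value': 1},
        {'Day': 1, 'Minute': 2, 'Second': 1, 'Value': 2},
        {'Day': 1, 'Minute': 2, 'Second': 6, 'Value': 2},
        {'Day': 1, 'Minute': 3, 'Second': 1, 'Value': 0},
        {'Day': 1, 'Minute': 2, 'Second': 1, 'Value': 1},
        {'Day': 1, 'Minute': 2, 'Second': 5, 'Value': 1},
        {'Day': 2, 'Minute': 0, 'Second': 1, 'Value': 1},
        {'Day': 2, 'Minute': 0, 'Second': 5, 'Value': 2}]
df = pd.DataFrame(data)

# One sortable timestamp per row (assumes Minute < 60 and Second < 60).
t = df['Day'] * 3600 + df['Minute'] * 60 + df['Second']

# Keep a row only if every later-recorded row has a strictly later timestamp;
# a later row with an equal or earlier timestamp overrides it.
keep = [all(t.iloc[j] > t.iloc[i] for j in range(i + 1, len(df)))
        for i in range(len(df))]
result = df[keep].reset_index(drop=True)
print(result)
```

This produces the expected output above, but it's O(n^2), which is why I'm hoping for a vectorized version.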
Upvotes: 0
Views: 149
Reputation: 1614
If you have multiple rows for the same combination of Day, Minute, and Second but a different Value, I am assuming you want to retain the last recorded value and discard all the previous ones, considering they are "bad". You can do this simply by using drop_duplicates:
df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last')
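On the example frame, that call keeps only the last record for the duplicated (1, 2, 1) timestamp. Note that it only removes exact timestamp duplicates; rows like (1, 2, 6) and (1, 3, 1), which were superseded by a restart rather than re-recorded, survive, hence the update below:

```python
import pandas as pd

data = [{'Day': 1, 'Minute': 1, 'Second': 0, 'Value': 1},
        {'Day': 1, 'Minute': 2, 'Second': 1, 'Value': 2},
        {'Day': 1, 'Minute': 2, 'Second': 6, 'Value': 2},
        {'Day': 1, 'Minute': 3, 'Second': 1, 'Value': 0},
        {'Day': 1, 'Minute': 2, 'Second': 1, 'Value': 1},
        {'Day': 1, 'Minute': 2, 'Second': 5, 'Value': 1},
        {'Day': 2, 'Minute': 0, 'Second': 1, 'Value': 1},
        {'Day': 2, 'Minute': 0, 'Second': 5, 'Value': 2}]
df = pd.DataFrame(data)

# Of the two rows with timestamp (1, 2, 1), only the later one (Value 1) is kept.
deduped = df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last')
print(deduped)
```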
UPDATE v2: If you need to retain only the last group of ['Minute', 'Second'] combinations for each day, identify the monotonically increasing Minute groups (since Minute is the bigger of the two time units) and select the group with the max Group_Id for each Day:
res = pd.DataFrame()
for _, g in df.groupby('Day'):
    g = g.copy()  # work on a copy to avoid SettingWithCopyWarning
    # a new group starts wherever Minute decreases
    g['Group_Id'] = (g['Minute'].diff() < 0).cumsum()
    res = pd.concat([res, g[g['Group_Id'] == g['Group_Id'].max()]])
OUTPUT:
Day Minute Second Value Group_Id
1 2 1 1 1
1 2 5 1 1
2 0 1 1 0
2 0 5 2 0
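If you also want to keep earlier rows that were never superseded (the expected output in the question retains the (1, 1, 0, 1) row), here is a fully vectorized sketch. It assumes Minute and Second stay below 60 so the three columns collapse into a single scalar timestamp, then keeps a row only when every later row has a strictly later timestamp, computed via a reversed cumulative minimum:

```python
import numpy as np
import pandas as pd

data = [{'Day': 1, 'Minute': 1, 'Second': 0, 'Value': 1},
        {'Day': 1, 'Minute': 2, 'Second': 1, 'Value': 2},
        {'Day': 1, 'Minute': 2, 'Second': 6, 'Value': 2},
        {'Day': 1, 'Minute': 3, 'Second': 1, 'Value': 0},
        {'Day': 1, 'Minute': 2, 'Second': 1, 'Value': 1},
        {'Day': 1, 'Minute': 2, 'Second': 5, 'Value': 1},
        {'Day': 2, 'Minute': 0, 'Second': 1, 'Value': 1},
        {'Day': 2, 'Minute': 0, 'Second': 5, 'Value': 2}]
df = pd.DataFrame(data)

# Scalar timestamp (assumes Minute < 60 and Second < 60).
t = df['Day'] * 3600 + df['Minute'] * 60 + df['Second']

# Smallest timestamp among the rows recorded after each row;
# inf for the last row, which nothing can override.
later_min = t[::-1].cummin()[::-1].shift(-1).fillna(np.inf)

res = df[t < later_min].reset_index(drop=True)
print(res)
```

A row is dropped exactly when some later-recorded row has an equal or earlier timestamp, which matches the asker's "override" rule without any Python-level loop.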
Upvotes: 1