Reputation: 789
Imagine we have a dataframe like this:
import pandas as pd
df = pd.DataFrame()
df['filename'] = ['118_3.JPG', '118_3.JPG', '118_3.JPG', '118_3.JPG', '118_3.JPG', '15_7.JPG', '15_7.JPG', '15_7.JPG', '15_7.JPG', '15_7.JPG','203_4.JPG', '203_4.JPG', '203_4.JPG', '203_4.JPG', '203_4.JPG']
df['cvxh_len'] = [100, 200, 3000, 2800, 29, 200, 400, 2, 1, 0, 5000, 6000, 9000, 11000, 15000]
df['date'] = ["2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17", "2018-12-18", "2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17", "2018-12-18", "2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17", "2018-12-18" ]
df["date"] = pd.to_datetime(df["date"])
df
filename cvxh_len date
118_3.JPG 100 2018-12-14
118_3.JPG 200 2018-12-15
118_3.JPG 3000 2018-12-16
118_3.JPG 2800 2018-12-17
118_3.JPG 29 2018-12-18
15_7.JPG 200 2018-12-14
15_7.JPG 400 2018-12-15
15_7.JPG 2 2018-12-16
15_7.JPG 1 2018-12-17
15_7.JPG 0 2018-12-18
203_4.JPG 5000 2018-12-14
203_4.JPG 6000 2018-12-15
203_4.JPG 9000 2018-12-16
203_4.JPG 11000 2018-12-17
203_4.JPG 15000 2018-12-18
How can we remove rows that have a decreasing cvxh_len value over time (date) for each unique filename so that we end up with the following:
filename cvxh_len date
118_3.JPG 100 2018-12-14
118_3.JPG 200 2018-12-15
118_3.JPG 3000 2018-12-16
15_7.JPG 200 2018-12-14
15_7.JPG 400 2018-12-15
203_4.JPG 5000 2018-12-14
203_4.JPG 6000 2018-12-15
203_4.JPG 9000 2018-12-16
203_4.JPG 11000 2018-12-17
203_4.JPG 15000 2018-12-18
Upvotes: 2
Views: 643
Reputation: 789
I ended up writing a custom data processor, as for my case there were more variable columns such as cvxh_len, and the posted solutions weren't taking into account the dates from 2 or 3 days earlier if there was a higher value inbetween the lower ones in some cases. Also, it was better to replace the wrong values with NaN than to remove the row. My solution is definitely slower, but it does work
CheckList=["cvxh_len"] #Can add as many variables as needed
#If this is not the case we have to remove the row
for i in list(df['filename'].unique()): #for every unique filename
df2 = pd.DataFrame() #We create a new df
for index, row in df.iterrows(): #We need to fill this df with the other
if row["filename"] == i: #Find all filenames that match unique
row["index"]=index
df2 = pd.concat([df2, row.to_frame().T], ignore_index=True) #Add series to dataframe
df2.sort_values(by = 'date')
for idx, r in df2.iterrows(): #For every item in new df iterate
for M in list(range(len(df2))): #To check earlier dates we need to find length
for h in CheckList: #Variables to check
if int(idx-M) in list(range(len(df2))): #Check if the item exists we are checking
try:
if int(df2.loc[[idx]][h]) < int(df2.loc[[idx-int(M)]][h]): #If value was lower on earlier timepoint
df.loc[df2.loc[[idx]]["index"], h]=np.nan #We have to replace it with NaN
except ValueError: #We need except statement because
pass #Some values might be NaN beforehand and can not be subtracted
print(df)
filename cvxh_len date
0 118_3.JPG 100.0 2018-12-14
1 118_3.JPG 200.0 2018-12-15
2 118_3.JPG 3000.0 2018-12-16
3 118_3.JPG NaN 2018-12-17
4 118_3.JPG NaN 2018-12-18
5 15_7.JPG 200.0 2018-12-14
6 15_7.JPG 400.0 2018-12-15
7 15_7.JPG NaN 2018-12-16
8 15_7.JPG NaN 2018-12-17
9 15_7.JPG NaN 2018-12-18
10 203_4.JPG 5000.0 2018-12-14
11 203_4.JPG 6000.0 2018-12-15
12 203_4.JPG 9000.0 2018-12-16
13 203_4.JPG 11000.0 2018-12-17
14 203_4.JPG 15000.0 2018-12-18
Upvotes: 0
Reputation: 1205
I think you are looking for this:
df.loc[df.cvxh_len.diff().fillna(0) >= 0]
To explain: you take the differential of the variable you dont want to decrease. if that is smaller than 0 it is decreasing.
then you reindex you data frame using the locations where the differential is bigger or same as 0
Upvotes: 1