Rivered

Reputation: 789

How to filter rows from pandas dataframe that increase in value based on datetime column?

Imagine we have a dataframe like this:

import pandas as pd
df = pd.DataFrame()
df['filename'] = ['118_3.JPG', '118_3.JPG', '118_3.JPG', '118_3.JPG', '118_3.JPG', '15_7.JPG', '15_7.JPG', '15_7.JPG', '15_7.JPG', '15_7.JPG','203_4.JPG', '203_4.JPG', '203_4.JPG', '203_4.JPG', '203_4.JPG']
df['cvxh_len'] = [100, 200, 3000, 2800, 29, 200, 400, 2, 1, 0, 5000, 6000, 9000, 11000, 15000]
df['date'] = ["2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17", "2018-12-18", "2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17", "2018-12-18", "2018-12-14", "2018-12-15", "2018-12-16", "2018-12-17", "2018-12-18" ]
df["date"] = pd.to_datetime(df["date"])

df

filename    cvxh_len    date
118_3.JPG   100         2018-12-14
118_3.JPG   200         2018-12-15
118_3.JPG   3000        2018-12-16
118_3.JPG   2800        2018-12-17
118_3.JPG   29          2018-12-18
15_7.JPG    200         2018-12-14
15_7.JPG    400         2018-12-15
15_7.JPG    2           2018-12-16
15_7.JPG    1           2018-12-17
15_7.JPG    0           2018-12-18
203_4.JPG   5000        2018-12-14
203_4.JPG   6000        2018-12-15
203_4.JPG   9000        2018-12-16
203_4.JPG   11000       2018-12-17
203_4.JPG   15000       2018-12-18

How can we remove rows that have a decreasing cvxh_len value over time (date) for each unique filename so that we end up with the following:

filename    cvxh_len    date
118_3.JPG   100         2018-12-14
118_3.JPG   200         2018-12-15
118_3.JPG   3000        2018-12-16
15_7.JPG    200         2018-12-14
15_7.JPG    400         2018-12-15
203_4.JPG   5000        2018-12-14
203_4.JPG   6000        2018-12-15
203_4.JPG   9000        2018-12-16
203_4.JPG   11000       2018-12-17
203_4.JPG   15000       2018-12-18

Upvotes: 2

Views: 643

Answers (2)

Rivered

Reputation: 789

I ended up writing a custom data processor: in my case there were more variable columns like cvxh_len, and the posted solutions did not take dates from two or three days earlier into account when a higher value sat in between the lower ones. It was also better to replace the offending values with NaN than to remove the rows. My solution is definitely slower, but it works:

CheckList=["cvxh_len"]                                                  #Can add as many variables as needed

#If this is not the case we have to remove the row
for i in list(df['filename'].unique()):                                 #for every unique filename
  df2 = pd.DataFrame()                                                  #We create a new df
  for index, row in df.iterrows():                                      #We need to fill this df with the other
    if row["filename"] == i:                                            #Find all filenames that match unique
      row["index"]=index                                              
      df2 = pd.concat([df2, row.to_frame().T], ignore_index=True)       #Add series to dataframe
      df2.sort_values(by = 'date') 
  for idx, r in df2.iterrows():                                         #For every item in new df iterate
    for M in list(range(len(df2))):                                     #To check earlier dates we need to find length 
      for h in CheckList:                                               #Variables to check
        if int(idx-M) in list(range(len(df2))):                         #Check if the item exists we are checking
          try:  
            if int(df2.loc[[idx]][h]) < int(df2.loc[[idx-int(M)]][h]):  #If value was lower on earlier timepoint
              df.loc[df2.loc[[idx]]["index"], h]=np.nan                 #We have to replace it with NaN 
          except ValueError:                                            #We need except statement because
            pass                                                        #Some values might be NaN beforehand and can not be subtracted

print(df)
    filename    cvxh_len    date
0   118_3.JPG   100.0       2018-12-14
1   118_3.JPG   200.0       2018-12-15
2   118_3.JPG   3000.0      2018-12-16
3   118_3.JPG   NaN         2018-12-17
4   118_3.JPG   NaN         2018-12-18
5   15_7.JPG    200.0       2018-12-14
6   15_7.JPG    400.0       2018-12-15
7   15_7.JPG    NaN         2018-12-16
8   15_7.JPG    NaN         2018-12-17
9   15_7.JPG    NaN         2018-12-18
10  203_4.JPG   5000.0      2018-12-14
11  203_4.JPG   6000.0      2018-12-15
12  203_4.JPG   9000.0      2018-12-16
13  203_4.JPG   11000.0     2018-12-17
14  203_4.JPG   15000.0     2018-12-18
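
For reference, a vectorized sketch of the same replace-with-NaN logic (a sketch assuming only cvxh_len needs checking; the extra columns in CheckList would just be looped over in the same way):

import numpy as np

# Within each filename, sorted by date, any value below the running maximum
# seen so far is replaced with NaN in the original frame.
df_sorted = df.sort_values(['filename', 'date'])
running_max = df_sorted.groupby('filename')['cvxh_len'].cummax()
mask = df_sorted['cvxh_len'] < running_max
df.loc[mask[mask].index, 'cvxh_len'] = np.nan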

Upvotes: 0

n4321d

Reputation: 1205

I think you are looking for this:

df.loc[df.cvxh_len.diff().fillna(0) >= 0]

To explain: you take the differential of the variable you don't want to decrease. If that is smaller than 0, it is decreasing.

Then you reindex your dataframe using the locations where the differential is greater than or equal to 0.
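
Note that diff() here compares each row with the immediately preceding row of the whole frame, regardless of filename; it gives the expected output for this data only because each file's first value happens to be at least as large as the previous file's last value. If the comparison should stay within each filename, a grouped variant (a sketch, assuming the column names from the question) would be:

df.loc[df.groupby('filename')['cvxh_len'].diff().fillna(0) >= 0]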

Upvotes: 1
