zorals

Reputation: 309

Dropping value in a dataframe in a loop

I have a dataframe with sorted values:

import numpy as np
import pandas as pd

sub_run = pd.DataFrame({'Runoff':[45,10,5,26,30,23,35], 'ind':[3, 10, 25,43,53,60,93]})

I would like to start from the highest value in Runoff (45) and drop all rows whose "ind" differs from its "ind" by less than 30 (the rows with Runoff 10 and 5), then update the DataFrame. Next, I go to the second-highest value (35) and drop the rows whose "ind" is within 30 of it, then to the third-highest value (30) and drop 26 and 23, and so on. I wrote the following code:

pre_ind = []

for (idx1, row1) in sub_run.iterrows():
     var = row1.ind
     pre_ind.append(np.array(var))
     for (idx2,row2) in sub_run.iterrows():
         if (row2.ind != var) and (row2.ind not in pre_ind):
            test = abs(row2.ind - var)
            print("test" , test)
            if test <= 30:
                 sub_run = sub_run.drop(sub_run[sub_run.ind == row2.ind].index)
                 

I expect the output to contain the values [45, 35, 30]. However, I only get the first one.

Many thanks

Upvotes: 3

Views: 89

Answers (2)

kuriboh

Reputation: 141

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html

In your case, reassigning sub_run inside the loop has no immediate effect on the outer iteration, because iterrows() keeps walking the original DataFrame. Therefore, if the frame is sorted by Runoff in descending order, after the row (45, 3) the outer loop still visits (35, 93), then (30, 53), (26, 43), (23, 60), (10, 10) and (5, 25), including rows that have already been dropped. The inner loop does see your modification, because every pass of the outer loop starts a new iteration over the updated sub_run.

Here is my suggested code, inspired by bubble sort.
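To make the point about the outer loop concrete, here is a minimal sketch on a shortened, made-up frame (the three-row `df` and the `visited` list are illustrative names, not part of the question). It assumes nothing beyond plain pandas and shows that rebinding the name inside the loop does not change which rows `iterrows()` yields:

```python
import pandas as pd

# Shortened, illustrative frame (not the full data from the question)
df = pd.DataFrame({'Runoff': [45, 10, 5], 'ind': [3, 10, 25]})

visited = []
for idx, row in df.iterrows():
    visited.append(int(row['ind']))
    # drop() returns a new DataFrame and the name df is rebound to it;
    # the iterator created above still walks the original object.
    df = df.drop(df[df['ind'] == 10].index)

print(visited)             # [3, 10, 25]: the dropped row is still visited
print(df['ind'].tolist())  # [3, 25]: but it is gone from the new frame
```

The row with ind 10 is removed during the first pass, yet the loop still visits it on the second pass.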

import pandas as pd

sub_run = pd.DataFrame({'Runoff': [45,10,5,26,30,23,35],
                        'ind': [3,10,25,43,53,60,93]})


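# Process rows from the highest Runoff to the lowest
sub_run = sub_run.sort_values(by=['Runoff'], ascending=False)
highestRow = 0

while highestRow < len(sub_run) - 1:
    # Keep a reference to the frame as it is at the start of this pass;
    # sub_run is rebound below, cur_run is not
    cur_run = sub_run
    highestRunoffInd = cur_run.iloc[highestRow].ind
    # Only rows with a lower Runoff than the current one can be dropped
    for i in range(highestRow + 1, len(cur_run)):
        ind = cur_run.iloc[i].ind
        if abs(ind - highestRunoffInd) <= 30:
            sub_run = sub_run.drop(sub_run[sub_run.ind == ind].index)
    highestRow += 1
print(sub_run)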

Output:

   Runoff  ind
0      45    3
6      35   93
4      30   53

Upvotes: 1

bpfrd

Reputation: 1025

Try this:

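list_pre_max = []  # Runoff values already used as the current maximum
while True:
    try:
        # Next-highest Runoff among the rows that are still present
        max_val = sub_run.Runoff.sort_values(ascending=False).iloc[len(list_pre_max)]
    except IndexError:
        # Every remaining row has been processed
        break
    # .item() assumes each Runoff value occurs only once
    max_ind = sub_run.loc[sub_run['Runoff'] == max_val, 'ind'].item()
    list_pre_max.append(max_val)
    # Rows within 30 of max_ind, excluding the current row and earlier maxima
    dropped_indices = sub_run.loc[(abs(sub_run['ind']-max_ind) <= 30) & (sub_run['ind'] != max_ind) & (~sub_run.Runoff.isin(list_pre_max))].index
    sub_run.drop(index=dropped_indices, inplace=True)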

Output:

>>> sub_run
   Runoff  ind
0      45    3
4      30   53
6      35   93

Upvotes: 1
