How to filter a dataframe efficient when I have to compare values of two columns?

Question

at the moment I am working on a project and ran in some efficency issue. At some point I have a huge dataframe and need to drop rows. My dataframe contains two columns where the value is either '0', '1' or an other np.nan.

I need to drop all rows where the first column is '0' and the other is '1' or vice versa. I want to keep all other combinations.

I wrote a function that does the job but it is highly inefficient and takes ages. Here is an example of an input, my function and the expected output.

data = {'val': ['a', 'b', 'c'],
         'compare1': ['0', '0', '1'],
        'compare2': ['0', None,  '0']
        }

df = pd.DataFrame (data)
df

>       val     compare1    compare2
>  0    a       0           0
>  1    b       0           None
>  2    c       1           0

my function:

def filter_df(df, colname1, colname2):
    for index, row in df.iterrows():
        if (row[colname1] == "0") and (row[colname2] == "1"):
            df.drop(index, inplace=True)
        if (row[colname1] == "1") and (row[colname2] == "0"):
            df.drop(index, inplace=True)
    return df

the output I want to have and get with the fkt:

df_new = filter_df(df,'compare1', 'compare2')
df_new

>       val     compare1    compare2
>  0    a       0           0
>  1    b       0           None

If you have an idea to make it more eficient please let me know :-). I am very happy for all kind of help.

Best P

adir abargil · Accepted Answer

you can use conditions like this:

mask1 = ~((df['compare1'] == "0") & (df['compare2'] == "1")) 
mask2 = ~((df['compare1'] == "1") & (df['compare2'] == "0"))
df = df[mask1 & mask2 ]

if the performance is still not efficient for you I would suggest you to move to numpy for better performance...

How to filter a dataframe efficient when I have to compare values of two columns?

Answers (2)

Related Questions