Pet
Pet

Reputation: 251

How to filter a dataframe efficient when I have to compare values of two columns?

at the moment I am working on a project and ran in some efficency issue. At some point I have a huge dataframe and need to drop rows. My dataframe contains two columns where the value is either '0', '1' or an other np.nan.

I need to drop all rows where the first column is '0' and the other is '1' or vice versa. I want to keep all other combinations.

I wrote a function that does the job but it is highly inefficient and takes ages. Here is an example of an input, my function and the expected output.

data = {'val': ['a', 'b', 'c'],
         'compare1': ['0', '0', '1'],
        'compare2': ['0', None,  '0']
        }

df = pd.DataFrame (data)
df

>       val     compare1    compare2
>  0    a       0           0
>  1    b       0           None
>  2    c       1           0

my function:

def filter_df(df, colname1, colname2):
    for index, row in df.iterrows():
        if (row[colname1] == "0") and (row[colname2] == "1"):
            df.drop(index, inplace=True)
        if (row[colname1] == "1") and (row[colname2] == "0"):
            df.drop(index, inplace=True)
    return df

the output I want to have and get with the fkt:

df_new = filter_df(df,'compare1', 'compare2')
df_new

>       val     compare1    compare2
>  0    a       0           0
>  1    b       0           None

If you have an idea to make it more eficient please let me know :-). I am very happy for all kind of help.

Best P

Upvotes: 1

Views: 43

Answers (2)

sammywemmy
sammywemmy

Reputation: 28644

You could convert the dataframe to numeric types :

df = df.transform(pd.to_numeric, errors="ignore")
df


  val   compare1    compare2
0   a      0        0.0
1   b      0        NaN
2   c      1        0.0

Then filter where the sum of the compare columns is equal to 1 :

df.loc[~df.compare1.add(df.compare2).eq(1)]


   val  compare1    compare2
0   a       0       0.0
1   b       0       NaN

Upvotes: 1

adir abargil
adir abargil

Reputation: 5745

you can use conditions like this:

mask1 = ~((df['compare1'] == "0") & (df['compare2'] == "1")) 
mask2 = ~((df['compare1'] == "1") & (df['compare2'] == "0"))
df = df[mask1 & mask2 ]

if the performance is still not efficient for you I would suggest you to move to numpy for better performance...

Upvotes: 1

Related Questions