Mainland
Mainland

Reputation: 4584

Python Dataframe delete rows after comparing multiple column values with a value

I have data frame of many columns consisting float values. I want to delete a row if any of the columns have value below 20.

code:

xdf = pd.DataFrame({'A':np.random.uniform(low=-50, high=53.3, size=(5)),'B':np.random.uniform(low=10, high=130, size=(5)),'C':np.random.uniform(low=-50, high=130, size=(5)),'D':np.random.uniform(low=-100, high=200, size=(5))})

xdf =  
           A          B           C           D
0  -9.270533  42.098425   91.125009  148.350655
1  17.771411  55.564825  106.396381  -89.082831
2 -22.602563  99.330643   17.590466   73.985202
3  15.890920  76.011631   52.366311  194.023063
4  35.202379  41.973846   32.576890  100.523902

# my code
xdf[xdf[cols].ge(20).all(axis=1)]

Out[17]: 
           A          B         C           D
4  35.202379  41.973846  32.57689  100.523902

Expected output: drop a row if any column has below 20 value

xdf =  
           A          B           C           D
4  35.202379  41.973846   32.576890  100.523902 

Is this the best way of doing it?

Upvotes: 2

Views: 94

Answers (3)

Derek O
Derek O

Reputation: 19610

You can use the numpy equivalent of .ge instead:

xdf.loc[np.greater(xdf,20).all(axis=1)]

Upvotes: 1

Z Li
Z Li

Reputation: 4318

To do it in numpy:

xdf = pd.DataFrame({'A':np.random.uniform(low=-50, high=53.3, size=(5)),'B':np.random.uniform(low=10, high=130, size=(5)),'C':np.random.uniform(low=-50, high=130, size=(5)),'D':np.random.uniform(low=-100, high=200, size=(5))})

%timeit xdf[xdf[['A','B','C','D']].ge(20).all(axis=1)]
%timeit xdf[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]
705 µs ± 277 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
460 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you do not want to keep result in DataFrame this can even be faster:

xdf.values[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]

Upvotes: 1

Daniel Seger
Daniel Seger

Reputation: 86

As numpy is lighter and therefore faster in terms of calculations with numbers, try this:

a = np.array([np.random.uniform(low=-50, high=53.3, size=(5)),
    np.random.uniform(low=10, high=130, size=(5)),
    np.random.uniform(low=-50, high=130, size=(5)),
    np.random.uniform(low=-100, high=200, size=(5))])

print(a[np.all(a > 20, axis=1)])

If you want to stick with pandas, another idea would be:

xdfFiltered = xdf.loc[(xdf["A"] > 20) & (xdf["B"] > 20) & (xdf["C"] > 20) & (xdf["D"] > 20)]

Upvotes: 1

Related Questions