Exclude all rows with values over 100 except for one column

Question

I'm working with the open food facts dataset which is very very messy. It has lots of columns with the amount (in grams) of some of the respective ingredient (like sugar) per 100g. According to the overview (.describe()) the max for most of the column is more than 100, which is obviously impossible. This is true except for the energy column since energy is not measured in grams.

The min for most columns is also sometimes negative. However, this was simple to fix with: data = data[data >= 0]

I want to know how I can do the same with respect to eliminating all rows with entries over 100 except for the energy column.

I tried this: data = data[data.drop(columns=['energy_100g']) <= 100]

But is also got rid of all entries in the energy column.

Could someone help me out?

I would really appreciate it!

jpp · Accepted Answer

Assuming you wish to drop rows where any column except for energy_100g has a value > 100. In this case, you can use pd.DataFrame.any to create a Boolean mask:

data = data[~(data[data.columns.difference(['energy_100g'])] > 100).any(1)]

Equivalently, you can use all without negation and the reverse comparison:

data = data[(data[data.columns.difference(['energy_100g'])] <= 100).all(1)]

In each case, we use pd.Index.difference to exclude the specified column for our comparisons.

Exclude all rows with values over 100 except for one column

Answers (1)

Related Questions