ASH
ASH

Reputation: 20302

How can we numerically filter a dataframe based on multiple conditions if there are strings in one of the columns?

I am trying to filter a data frame based on two conditions.

  1. filter items where the field named 'Count' > 30 ...and...
  2. filter items where the field named 'LAND SQUARE FEET' < 5000

Here is the code that I tried, and I got errors, or I wouldn't be posting here.

df.loc[(df['Count']>=30) & (df['LAND SQUARE FEET']< 5000)['Count','LAND SQUARE FEET']]

df[df.eval("Count>=30 & (LAND SQUARE FEET <5000).values")]

How can I get this to work?

Upvotes: 0

Views: 73

Answers (2)

David Erickson
David Erickson

Reputation: 16683

Specifically, the error(see comments of other answer) is now saying there is a - in one of the values in your column, You can use my solution below OR you can do df["LAND SQUARE FEET"] = df["LAND SQUARE FEET"] =.replace('-','').astype(int) However, there may be other strings that you need to replace, that might mean you keep seeing errors if there are more strings, other than -, for example , or ft.. Also, that line with - might be bad data altogether as I'm not sure why a - would be in a number that is supposed to be an integer of square feet.

Also, you can look at that line specifically with df[df["LAND SQUARE FEET"].str.contains('-') and from there decide what you want to do with it -- either manipulate it's data with replace or make it NaN with pd.to_numeric()


Solution with pd.to_numeric():

You need to use pd.to_numeric() first as you have strings in your column and passing errors='coerce' changes to NaN for those values that are strings. The data type of the column should now be a float if you call df.info():

Step 1:

df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
                   "Count" : [45,55,65,75,15,25]})
df
Out[1]: 
  LAND SQUARE FEET  Count
0             4500     45
1       4,400 feet     55
2            4,600     65
3             4700     75
4             5500     15
5             6000     25

Step 2:

df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
                   "Count" : [45,55,65,75,15,25]})
df["LAND SQUARE FEET"] = pd.to_numeric(df["LAND SQUARE FEET"], errors='coerce')
df
Out[2]: 
   LAND SQUARE FEET  Count
0            4500.0     45
1               NaN     55
2               NaN     65
3            4700.0     75
4            5500.0     15
5            6000.0     25

Step 3 (and final output):

df = pd.DataFrame({"LAND SQUARE FEET" : [4500, '4,400 feet', '4,600', 4700, 5500, 6000],
                   "Count" : [45,55,65,75,15,25]})
df["LAND SQUARE FEET"] = pd.to_numeric(df["LAND SQUARE FEET"], errors='coerce')
new_df = df.loc[(df['Count']>=30) & (df['LAND SQUARE FEET']< 5000),['Count','LAND SQUARE FEET']]
new_df
Out[3]: 
   Count  LAND SQUARE FEET
0     45            4500.0
3     75            4700.0

Upvotes: 1

kev1n
kev1n

Reputation: 400

df[(df["Count"] >= 30) & (df["LAND SQUARE FEET"] < 5000)]

Upvotes: 1

Related Questions