removing NaN values in python pandas

Question

The data is adult income from census data; rows look like:

31, Private, 84154, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 38, NaN, >50K
48, Self-emp-not-inc, 265477, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K

I'm trying to remove all rows with NaNs from a DataFrame loaded from a CSV file in pandas.

>>> import pandas as pd
>>> income = pd.read_csv('income.data')
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object)
>>> income.dropna(how='any') # should drop all rows with NaNs
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object) # what??
>>> income = income.dropna(how='any') # ok, maybe reassignment will work?
>>> income['type'].unique()
array([ State-gov,  Self-emp-not-inc,  Private,  Federal-gov,  Local-gov,
    NaN,  Self-emp-inc,  Without-pay,  Never-worked], dtype=object) # what??

I tried with a smaller example.csv:

label,age,sex
1,43,M
-1,NaN,F
1,65,NaN

And dropna() worked just fine here for both categorical and numerical NaNs. What is going on? I'm new to Pandas, just learning the ropes.

dorvak · Accepted Answer

As I wrote in the comment: the "NaN" values have a leading space (at least in the data you provided), so pandas reads them as ordinary strings rather than as missing values. You therefore need to specify the na_values parameter in the read_csv function.

Try this one:

df = pd.read_csv("income.csv", header=None, na_values=" NaN")

This is also why your second example works: there is no leading whitespace there.
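
To see the effect in isolation, here is a minimal sketch on a made-up two-row sample (the column names and the in-memory string are assumptions for illustration, not the real income file). It compares the default parse with na_values=" NaN", and also shows skipinitialspace=True, an alternative read_csv option not mentioned above that strips the space after each delimiter so the default "NaN" marker is recognized.

```python
import io

import pandas as pd

# Hypothetical sample: note the space after each comma, as in the census rows.
raw = (
    "age,type,country\n"
    "31, Private, NaN\n"
    "48, Self-emp-not-inc, United-States\n"
)

# Default parsing: " NaN" (with the leading space) stays an ordinary string.
plain = pd.read_csv(io.StringIO(raw))

# Tell pandas that the space-prefixed token also marks a missing value.
fixed = pd.read_csv(io.StringIO(raw), na_values=" NaN")

# Alternative: strip the space after each delimiter while parsing.
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(plain["country"].isna().sum())     # missing values found by default
print(fixed["country"].isna().sum())     # missing values found with na_values
print(stripped["country"].isna().sum())  # missing values found after stripping
print(len(fixed.dropna()))               # rows left once dropna() can see the NaN
```

With skipinitialspace=True the other string columns also lose their leading space, which may or may not be what you want for the rest of the analysis.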


Answers (2)

Drop all rows with NaN values

Reset index after drop

Drop row that has all NaN values

Drop rows that has NaN values on selected columns
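
The snippets that belonged under these four headings are not present here, so the following is only a sketch of what each operation might look like, using DataFrame.dropna and reset_index on a made-up frame (the data and variable names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical frame, loosely based on the example.csv from the question.
df = pd.DataFrame({
    "label": [-1, 1, 1, np.nan],
    "age": [np.nan, 43, 65, np.nan],
    "sex": ["F", "M", np.nan, np.nan],
})

# Drop all rows with NaN values (how="any" is the default).
no_nan = df.dropna()

# Reset index after drop, discarding the old index labels.
reindexed = df.dropna().reset_index(drop=True)

# Drop only rows in which every value is NaN.
not_all_nan = df.dropna(how="all")

# Drop rows that have NaN in selected columns only.
age_known = df.dropna(subset=["age"])

print(no_nan.index.tolist())
print(reindexed.index.tolist())
print(not_all_nan.index.tolist())
print(age_known.index.tolist())
```

Note that dropna returns a new DataFrame each time; df itself is unchanged unless you reassign the result.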
