fangh

Reputation: 371

Can't drop NaN with dropna in pandas

I import pandas as pd, run the code below, and get the following result:

Code:

traindataset = pd.read_csv('/Users/train.csv')
print traindataset.dtypes
print traindataset.shape
print traindataset.iloc[25,3]
traindataset.dropna(how='any')
print traindataset.iloc[25,3]
print traindataset.shape

Output

TripType                   int64  
VisitNumber                int64  
Weekday                   object  
Upc                      float64  
ScanCount                  int64  
DepartmentDescription     object  
FinelineNumber           float64  
dtype: object

(647054, 7)

nan  
nan

(647054, 7) 
[Finished in 2.2s]

From the result, the dropna line doesn't seem to work: the row count doesn't change and there are still NaN values in the dataframe. How can that be? It's driving me crazy right now.

Upvotes: 23

Views: 85668

Answers (6)

darshita

Reputation: 11

import math
import pandas as pd

def is_none_nan_or_blank(var):
    # Plain Python None
    if var is None:
        return True

    # Float NaN (what pandas uses for missing numeric values)
    if isinstance(var, float) and math.isnan(var):
        return True

    # Empty or whitespace-only string
    if isinstance(var, str) and var.strip() == '':
        return True

    # Anything else pandas treats as missing (NaT, pd.NA, ...)
    if pd.isna(var):
        return True

    return False

Check each value with this function: if it returns True, drop it; otherwise process it.

It will cover all kinds of None, NaN, and blank values in dataframes.
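As a rough sketch of how you might use it (the dataframe and column names here are made up for illustration), you could map the helper over every cell and drop the rows where it returns True:

import pandas as pd

# Hypothetical example dataframe for illustration only
df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': ['x', 'y', '   ']})

# Mark rows that contain any None/NaN/blank cell, then keep the others
mask = df.applymap(is_none_nan_or_blank).any(axis=1)
cleaned = df[~mask]
print(cleaned)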

Upvotes: 1

user86460

Reputation: 11

It looks like the NaN has some trailing or leading whitespace characters, which is causing the issue.

After removing those whitespace characters, df.dropna() does remove the lines with NaN in them.
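If that is the problem, one possible sketch (assuming the file and columns from the question) is to convert empty or whitespace-only strings into real NaN before calling dropna, for example with a regex replace:

import numpy as np
import pandas as pd

df = pd.read_csv('/Users/train.csv')

# Turn empty or whitespace-only strings into real NaN so dropna can see them
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna(how='any')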

Upvotes: 1

Robert Forderer

Reputation: 91

This is my first post. I just spent a few hours debugging this exact issue and I would like to share how I fixed this issue.

I was converting my entire dataframe to a string and then placing the values back into the dataframe, using code similar to what is shown below (please note, the code below only converts each value to a string):

row_counter = 0
for ind, row in dataf.iterrows():
    # Convert each cell to a string and write it back into the dataframe
    cell_value = str(row['column_header'])
    dataf.loc[row_counter, 'column_header'] = cell_value
    row_counter += 1

After converting the entire dataframe to a string, I then used the dropna() function. The values that were previously NaN (considered a null value by pandas) were converted to the string 'nan'.

In conclusion, drop blank values FIRST, before you start manipulating data in the CSV and converting its data type.
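In other words, something along these lines (a sketch reusing the hypothetical column_header column from the code above; the file name is illustrative):

import pandas as pd

dataf = pd.read_csv('data.csv')

# 1. Drop missing values while pandas still recognises them as NaN
dataf = dataf.dropna(how='any')

# 2. Only then convert the column to string; there is no real NaN left
#    to be silently turned into the literal string 'nan'
dataf['column_header'] = dataf['column_header'].astype(str)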

Upvotes: 7

jpp

Reputation: 164663

pd.DataFrame.dropna uses inplace=False by default. This is the norm with most Pandas operations; exceptions do exist, e.g. update.

Therefore, you must either assign back to your variable, or state explicitly inplace=True:

df = df.dropna(how='any')           # assign back
df.dropna(how='any', inplace=True)  # set inplace parameter

Stylistically, the former is often preferred, as it supports method chaining, and the latter rarely yields any significant performance benefit.
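For example, assigning back lets you keep chaining further operations (a sketch based on the question's code; the reset_index step is just an illustration):

import pandas as pd

traindataset = (
    pd.read_csv('/Users/train.csv')
      .dropna(how='any')
      .reset_index(drop=True)
)
print(traindataset.shape)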

Upvotes: 20

BrenBarn

Reputation: 251383

You need to read the documentation (emphasis added):

Return object with labels on given axis omitted

dropna returns a new DataFrame. If you want it to modify the existing DataFrame, all you have to do is read further in the documentation:

inplace : boolean, default False

If True, do operation inplace and return None.

So to modify it in place, do traindataset.dropna(how='any', inplace=True).

Upvotes: 34

Himanshi Dixit

Reputation: 101

Alternatively, you can use the notnull() method to select the rows that are not null.

For example, if you want to select non-null values from the columns country and variety of the dataframe reviews:

answer=reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]

But here we are just selecting the relevant data; to remove null values you should use the dropna() method.
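The equivalent with dropna would be to restrict the check to those two columns via the subset parameter:

answer = reviews.dropna(subset=['country', 'variety'])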

Upvotes: 7
