hlin117

Reputation: 22250

How to check if any value is NaN in a Pandas DataFrame

How do I check whether a pandas DataFrame has NaN values?

I know about pd.isnan but it returns a DataFrame of booleans. I also found this post but it doesn't exactly answer my question either.

Upvotes: 742

Views: 1619565

Answers (30)

hobs

Reputation: 19259

If you need to know how many rows there are with "one or more NaNs":

df.isnull().T.any().sum()

Or if you need to pull out these rows and examine them:

nan_rows = df[df.isnull().T.any()]

Upvotes: 69

Harivignesh

Reputation: 11

You can't find NaN values in pandas with comparison operators. Comparing against np.nan or None never matches the NaN values in the data, because NaN is not equal to anything, including itself, even though its type in the data is np.float64. Use pd.isna() to detect it instead.

import pandas as pd

# Count NaN values by testing every element of every column.
count = 0
for i in data.columns:
    for j in data[i]:
        if pd.isna(j):
            count += 1
print(count)

Hope it helps!
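
To illustrate why equality checks miss NaN, here is a minimal sketch. The sample frame and variable names are made up for the demonstration; only `data` mirrors the name used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values.
data = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 2.0]})

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# An equality test against np.nan therefore matches nothing...
found_by_eq = int((data == np.nan).sum().sum())

# ...while isna() finds both missing values.
found_by_isna = int(data.isna().sum().sum())

print(nan_equals_itself, found_by_eq, found_by_isna)
```

The vectorized `data.isna().sum().sum()` gives the same count as the nested loop, without iterating in Python.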

Upvotes: 0

Adarsh singh

Reputation: 135

df.isnull().sum()

This will return the count of all NaN values present in the respective columns of the DataFrame.

Upvotes: 6

cottontail

Reputation: 23011

Given the following dataframe:

     A  B    C
0  1.0  a  NaN
1  2.0  b  4.0
2  NaN  c  5.0
  1. Check if there are any NaN values:
    df.isna().any(axis=None)              # True
    df.isna().to_numpy().any()            # True
    df.ne(df).any(axis=None)              # True
    (df!=df).any(axis=None)               # True
    df.eval("A!=A or B!=B or C!=C").any() # True
    
  2. Column labels with NaN values:
    df.isna().any().pipe(lambda x: x.index[x])        
    
    Index(['A', 'C'], dtype='object')
    
  3. Index labels with NaN values:
    df.isna().any(axis=1).pipe(lambda x: x.index[x])
    
    Index([0, 2], dtype='int64')
    
  4. Columns with NaN values:
    df.loc[:, df.isna().any()]
    
         A    C
    0  1.0  NaN
    1  2.0  4.0
    2  NaN  5.0
    
  5. Rows with NaN values:
    df[df.isna().any(axis=1)]
    
         A  B    C
    0  1.0  a  NaN
    2  NaN  c  5.0
    

Upvotes: 1

Jaya Raghavendra

Reputation: 1577

This code makes your life easy:

import sidetable

df.stb.missing()

Check this out : https://github.com/chris1610/sidetable


Upvotes: 0

Brndn

Reputation: 846

This will only include columns with at least 1 null/na value.

 df.isnull().sum()[df.isnull().sum()>0]

Upvotes: 4

FAISAL BARGI

Reputation: 30

Bar representation for missing values

import missingno
missingno.bar(df)  # bar chart showing how many values are present (and so how many are missing) in each column

Upvotes: 0

Daniel Malachov

Reputation: 1842

I recommend using the values attribute, since evaluating on the underlying array is much faster.

arr = np.random.randn(100, 100)
arr[40, 40] = np.nan
df = pd.DataFrame(arr)

%timeit np.isnan(df.values).any()  # 7.56 µs
%timeit np.isnan(df).any()         # 627 µs
%timeit df.isna().any(axis=None)   # 572 µs

Result:

7.56 µs ± 447 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
627 µs ± 40.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
572 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note: %timeit is an IPython magic, so run this in a Jupyter notebook or an IPython shell.

Upvotes: 4

U13-Forward

Reputation: 71560

Another way is to dropna and check if the lengths are equivalent:

>>> len(df.dropna()) != len(df)
True
>>> 
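
A small self-contained sketch of the same check, using two made-up frames and a hypothetical helper name (`has_nan_via_dropna`):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one frame with a NaN, one without.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})
clean = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})


def has_nan_via_dropna(frame):
    # dropna() removes every row that contains at least one NaN,
    # so a shorter result means at least one NaN was present.
    return len(frame.dropna()) != len(frame)


print(has_nan_via_dropna(df), has_nan_via_dropna(clean))
```

Note that this builds a full copy of the frame without the incomplete rows, so it is likely to cost more than a plain boolean reduction on large data.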

Upvotes: 2

Mohamed Othman

Reputation: 99

try the following

df.isnull().sum()

or

df.isna().values.any()

Upvotes: 8

Pobaranchuk

Reputation: 877

To do this we can use the statement df.isna().any(). This checks each column and returns a boolean Series: True for a column that contains any missing values or NaNs, False for a column that contains none.
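
A minimal sketch of the per-column result, and of reducing it to a single answer. The sample frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column A has a NaN, column B does not.
df = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})

# One boolean per column.
per_column = df.isna().any()

# Reduce once more to get a single True/False for the whole frame.
anywhere = bool(per_column.any())

print(per_column)
print(anywhere)
```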

Upvotes: 3

Naveen Reddy Marthala

Reputation: 3123

Let df be the name of the pandas DataFrame, and treat any value that is numpy.nan as a null value.

  1. If you want to see which columns have nulls and which do not (just True and False):

    df.isnull().any()
    
  2. If you want to see only the columns that have nulls:

    df.loc[:, df.isnull().any()].columns
    
  3. If you want to see the count of nulls in every column:

    df.isna().sum()
    
  4. If you want to see the percentage of nulls in every column:

    df.isna().sum() / len(df) * 100
    
  5. If you want to see the percentage of nulls only in the columns that have nulls:

    df.loc[:, df.isnull().any()].isnull().sum() / len(df) * 100

EDIT 1:

If you want to see where your data is missing visually:

import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])

Upvotes: 22

S Anand

Reputation: 11938

jwilner's response is spot on. I was exploring to see if there's a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

df.isnull().values.any()


import numpy as np
import pandas as pd
import perfplot


def setup(n):
    df = pd.DataFrame(np.random.randn(n))
    df[df > 0.9] = np.nan
    return df


def isnull_any(df):
    return df.isnull().any()


def isnull_values_sum(df):
    return df.isnull().values.sum() > 0


def isnull_sum(df):
    return df.isnull().sum() > 0


def isnull_values_any(df):
    return df.isnull().values.any()


perfplot.save(
    "out.png",
    setup=setup,
    kernels=[isnull_any, isnull_values_sum, isnull_sum, isnull_values_any],
    n_range=[2 ** k for k in range(25)],
)

df.isnull().sum().sum() is a bit slower, but of course, has additional information -- the number of NaNs.

Upvotes: 878

Aditya

Reputation: 458

We can see the null values present in the dataset by generating a heatmap using the seaborn module:

import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)

Upvotes: 4

Nizam

Reputation: 400

You can not only check whether any NaN exists but also get the fraction of NaNs in each column (multiply by 100 for a percentage) using the following:

df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})
df

   col1  col2
0     1   6.0
1     2   NaN
2     3   8.0
3     4   9.0
4     5  10.0


df.isnull().sum()/len(df)  
col1    0.0  
col2    0.2  
dtype: float64

Upvotes: 0

Ikbel

Reputation: 2203

import missingno as msno
msno.matrix(df)  # just to visualize. no missing value.


Upvotes: 2

prosti

Reputation: 46291

The best would be to use:

df.isna().any().any()

Here is why: isnull() is defined as an alias of isna(), so the two are identical.

This is even faster than the accepted answer and works for any 2D pandas DataFrame.
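
As a quick check of that equivalence, here is a sketch on a made-up frame. It compares the results of the two methods rather than inspecting the pandas source:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a float NaN and a None.
df = pd.DataFrame({"A": [1.0, np.nan], "B": ["x", None]})

# Both methods produce the same boolean mask.
same_mask = df.isna().equals(df.isnull())

# Column-wise any(), then any() over the resulting Series.
any_nan = bool(df.isna().any().any())

print(same_mask, any_nan)
```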

Upvotes: 3

cs95

Reputation: 402253

Super Simple Syntax: df.isna().any(axis=None)

Starting from v0.23.2, you can use DataFrame.isna + DataFrame.any(axis=None) where axis=None specifies logical reduction over the entire DataFrame.

# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
     A    B
0  1.0  NaN
1  2.0  4.0
2  NaN  5.0

df.isna()

       A      B
0  False   True
1  False  False
2   True  False

df.isna().any(axis=None)
# True

Useful Alternatives

numpy.isnan
Another performant option if you're running older versions of pandas.

np.isnan(df.values)

array([[False,  True],
       [False, False],
       [ True, False]])

np.isnan(df.values).any()
# True

Alternatively, check the sum:

np.isnan(df.values).sum()
# 2

np.isnan(df.values).sum() > 0
# True

Series.hasnans
You can also iteratively call Series.hasnans. For example, to check if a single column has NaNs,

df['A'].hasnans
# True

And to check if any column has NaNs, you can use a comprehension with any (which is a short-circuiting operation).

any(df[c].hasnans for c in df)
# True

This is actually very fast.

Upvotes: 38

Peter Thomas

Reputation: 81

I've been using the following and type casting it to a string and checking for the nan value

   (str(df.at[index, 'column']) == 'nan')

This allows me to check specific value in a series and not just return if this is contained somewhere within the series.
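
For comparison, here is a sketch that checks a single cell both ways. The frame and the `index` value are made up; `pd.isna` is the pandas scalar check, offered here as an alternative to the string cast rather than as the author's method:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; row 1 holds the NaN.
df = pd.DataFrame({"column": [1.5, np.nan, 3.0]})
index = 1

# String comparison, as described above.
via_string = str(df.at[index, "column"]) == "nan"

# pd.isna works on scalars and avoids the string conversion.
via_isna = bool(pd.isna(df.at[index, "column"]))

print(via_string, via_isna)
```

One difference to keep in mind: the string cast only matches float NaN, since `str(None)` is `'None'` and `str(pd.NaT)` is `'NaT'`, whereas `pd.isna` treats all of these as missing.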

Upvotes: 8

Alex Dlikman

Reputation: 61

df.apply(axis=0, func=lambda x : any(pd.isnull(x)))

This checks, for each column, whether it contains NaN.
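
A short sketch on made-up data suggesting that the apply-based check gives the same per-column answer as the built-in reduction:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: only column A contains a NaN.
df = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})

# Column-by-column check through apply.
via_apply = df.apply(lambda col: col.isna().any(), axis=0)

# Built-in equivalent.
via_builtin = df.isna().any()

print(via_apply.tolist(), via_builtin.tolist())
```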

Upvotes: 1

Jagannath Banerjee

Reputation: 2141

Here is another interesting way of finding nulls and replacing them with a calculated value:

    # Creating the DataFrame
    >>> testdf2 = pd.DataFrame({'Monthly':[10,20,30,40,50],'Tenure':[1,2,3,4,5],'Yearly':[10,40,np.nan,np.nan,250]})
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3     NaN
    3       40       4     NaN
    4       50       5   250.0

    # Identifying the rows with a missing 'Yearly' value
    >>> nan_rows = testdf2[testdf2['Yearly'].isnull()]
    >>> nan_rows
       Monthly  Tenure  Yearly
    2       30       3     NaN
    3       40       4     NaN

    # Getting the row labels into a list
    >>> index = list(nan_rows.index)
    >>> index
    [2, 3]

    # Replacing null values with a calculated value
    # (.loc avoids chained assignment, which may not write back to testdf2)
    >>> for i in index:
    ...     testdf2.loc[i, 'Yearly'] = testdf2.loc[i, 'Monthly'] * testdf2.loc[i, 'Tenure']
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3    90.0
    3       40       4   160.0
    4       50       5   250.0

Upvotes: 4

Jan Sila

Reputation: 1593

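Or you can use .info() on the DataFrame:

df.info(null_counts=True)

This prints the number of non-null rows in each column (in pandas 1.2 and later the argument is named show_counts; null_counts was removed in pandas 2.0), for example: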

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches                          3276314 non-null int64
avg_pic_distance                   3276314 non-null float64

Upvotes: 2

Andy

Reputation: 50540

You have a couple of options.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan

Now the data frame looks something like this:

          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
  • Option 1: df.isnull().any().any() - This returns a boolean value

You know of the isnull() which would return a dataframe like this:

       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False

If you make it df.isnull().any(), you can find just the columns that have NaN values:

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

One more .any() will tell you if any of the above are True

> df.isnull().any().any()
True
  • Option 2: df.isnull().sum().sum() - This returns an integer of the total number of NaN values:

This operates the same way as the .any().any() does, by first giving a summation of the number of NaN values in a column, then the summation of those values:

df.isnull().sum()
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64

Finally, to get the total number of NaN values in the DataFrame:

df.isnull().sum().sum()
5

Upvotes: 245

Ihor Ivasiuk

Reputation: 1315

To find out which rows have NaNs in a specific column:

nan_rows = df[df['name column'].isnull()]
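
A runnable sketch of that filter, on a made-up frame whose column is literally named `'name column'` to match the snippet above:

```python
import pandas as pd

# Hypothetical sample data: the second row has no name.
df = pd.DataFrame({"name column": ["a", None, "c"], "value": [1, 2, 3]})

# Boolean mask over one column, used to select the matching rows.
nan_rows = df[df["name column"].isnull()]

print(nan_rows)
```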

Upvotes: 121

frankchen0130

Reputation: 547

Just use math.isnan(x), which returns True if x is a NaN (not a number) and False otherwise.
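
Since math.isnan works on one number at a time, applying it to a DataFrame means checking individual cells. A minimal sketch on a made-up numeric column:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical sample data: a single float column with one NaN.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# Check one cell.
scalar_is_nan = math.isnan(df.at[1, "A"])

# Check a whole column element by element; any() stops at the first NaN.
column_has_nan = any(math.isnan(v) for v in df["A"])

print(scalar_is_nan, column_has_nan)
```

math.isnan raises a TypeError for non-numeric values such as strings or None, so this only suits purely numeric columns.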

Upvotes: 5

chmodsss

Reputation: 736

Since no one has mentioned it, there is another attribute called hasnans.

df[i].hasnans evaluates to True if one or more of the values in the pandas Series is NaN, and False otherwise. Note that it is an attribute, not a function.

pandas version '0.19.2' and '0.20.2'

Upvotes: 11

Ankit

Reputation: 341

Adding to hobs' brilliant answer (I am very new to Python and pandas, so please point out if I am wrong).

To find out which rows have NaNs:

nan_rows = df[df.isnull().any(axis=1)]

This performs the same operation without transposing, by specifying the axis of any() as 1 to check whether True is present in each row. Pass axis as a keyword, since recent pandas versions no longer accept it positionally.
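
A small sketch on made-up data showing that the row-wise reduction matches the transposed version:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 1 and 2 each contain a NaN.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# Reduce across columns for each row.
by_axis = df.isnull().any(axis=1)

# Transpose first, then reduce down each (former) row.
by_transpose = df.isnull().T.any()

nan_rows = df[by_axis]

print(nan_rows)
```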

Upvotes: 24

Marshall Farrier

Reputation: 967

Since pandas has to find this out for DataFrame.dropna(), I took a look to see how they implement it and discovered that they made use of DataFrame.count(), which counts all non-null values in the DataFrame. Cf. pandas source code. I haven't benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.
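
A minimal sketch of the count-based idea on a made-up frame. This is an illustration of the approach, not a reproduction of what dropna() does internally:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column A has one NaN.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

# count() returns the number of non-null values per column.
non_null = df.count()

# Anything short of len(df) means that column has missing values.
missing_per_column = len(df) - non_null
has_nan = bool((non_null < len(df)).any())

print(missing_per_column)
print(has_nan)
```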

Upvotes: 8

unique_beast

Reputation: 1470

Depending on the type of data you're dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.

for col in df:
    print(df[col].value_counts(dropna=False))

Works well for categorical variables, not so much when you have many unique values.
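
To pull the NaN count out of the value counts programmatically, something like the following sketch could work. The `color` column is made up for illustration:

```python
import pandas as pd

# Hypothetical categorical column with one missing entry.
df = pd.DataFrame({"color": ["red", "blue", None, "red"]})

# dropna=False keeps the missing value as its own bucket.
counts = df["color"].value_counts(dropna=False)

# Select the bucket whose label is missing and read off its count.
nan_count = int(counts[counts.index.isna()].sum())

print(counts)
print(nan_count)
```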

Upvotes: -1

jwilner

Reputation: 6596

df.isnull().any().any() should do it.

Upvotes: 61
