Reputation: 22250
How do I check whether a pandas DataFrame has NaN values?
I know about pd.isnan, but it returns a DataFrame of booleans. I also found this post, but it doesn't exactly answer my question either.
Upvotes: 742
Views: 1619565
Reputation: 19259
If you need to know how many rows there are with "one or more NaNs":
df.isnull().T.any().sum()
Or if you need to pull out these rows and examine them:
nan_rows = df[df.isnull().T.any()]
Upvotes: 69
Reputation: 11
You can't locate NaN values in pandas using comparison operators: neither np.nan nor None compares equal to the NaN values present in the data, even though the type of a NaN in the data is np.float64. NaN values in the data can instead be detected with the isna() function.
import pandas as pd

count = 0
for i in data.columns:       # iterate over column names
    for j in data[i]:        # iterate over the values in each column
        if pd.isna(j):       # True for NaN (and None)
            count += 1
print(count)
Hope it helps!
Upvotes: 0
Reputation: 135
df.isnull().sum()
This will return the count of NaN values in each column of the DataFrame.
Upvotes: 6
Reputation: 23011
Given the following dataframe:
     A  B    C
0  1.0  a  NaN
1  2.0  b  4.0
2  NaN  c  5.0
To check whether any value in the whole DataFrame is NaN (a single boolean), any of the following work:
df.isna().any(axis=None)               # True
df.isna().to_numpy().any()             # True
df.ne(df).any(axis=None)               # True  (NaN != NaN)
(df != df).any(axis=None)              # True
df.eval("A!=A or B!=B or C!=C").any()  # True
To get the columns that contain NaN:
df.isna().any().pipe(lambda x: x.index[x])
Index(['A', 'C'], dtype='object')
To get the row labels that contain NaN:
df.isna().any(axis=1).pipe(lambda x: x.index[x])
Index([0, 2], dtype='int64')
To select the columns with NaN:
df.loc[:, df.isna().any()]
     A    C
0  1.0  NaN
1  2.0  4.0
2  NaN  5.0
To select the rows with NaN:
df[df.isna().any(axis=1)]
     A  B    C
0  1.0  a  NaN
2  NaN  c  5.0
Upvotes: 1
Reputation: 1577
This code makes your life easy:
import sidetable
df.stb.missing()
Check this out: https://github.com/chris1610/sidetable
Upvotes: 0
Reputation: 846
This will only include columns with at least 1 null/na value.
df.isnull().sum()[df.isnull().sum()>0]
Upvotes: 4
Reputation: 30
Bar representation of missing values:
import missingno
missingno.bar(df)  # plots, per column, how many values are present vs. missing
Upvotes: 0
Reputation: 1842
I recommend using the values attribute, since evaluation on the underlying NumPy array is much faster:
import numpy as np
import pandas as pd

arr = np.random.randn(100, 100)
arr[40, 40] = np.nan
df = pd.DataFrame(arr)

%timeit np.isnan(df.values).any()  # 7.56 µs
%timeit np.isnan(df).any()         # 627 µs
%timeit df.isna().any(axis=None)   # 572 µs
Result:
7.56 µs ± 447 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
627 µs ± 40.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
572 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note: %timeit is an IPython magic, so you need to run this in an IPython shell or a Jupyter notebook.
Upvotes: 4
Reputation: 71560
Another way is to dropna and check if the lengths are equivalent:
>>> len(df.dropna()) != len(df)
True
Upvotes: 2
Reputation: 99
Try the following:
df.isnull().sum()
or
df.isna().values.any()
Upvotes: 8
Reputation: 877
To do this we can use df.isna().any(). This checks each of our columns and returns True for a column if it has any missing values or NaNs, or False if it has none; chain one more .any() if you want a single True/False for the whole DataFrame.
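A minimal sketch of that behaviour (the DataFrame and column names here are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})
df.isna().any()        # per-column result: A True, B False
df.isna().any().any()  # single boolean for the whole DataFrame: True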
Upvotes: 3
Reputation: 3123
Let df be the name of the Pandas DataFrame, and let any value that is numpy.nan be a null value.
If you want to see which columns have nulls and which do not (just True and False):
df.isnull().any()
If you want to see only the columns that have nulls:
df.loc[:, df.isnull().any()].columns
If you want to see the count of nulls in every column:
df.isna().sum()
If you want to see the percentage of nulls in every column:
df.isna().sum()/(len(df))*100
If you want to see the percentage of nulls in only the columns that have nulls:
df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100
EDIT 1:
If you want to see where your data is missing visually:
import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])
Upvotes: 22
Reputation: 11938
jwilner's response is spot on. I was exploring to see if there's a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
import numpy as np
import pandas as pd
import perfplot
def setup(n):
    df = pd.DataFrame(np.random.randn(n))
    df[df > 0.9] = np.nan
    return df

def isnull_any(df):
    return df.isnull().any()

def isnull_values_sum(df):
    return df.isnull().values.sum() > 0

def isnull_sum(df):
    return df.isnull().sum() > 0

def isnull_values_any(df):
    return df.isnull().values.any()

perfplot.save(
    "out.png",
    setup=setup,
    kernels=[isnull_any, isnull_values_sum, isnull_sum, isnull_values_any],
    n_range=[2 ** k for k in range(25)],
)
df.isnull().sum().sum() is a bit slower, but of course, it has additional information -- the number of NaNs.
Upvotes: 878
Reputation: 458
We can see the null values present in the dataset by generating a heatmap using the seaborn module:
import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)
Upvotes: 4
Reputation: 400
You can not only check whether any NaN exists, but also get the percentage of NaNs in each column, using the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [6, np.nan, 8, 9, 10]})
df
   col1  col2
0     1   6.0
1     2   NaN
2     3   8.0
3     4   9.0
4     5  10.0
df.isnull().sum()/len(df)
col1 0.0
col2 0.2
dtype: float64
Upvotes: 0
Reputation: 2203
import missingno as msno
msno.matrix(df)  # just to visualize which values are missing
Upvotes: 2
Reputation: 46291
The best would be to use:
df.isna().any().any()
Here is why: isna() is used to define isnull(), so of course the two are identical. This is even faster than the accepted answer and covers all 2D pandas DataFrames.
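A minimal sketch (the DataFrame here is made up for illustration, not taken from the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [3.0, 4.0]})
df.isna().any().any()  # True -> at least one NaN somewhere in the frame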
Upvotes: 3
Reputation: 402253
df.isna().any(axis=None)
Starting from v0.23.2, you can use DataFrame.isna + DataFrame.any(axis=None), where axis=None specifies logical reduction over the entire DataFrame.
# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
A B
0 1.0 NaN
1 2.0 4.0
2 NaN 5.0
df.isna()
A B
0 False True
1 False False
2 True False
df.isna().any(axis=None)
# True
numpy.isnan
Another performant option if you're running older versions of pandas.
np.isnan(df.values)
array([[False, True],
[False, False],
[ True, False]])
np.isnan(df.values).any()
# True
Alternatively, check the sum:
np.isnan(df.values).sum()
# 2
np.isnan(df.values).sum() > 0
# True
Series.hasnans
You can also iteratively call Series.hasnans. For example, to check if a single column has NaNs:
df['A'].hasnans
# True
And to check if any column has NaNs, you can use a comprehension with any (which is a short-circuiting operation).
any(df[c].hasnans for c in df)
# True
This is actually very fast.
Upvotes: 38
Reputation: 81
I've been using the following: type-cast the value to a string and check it against the string 'nan'.
(str(df.at[index, 'column']) == 'nan')
This allows me to check a specific value in a Series, and not just return whether a NaN is contained somewhere within the Series.
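For illustration, a small sketch of that check on a made-up DataFrame (the row labels and column name are hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': [1.0, np.nan]})
str(df.at[0, 'column']) == 'nan'  # False -- the cell holds 1.0
str(df.at[1, 'column']) == 'nan'  # True  -- the cell holds NaN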
Upvotes: 8
Reputation: 61
df.apply(axis=0, func=lambda x : any(pd.isnull(x)))
This will check, for each column, whether it contains any NaN or not.
Upvotes: 1
Reputation: 2141
Here is another interesting way of finding nulls and replacing them with a calculated value:
#Creating the DataFrame
testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
>>> testdf2
   Monthly  Tenure  Yearly
0       10       1    10.0
1       20       2    40.0
2       30       3     NaN
3       40       4     NaN
4       50       5   250.0
#Identifying the rows with empty columns
nan_rows = testdf2[testdf2['Yearly'].isnull()]
>>> nan_rows
   Monthly  Tenure  Yearly
2       30       3     NaN
3       40       4     NaN
#Getting the rows# into a list
>>> index = list(nan_rows.index)
>>> index
[2, 3]
# Replacing null values with calculated value
>>> for i in index:
...     testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
>>> testdf2
   Monthly  Tenure  Yearly
0       10       1    10.0
1       20       2    40.0
2       30       3    90.0
3       40       4   160.0
4       50       5   250.0
Upvotes: 4
Reputation: 1593
Or you can use .info() on the DF, such as:
df.info(null_counts=True)
which returns the number of non-null rows in each column, such as:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches 3276314 non-null int64
avg_pic_distance 3276314 non-null float64
Upvotes: 2
Reputation: 50540
You have a couple of options.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan
Now the data frame looks something like this:
          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
df.isnull().any().any() - This returns a boolean value.
You know of isnull(), which would return a dataframe like this:
       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False
If you make it df.isnull().any(), you can find just the columns that have NaN values:
0 False
1 True
2 False
3 True
4 False
5 True
dtype: bool
One more .any() will tell you if any of the above are True:
> df.isnull().any().any()
True
df.isnull().sum().sum() - This returns an integer of the total number of NaN values.
This operates the same way as .any().any() does, by first giving a summation of the number of NaN values in a column, then the summation of those values:
df.isnull().sum()
0 0
1 2
2 0
3 1
4 0
5 2
dtype: int64
Finally, to get the total number of NaN values in the DataFrame:
df.isnull().sum().sum()
5
Upvotes: 245
Reputation: 1315
To find out which rows have NaNs in a specific column:
nan_rows = df[df['name column'].isnull()]
Upvotes: 121
Reputation: 547
Just use math.isnan(x): it returns True if x is a NaN (not a number), and False otherwise. Note that it only works on a single scalar value.
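A minimal sketch, applying it to a single cell (the DataFrame and labels are made up for illustration):
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan]})
math.isnan(df.at[0, 'A'])  # False
math.isnan(df.at[1, 'A'])  # True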
Upvotes: 5
Reputation: 736
Since no one has mentioned it, there is another attribute called hasnans.
df[i].hasnans
will output True if one or more of the values in the pandas Series is NaN, False if not. Note that it's not a function.
Tested on pandas versions 0.19.2 and 0.20.2.
Upvotes: 11
Reputation: 341
Adding to Hobs' brilliant answer, I am very new to Python and Pandas, so please point out if I am wrong.
To find out which rows have NaNs:
nan_rows = df[df.isnull().any(1)]
would perform the same operation without the need for transposing, by specifying the axis of any() as 1 to check whether True is present in any row.
Upvotes: 24
Reputation: 967
Since pandas has to find this out for DataFrame.dropna(), I took a look to see how they implement it and discovered that they made use of DataFrame.count(), which counts all non-null values in the DataFrame. Cf. the pandas source code. I haven't benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.
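A rough sketch of that idea (not the library's actual code): compare the per-column non-null counts from DataFrame.count() against the number of rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})
has_nan = (df.count() < len(df)).any()  # count() gives non-null values per column
has_nan  # True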
Upvotes: 8
Reputation: 1470
Depending on the type of data you're dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.
for col in df:
    print(df[col].value_counts(dropna=False))
Works well for categorical variables, not so much when you have many unique values.
Upvotes: -1