ha9u63a7

Reputation: 6824

Pandas - How to identify `nan` values in a Series

I am currently playing with the Kaggle Titanic dataset (train.csv).

  1. I can load the data fine.
  2. I understand that some values in the Embarked column are nan. But when I try to filter them using the following code, I get an empty result:
    import pandas as pd
    df = pd.read_csv(<file_loc>, header=0)
    df[df.Embarked == 'nan']

I tried importing numpy.nan to replace the string 'nan' above, but that doesn't work either.

What I am trying to find is all the cells which are not 'S', 'C' or 'Q'.

I also realised later, using type(df.Embarked.unique()[-1]), that the nan is a float. Could someone help me understand how to identify those nan cells?

Upvotes: 4

Views: 7512

Answers (2)

Glauco

Reputation: 1465

Starting with v1.5, pandas introduced some changes related to this topic.

As mentioned in the official documentation, the pandas missing-value placeholder should be pd.NA, but there are some corner cases in which np.nan creates problems.

> A = pd.Series([0.1, 0, None], dtype="Float32")
> A

0     0.1
1     0.0
2    <NA>
dtype: Float32

You cannot assign nan to it; it will be translated into <NA>:

> A[0] = np.nan
> A

0    <NA>
1     0.0
2    <NA>
dtype: Float32

The .isna() method, as explained before, is supposed to find all NA values:

> A.isna()

0     True
1    False
2     True
dtype: bool

But pandas can sometimes introduce np.nan when a computation is undefined (here, 0.0 / 0.0 in row 1):

> A/A
0    <NA>
1     NaN
2    <NA>
dtype: Float32

And .isna() does not detect it:

> (A/A).isna()

0     True
1    False
2     True
dtype: bool

But you can rely on the different equality behavior of the two:

> pd.NA == pd.NA
<NA>

> np.nan == np.nan
False
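
If you need to flag both <NA> and NaN in a nullable float Series, one option is to convert it to a plain NumPy float array first. This is only a minimal sketch: the variable names are illustrative, and whether the division yields NaN or <NA> may depend on your pandas version, which is why the conversion maps <NA> to nan before testing.

```python
import numpy as np
import pandas as pd

# Same kind of nullable float Series as above (names are illustrative).
s = pd.Series([0.1, 0.0, None], dtype="Float64")
ratio = s / s  # 0.0 / 0.0 is undefined; None stays missing

# Convert to a plain NumPy float array, mapping <NA> to nan,
# so that np.isnan flags both kinds of missing value.
as_float = ratio.to_numpy(dtype="float64", na_value=np.nan)
missing_mask = pd.Series(np.isnan(as_float), index=ratio.index)

print(missing_mask)
```

The resulting boolean Series can be used to index the original, for example ratio[missing_mask].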

Upvotes: 0

Cimbali

Reputation: 11395

NaN is used to represent missing values.

  • To find them, use .isna()

    Detect missing values.

  • To replace them, use .fillna(value)

    Fill NA/NaN values

Some examples on a series called col:

>>> col
0    1.0
1    NaN
2    2.0
dtype: float64
>>> col[col.isna()]
1   NaN
dtype: float64
>>> col.index[col.isna()]
Int64Index([1], dtype='int64')
>>> col.fillna(-1)
0    1.0
1   -1.0
2    2.0
dtype: float64

Note that you can’t compare equality with nan as by definition it’s not equal to anything, not even itself:

>>> np.nan == np.nan
False

This is likely the property that is used to identify nan under the hood:

>>> col != col
0    False
1     True
2    False
dtype: bool

But it’s better (more readable) to use the pandas functions than to test for inequality yourself.
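
Applied to the question, this could look roughly like the sketch below. The small DataFrame is a hypothetical stand-in for train.csv (only the Embarked column name comes from the question), so adapt it to your loaded data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic train.csv; only the columns needed here.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Embarked": ["S", np.nan, "C", "Q"],
})

# Rows where Embarked is missing.
missing = df[df["Embarked"].isna()]

# Rows where Embarked is anything other than the known ports
# (this also catches NaN, since NaN is not in the list).
known_ports = ["S", "C", "Q"]
unexpected = df[~df["Embarked"].isin(known_ports)]

print(missing)
print(unexpected)
```

The second filter is the more general one: it also catches typos or unexpected codes, not only missing values.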

Upvotes: 4
