joshi123
joshi123

Reputation: 865

pandas redefine isnull to ignore 'NA'

As part of a data profiling exercise, I'm reading excel sheets into pandas dataframes.

df = pd.ExcelFile('file.xlsx').parse(0)
nullcounts = df.isnull().sum().to_frame('null_records')

Produces a nice frame with the null count for every series in my dataframe. But if the string 'NA' appears in a row of data, I don't want the isnull operation to return True.

Is there a simple way to do this without hard coding a rule for a specific column/dataframe?

Edit: It appears that the NAs in my source data are being ignored when being read into pandas, since when I load the data and compare visually I see NaN where in excel there was NA.

Upvotes: 2

Views: 2411

Answers (1)

jezrael
jezrael

Reputation: 862611

If use read_excel is possible define which values are converted to NaN with parameter keep_default_na and na_values:

df = pd.read_excel('file.xlsx')
print (df)
     a    b
0  NaN  NaN
1  3.0  6.0

nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
   null_records
a             1
b             1

df = pd.read_excel('file.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
    a    b
0  NA  NaN
1   3  6.0

nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
   null_records
a             0
b             1

Upvotes: 5

Related Questions