Reputation: 865
As part of a data profiling exercise, I'm reading excel sheets into pandas dataframes.
df = pd.ExcelFile('file.xlsx').parse(0)
nullcounts = df.isnull().sum().to_frame('null_records')
Produces a nice frame with the null count for every series in my dataframe. But if the string 'NA' appears in a row of data, I don't want the isnull
operation to return True
.
Is there a simple way to do this without hard coding a rule for a specific column/dataframe?
Edit: It appears that the NAs in my source data are being ignored when being read into pandas, since when I load the data and compare visually I see NaN
where in excel there was NA
.
Upvotes: 2
Views: 2411
Reputation: 862611
If use read_excel
is possible define which values are converted to NaN
with parameter keep_default_na
and na_values
:
df = pd.read_excel('file.xlsx')
print (df)
a b
0 NaN NaN
1 3.0 6.0
nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
null_records
a 1
b 1
df = pd.read_excel('file.xlsx',keep_default_na=False,na_values=['NaN'])
print (df)
a b
0 NA NaN
1 3 6.0
nullcounts = df.isnull().sum().to_frame('null_records')
print (nullcounts)
null_records
a 0
b 1
Upvotes: 5