Reputation: 63
I have a large data frame with 85 columns. The missing data has been coded as NaN. My goal is to get the number of missing values in each column, so I wrote a for loop to build a list of those counts, but it does not work.
This is my code:
headers = x.columns.values.tolist()
nans = []
for head in headers:
    nans_col = x[x.head == 'NaN'].shape[0]
    nan.append(nans_col)
When I take the code inside the loop and replace head with an actual column name, it works and gives me the amount of missing data in that column. So I do not know how to correct the for loop. Could somebody help me with this? I highly appreciate your help.
Upvotes: 5
Views: 11915
Reputation: 19648
Just use DataFrame.info(); the non-null count is probably what you want, and more.
>>> pd.DataFrame({'a':[1,2], 'b':[None, None], 'c':[3, None]}) \
...     .info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       2 non-null      int64
 1   b       0 non-null      object
 2   c       1 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 176.0+ bytes
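Note that null_counts has been deprecated in favour of show_counts in newer pandas releases; if the call above complains about the null_counts argument, a minimal equivalent (assuming roughly pandas 1.2 or later) is:
>>> pd.DataFrame({'a':[1,2], 'b':[None, None], 'c':[3, None]}) \
...     .info(verbose=True, show_counts=True)   # show_counts replaces null_counts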
Upvotes: 1
Reputation: 21
# function to show the total null values per column
colum_name = np.array(data.columns.values)

def iter_columns_name(colum_name):
    for k in colum_name:
        print("total nulls {}=".format(k), pd.isnull(data[k]).values.ravel().sum())

# call the function
iter_columns_name(colum_name)
# output
total nulls start_date= 0
total nulls end_date= 0
total nulls created_on= 0
total nulls lat= 9925
.
.
.
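For context, a minimal end-to-end sketch of the same approach, using a small made-up DataFrame named data (the column names in the output above, such as start_date and lat, come from the answerer's own data set):
import numpy as np
import pandas as pd

# hypothetical sample data, only to exercise iter_columns_name above
data = pd.DataFrame({'start_date': ['2020-01-01', '2020-02-01'],
                     'lat': [51.5, np.nan]})

colum_name = np.array(data.columns.values)
iter_columns_name(colum_name)
# total nulls start_date= 0
# total nulls lat= 1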
Upvotes: 0
Reputation: 11
This prints, for each column, a count of the missing values: value_counts shows True with the number of missing entries (and False with the number of present ones).
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
Upvotes: 1
Reputation: 47
If you have multiple dataframes, below is a function that calculates the number of missing values in each column, along with the percentage:
def miss_data(df):
    x = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum() / df[col].shape[0]) * 100
        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)
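A minimal sketch of calling it, assuming pandas and numpy are already imported and using a made-up two-column DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan], 'b': [1, 2, 3, 4]})
miss_data(df)
#   column_name missing_data missing_in_percentage
# 0           a            2                  50.0
# 1           b            0                   0.0
# (exact column spacing depends on the pandas version)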
Upvotes: 0
Reputation: 990
For the columns of a pandas (Python data analysis library) DataFrame you can use:
In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
In [6]: df.isnull().sum()
Out[6]:
a 1
b 2
dtype: int64
For a single column or Series you can count the missing values as shown below:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([1,2,3, np.nan, np.nan])
In [4]: s.isnull().sum()
Out[4]: 2
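If you also want each column's missing data as a percentage, a common follow-up (sketched here as a continuation of the first snippet above, where df was defined) is:
In [7]: df.isnull().mean() * 100
Out[7]:
a    33.333333
b    66.666667
dtype: float64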
Upvotes: 10