Reputation: 21242
I am spinning my wheels on a function which is supposed to take a data frame, iterate through the columns and depending on the type of column, fillna with either 0's for numeric types, 'na' for string types, a default date for date types and False for bools.
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'a': [1, np.nan, 3, np.nan],
                        'b': ['abc', np.nan, 'def', 'hij'],
                        'c': [True, False, np.nan, np.nan],
                        'd': ['2020-01-01', '2020-01-02', np.nan, '2020-01-04']})
Looks like:
df
Out[96]:
a b c d
0 1.0 abc True 2020-01-01
1 NaN NaN False 2020-01-02
2 3.0 def NaN NaN
3 NaN hij NaN 2020-01-04
I attempted to create a function to fill nan values:
def handle_nans(df, d):
    """
    :param df: a dataframe
    :param d: current iterations run_date
    :return: a data frame with replacement of na values as either 0 for numeric fields, 'na' for text and False for bool
    """
    for f in df:
        if df[f].dtype == "int64":
            df[f] = df[f].fillna(0)
        elif df[[f]].columns == 'date':
            df[f] = df[f].fillna(d)
        elif df[[f]].columns == 'sampling':
            df[f] = df[f].fillna(False)
        else:
            df[f] = df[f].fillna(0)
    return df
Tried:
handle_nans(df, '2019-01-01')
Out[100]:
a b c d
0 1.0 abc True 2020-01-01
1 0.0 0 False 2020-01-02
2 3.0 def 0 0
3 0.0 hij 0 2020-01-04
So, this appears to have correctly converted the numeric columns to 0. But I wanted to change string types to 'na', date types to the default date and boolean types to True.
How can I do that?
Upvotes: 3
Views: 2476
Reputation: 17834
To identify data types you can apply the functions type or isinstance to each value. For example:
float datetime bool string
0 1.0 2018-03-10 True foo
1 NaN NaT NaN None
df.applymap(type)
Output:
float datetime \
0 <class 'float'> <class 'pandas._libs.tslibs.timestamps.Timesta...
1 <class 'float'> <class 'pandas._libs.tslibs.nattype.NaTType'>
bool string
0 <class 'bool'> <class 'str'>
1 <class 'float'> <class 'NoneType'>
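The isinstance route mentioned above works the same way; a minimal sketch on a cut-down two-column frame (this helper frame is illustrative, not part of the answer):

```python
import numpy as np
import pandas as pd

# One float column and one object column containing a string and None.
df = pd.DataFrame({'float': [1.0, np.nan],
                   'string': ['foo', None]})

# Elementwise isinstance check: True wherever the cell holds a real str.
is_str = df.applymap(lambda x: isinstance(x, str))
```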
You can use the following approach:
mapping = [(float, 0), (bool, False), (str, 'na'), (pd.Timestamp, pd.Timestamp('20200118'))]
df_type = df.applymap(type)
df_notna = df.notna()
for t, v in mapping:
    mask = ((df_type == t) & df_notna).any()
    for col in df.columns[mask]:
        df[col].fillna(v, inplace=True)
Output:
float datetime bool string
0 1.0 2018-03-10 True foo
1 0.0 2020-01-18 False na
Note that the datetime column must be parsed first (e.g. with pd.to_datetime); otherwise its values are identified as strings.
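Putting the pieces together, a self-contained version of the approach above (the sample frame and the fill date '20200118' follow the example output shown earlier):

```python
import numpy as np
import pandas as pd

# Sample frame matching the example output above.
df = pd.DataFrame({'float': [1.0, np.nan],
                   'datetime': pd.to_datetime(['2018-03-10', None]),
                   'bool': [True, np.nan],
                   'string': ['foo', None]})

mapping = [(float, 0), (bool, False), (str, 'na'),
           (pd.Timestamp, pd.Timestamp('20200118'))]
df_type = df.applymap(type)
df_notna = df.notna()
for t, v in mapping:
    # Columns whose non-null cells are of type t get fill value v.
    mask = ((df_type == t) & df_notna).any()
    for col in df.columns[mask]:
        df[col] = df[col].fillna(v)
```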
Upvotes: 0
Reputation: 21719
Just a slight modification needed in your function:
# set to timestamp
ex_df['d'] = pd.to_datetime(ex_df['d'])
def handle_nas(df, default_date='2020-01-01'):
    """
    :param df: a dataframe
    :param default_date: date used to fill missing dates (the current iteration's run_date)
    :return: a data frame with na values replaced by 0 for integers, the default date for dates, True for bools and 'na' for text
    """
    for f in df.columns:
        # integer
        if df[f].dtype == "int":
            df[f] = df[f].fillna(0)
        # dates
        elif df[f].dtype == '<M8[ns]':
            df[f] = df[f].fillna(pd.to_datetime(default_date))
        # boolean
        elif df[f].dtype == 'bool':
            df[f] = df[f].fillna(True)
        # string
        else:
            df[f] = df[f].fillna('na')
    return df
a b c d
0 1 abc True 2020-01-01
1 2 na False 2020-01-02
2 3 def na 2020-01-01
3 na hij True 2020-01-04
You can replace the possible na's by doing this:
possible_nas = ['na','None','null','nan','none'] # add whatever you think are nas
ex_df = ex_df.applymap(lambda x: None if x in possible_nas else x)
Why are NA values a pain in data frames?
# sample dfs
df = pd.DataFrame({'a':['nan','bl','bo'], 'b':[None, 1, 2], 'c': ['none','null','go']})
# check the type of nulls
type(df['b'][0])
numpy.float64
And none of the values in possible_nas matches this type, so such values will never be replaced.
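A way around this (a sketch extending the lambda above, using the same names) is to treat genuine NaN/None as missing too, via pd.isna:

```python
import pandas as pd

possible_nas = ['na', 'None', 'null', 'nan', 'none']

# Same sample frame as above: string sentinels plus a real None.
df = pd.DataFrame({'a': ['nan', 'bl', 'bo'],
                   'b': [None, 1, 2],
                   'c': ['none', 'null', 'go']})

# pd.isna catches real NaN/None/NaT; the list catches the string sentinels.
df = df.applymap(lambda x: None if (pd.isna(x) or x in possible_nas) else x)
```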
Upvotes: 2