Doug Fir

Reputation: 21242

fillna() depending on column type function

I am spinning my wheels on a function which is supposed to take a data frame, iterate through the columns and, depending on the type of each column, fill NaNs with 0 for numeric types, 'na' for string types, a default date for date types and False for bools.

import pandas as pd
import numpy as np
df = pd.DataFrame(data={'a':[1,np.nan,3,np.nan],
                        'b':['abc',np.nan,'def','hij'],
                        'c':[True,False,np.nan,np.nan,],
                        'd': ['2020-01-01', '2020-01-02', np.nan, '2020-01-04']})

Looks like:

df
Out[96]: 
     a    b      c           d
0  1.0  abc   True  2020-01-01
1  NaN  NaN  False  2020-01-02
2  3.0  def    NaN         NaN
3  NaN  hij    NaN  2020-01-04
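For reference, a quick dtypes check (just a sketch of the same frame) shows part of the problem: NaN promotes the numeric column to float64, and the bool and date columns fall back to object:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a': [1, np.nan, 3, np.nan],
                        'b': ['abc', np.nan, 'def', 'hij'],
                        'c': [True, False, np.nan, np.nan],
                        'd': ['2020-01-01', '2020-01-02', np.nan, '2020-01-04']})

# NaN forces 'a' to float64; 'c' and 'd' are plain object, not bool/datetime
print(df.dtypes)
```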

I attempted to create a function to fill nan values:

def handle_nans(df, d):
    """
    :param df: a dataframe
    :param d: current iterations run_date
    :return: a data frame with replacement of na values as either 0 for numeric fields, 'na' for text and False for bool
    """
    for f in df:
        if df[f].dtype == "int64":
            df[f] = df[f].fillna(0)
        elif df[[f]].columns == 'date':
            df[f] = df[f].fillna(d)
        elif df[[f]].columns == 'sampling':
            df[f] = df[f].fillna(False)
        else:
            df[f] = df[f].fillna(0)
    return(df)

Tried:

handle_nans(df, '2019-01-01')
Out[100]: 
     a    b      c           d
0  1.0  abc   True  2020-01-01
1  0.0    0  False  2020-01-02
2  3.0  def      0           0
3  0.0  hij      0  2020-01-04

So this appears to have correctly filled the numeric columns with 0, but it used 0 for everything else too. I wanted string types filled with 'na', date types with the default date and boolean types with True.

How can I do that?

Upvotes: 3

Views: 2476

Answers (2)

Mykola Zotko

Reputation: 17834

To identify data types you can apply the functions type or isinstance to each value. For example:

   float   datetime  bool string
0    1.0 2018-03-10  True    foo
1    NaN        NaT   NaN   None

df.applymap(type)

Output:

             float                                           datetime  \
0  <class 'float'>  <class 'pandas._libs.tslibs.timestamps.Timesta...   
1  <class 'float'>      <class 'pandas._libs.tslibs.nattype.NaTType'>   

                 bool              string  
0      <class 'bool'>       <class 'str'>  
1     <class 'float'>  <class 'NoneType'> 

You can use the following approach:

mapping = [(float, 0), (bool, False), (str, 'na'), (pd.Timestamp, pd.Timestamp('20200118'))]

df_type = df.applymap(type)
df_notna = df.notna()
for t, v in mapping:
    mask = ((df_type == t) & df_notna).any()
    for col in df.columns[mask]:
        df[col] = df[col].fillna(v)

Output:

   float   datetime   bool string
0    1.0 2018-03-10   True    foo
1    0.0 2020-01-18  False     na

Note that the date column should be parsed first (e.g. with pd.to_datetime); otherwise its values will be identified as strings.
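Putting this together on the question's frame (a minimal end-to-end sketch; column d is parsed with pd.to_datetime first, and the default date here is the '2019-01-01' the asker passed in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, np.nan],
                   'b': ['abc', np.nan, 'def', 'hij'],
                   'c': [True, False, np.nan, np.nan],
                   'd': pd.to_datetime(['2020-01-01', '2020-01-02',
                                        np.nan, '2020-01-04'])})

# type of every cell; missing values show up as float (NaN) or NaTType (NaT)
# (applymap is deprecated in pandas >= 2.1 in favour of DataFrame.map)
df_type = df.applymap(type)
df_notna = df.notna()

mapping = [(float, 0), (bool, False), (str, 'na'),
           (pd.Timestamp, pd.Timestamp('2019-01-01'))]

for t, v in mapping:
    # columns whose non-null cells have type t
    mask = ((df_type == t) & df_notna).any()
    for col in df.columns[mask]:
        df[col] = df[col].fillna(v)
```

After the loop, column a holds 0.0 where it was NaN, b holds 'na', c holds False and d holds the default date.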

Upvotes: 0

YOLO

Reputation: 21719

Just a slight modification needed in your function:

# set the date column to timestamp first
df['d'] = pd.to_datetime(df['d'])

def handle_nas(df, default_date='2020-01-01'):
    """
    :param df: a dataframe
    :param default_date: date used to fill missing values in date columns
    :return: a data frame with na values replaced by 0 for numeric fields, 'na' for text, the default date for dates and True for bools
    """
    for f in df.columns:

        # numeric (int or float; columns with NaN are float64, not int)
        if pd.api.types.is_numeric_dtype(df[f]) and df[f].dtype != 'bool':
            df[f] = df[f].fillna(0)

        # dates
        elif df[f].dtype == '<M8[ns]':
            df[f] = df[f].fillna(pd.to_datetime(default_date))

        # boolean (only matches a true bool dtype; an object column
        # holding True/False/NaN falls through to the string branch)
        elif df[f].dtype == 'bool':
            df[f] = df[f].fillna(True)

        # string
        else:
            df[f] = df[f].fillna('na')

    return df

     a    b      c          d
0  1.0  abc   True 2020-01-01
1  0.0   na  False 2020-01-02
2  3.0  def     na 2020-01-01
3  0.0  hij     na 2020-01-04

You can replace the possible na's by doing this:

possible_nas = ['na', 'None', 'null', 'nan', 'none']  # add whatever else should count as missing
df = df.applymap(lambda x: None if x in possible_nas else x)

Why are NA values a pain in data frames?

# sample dfs
df = pd.DataFrame({'a':['nan','bl','bo'], 'b':[None, 1, 2], 'c': ['none','null','go']})

# check the type of nulls
type(df['b'][0])
numpy.float64

And none of the values in possible_nas match this type, so such values will never be replaced.
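As an alternative sketch (not from either answer above): if the date column is parsed first and the bool column is stored with pandas' nullable 'boolean' dtype (an assumption about how the frame is prepared, since a plain object column with True/False/NaN cannot be told apart by dtype), the per-type defaults can be chosen from each column's dtype and applied in a single fillna call with a dict:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, np.nan],
                   'b': ['abc', np.nan, 'def', 'hij'],
                   'c': pd.array([True, False, None, None], dtype='boolean'),
                   'd': pd.to_datetime(['2020-01-01', '2020-01-02',
                                        None, '2020-01-04'])})

def default_for(s, default_date):
    """Pick a fill value based on the column's dtype."""
    if pd.api.types.is_bool_dtype(s):          # check bool before numeric
        return False
    if pd.api.types.is_datetime64_any_dtype(s):
        return pd.Timestamp(default_date)
    if pd.api.types.is_numeric_dtype(s):
        return 0
    return 'na'                                 # everything else: text

# one fillna call, driven by a {column: fill value} dict
fills = {c: default_for(df[c], '2019-01-01') for c in df.columns}
df = df.fillna(fills)
```

This avoids the per-column loop entirely; the dtype helpers (pd.api.types.is_bool_dtype etc.) do the branching that the string comparisons in the question attempted.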

Upvotes: 2
