Keep column names when changing from numpy to pandas using imputer for numeric and non-numeric variables

Question

I have a dataframe, X_train. I am trying to create the below:

Imputer

It fills NaN values with the median for numeric variables.
It fills NaN values with the most frequent value for non-numeric variables.

X_train before imputer (code to generate)

import pandas as pd
import numpy as np
X_train = pd.DataFrame({'Default': [1,0,0,0,0,0,1],'Income': [250000,400000,'NAN',440000,500000,700000,800000],'Age': [20,30, 40,35,25,40,'NAN'],'Name':['Allen','Sara','Lily','Rock','David','Rose','Mat'],'Gender':['M','F','F','M','M','F','M'],'Type of job': ['Skilled','Unskilled','Super skilled','Super skilled','NAN','Skilled','Skilled'],'Amt of credit':['NAN',30000,50000,80000,40000,100000,300000],'Years employed':[1,10,12,6,4,13,12]})
X_train=X_train.replace('NAN',np.nan)

Code for imputer:

import pandas as pd

X_train_numeric=X_train.select_dtypes(include=['int', 'float']).columns
X_train_non_numeric=X_train.select_dtypes(exclude=['int', 'float']).columns.drop('Name')
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
t = [('num', SimpleImputer(strategy='median'), X_train_numeric),
('cat', SimpleImputer(strategy='most_frequent'), X_train_non_numeric)]
transformer = ColumnTransformer(transformers=t, remainder='passthrough')
X_train = transformer.fit_transform(X_train) #numpy array

#code used to change numpy array to pandas
X_train = pd.DataFrame(X_train, index=range(1, X_train.shape[0] + 1),
                          columns=range(1, X_train.shape[1] + 1))

X_train after imputer

Expected output

I need the column names on top. How can I do that?
I wanted to drop Name from X_train, but it is not dropped in the final outcome despite calling drop('Name') at the end of X_train_non_numeric.

jezrael · Accepted Answer

Exclude the Name column with DataFrame.drop. First replace the missing values in the numeric columns with DataFrame.median (non-numeric columns are skipped), then fill the remaining columns with the first row of DataFrame.mode:

X_train = X_train.fillna(X_train.median(numeric_only=True)).fillna(X_train.drop('Name', axis=1).mode().iloc[0])
print (X_train)
   Default    Income   Age   Name Gender    Type of job  Amt of credit  \
0        1  250000.0  20.0  Allen      M        Skilled        65000.0   
1        0  400000.0  30.0   Sara      F      Unskilled        30000.0   
2        0  470000.0  40.0   Lily      F  Super skilled        50000.0   
3        0  440000.0  35.0   Rock      M  Super skilled        80000.0   
4        0  500000.0  25.0  David      M        Skilled        40000.0   
5        0  700000.0  40.0   Rose      F        Skilled       100000.0   
6        1  800000.0  32.5    Mat      M        Skilled       300000.0   

   Years employed  
0               1  
1              10  
2              12  
3               6  
4               4  
5              13  
6              12

Detail:

print (X_train.median(numeric_only=True))
Default                0.0
Income            470000.0
Age                   32.5
Amt of credit      65000.0
Years employed        10.0
dtype: float64

Another idea is to build one Series of fill values for all columns except Name (modes for the non-numeric columns, medians for the numeric ones) and pass it to DataFrame.fillna:

s = pd.concat([X_train.drop('Name', axis=1).select_dtypes(object).mode().iloc[0], X_train.median(numeric_only=True)])
print (s)
Gender                  M
Type of job       Skilled
Default                 0
Income             470000
Age                  32.5
Amt of credit       65000
Years employed         10
dtype: object

X_train = X_train.fillna(s)
print (X_train)
   Default    Income   Age   Name Gender    Type of job  Amt of credit  \
0        1  250000.0  20.0  Allen      M        Skilled        65000.0   
1        0  400000.0  30.0   Sara      F      Unskilled        30000.0   
2        0  470000.0  40.0   Lily      F  Super skilled        50000.0   
3        0  440000.0  35.0   Rock      M  Super skilled        80000.0   
4        0  500000.0  25.0  David      M        Skilled        40000.0   
5        0  700000.0  40.0   Rose      F        Skilled       100000.0   
6        1  800000.0  32.5    Mat      M        Skilled       300000.0   

   Years employed  
0               1  
1              10  
2              12  
3               6  
4               4  
5              13  
6              12
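The fill values can also be computed once on the training data and reused on other frames. The following is only a minimal sketch of that idea: the helper name fill_values, its skip parameter, and the small train/test frames are hypothetical and are not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_values(df, skip=()):
    """Return one Series of fill values: median for numeric columns,
    most frequent value for the others. `skip` lists columns to leave alone.
    (Hypothetical helper, for illustration only.)"""
    use = df.drop(columns=list(skip))
    medians = use.median(numeric_only=True)
    modes = use.select_dtypes(exclude='number').mode().iloc[0]
    return pd.concat([medians, modes])

# Illustrative data, not the question's X_train.
train = pd.DataFrame({
    'Age': [20, 30, np.nan, 40],
    'Name': ['Allen', 'Sara', 'Lily', 'Rock'],
    'Gender': ['M', 'F', 'F', np.nan],
})
test = pd.DataFrame({
    'Age': [np.nan, 25],
    'Name': ['Mat', 'Rose'],
    'Gender': [np.nan, 'M'],
})

s = fill_values(train, skip=['Name'])
train_filled = train.fillna(s)   # column names and order are unchanged
test_filled = test.fillna(s)     # same fill values, learned from train only
print(train_filled)
print(test_filled)
```

Because DataFrame.fillna matches the Series index against column names, the Name column is never touched and the original column order is preserved.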

Your solution should be changed:

X_train=X_train.replace('NAN',np.nan)

#removed column Name
X_train = X_train.drop('Name', axis=1)
#original order of columns
cols = X_train.columns

X_train_numeric=X_train.select_dtypes(include=['int', 'float']).columns
#joined columns numeric and non numeric
X_train_non_numeric=X_train.select_dtypes(exclude=['int', 'float']).columns
new = X_train_numeric.tolist() + X_train_non_numeric.tolist()

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
t = [('num', SimpleImputer(strategy='median'), X_train_numeric),
('cat', SimpleImputer(strategy='most_frequent'), X_train_non_numeric)]
transformer = ColumnTransformer(transformers=t, remainder='passthrough')
X_train = transformer.fit_transform(X_train) #numpy array

#DataFrame constructor with the new column names, then reindex to restore the original column order
X_train = pd.DataFrame(X_train, columns=new).reindex(cols, axis=1)
print (X_train)
  Default  Income   Age Gender    Type of job Amt of credit Years employed
0       1  250000    20      M        Skilled         65000              1
1       0  400000    30      F      Unskilled         30000             10
2       0  470000    40      F  Super skilled         50000             12
3       0  440000    35      M  Super skilled         80000              6
4       0  500000    25      M        Skilled         40000              4
5       0  700000    40      F        Skilled        100000             13
6       1  800000  32.5      M        Skilled        300000             12
