scaling only numeric values in data frame that contain string

Question

i am in python , i am try to make scaling to data frame

subject_id hour_measure         urinecolor   blood pressure                  
3          1.00                 red          40
           1.15                 red          high
4          2.00              yellow          low

as it that contain numeric and text columns the following code gives me error

 #MinMaxScaler for Data
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)

it gives me error as the data frame contain string , how can i tell python to only scale columns contain numbers , and also scale numeric values in string columns .

jezrael · Accepted Answer

Convert non numeric values to missing values and then use alternative solution for scaling, last replace missing values back to original:

print (df)
   subject_id  hour_measure urinecolor blood pressure
0           3          1.00        red             40
1           3          1.15        red           high
2           4          2.00     yellow            low
3           5          5.00     yellow            100

df = df.set_index('subject_id')

df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2 = (df1 - df1.min()) / (df1.max() - df1.min())

df = df2.combine_first(df)
print (df)
            hour_measure urinecolor blood pressure
subject_id                                        
3                 0.0000        red              0
3                 0.0375        red           high
4                 0.2500     yellow            low
5                 1.0000     yellow              1

First solution:

I suggest replace text columns to numeric by dictionary like:

dbp = {'high': 150, 'low': 60}

df['blood pressure'] = df['blood pressure'].replace(dbp)

All together:

#if subject_id are numeric convert them to index
df = df.set_index('subject_id')

dbp = {'high': 150, 'low': 60}
#replace to numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)

print (df)
            hour_measure urinecolor  blood pressure
subject_id                                         
3                   1.00        red              40
3                   1.15        red             150
4                   2.00     yellow              60

print (df.dtypes)
hour_measure      float64
urinecolor         object
blood pressure      int32
dtype: object

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
#select only numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print (X)
[[0.         0.        ]
 [0.15       1.        ]
 [1.         0.18181818]]

Detail:

print (df.select_dtypes(np.number))
            hour_measure  blood pressure
subject_id                              
3                   1.00              40
3                   1.15             150
4                   2.00              60

scaling only numeric values in data frame that contain string

Answers (2)

Related Questions