mayaaa
mayaaa

Reputation: 299

scaling only numeric values in data frame that contain string

i am in python , i am try to make scaling to data frame

subject_id hour_measure         urinecolor   blood pressure                  
3          1.00                 red          40
           1.15                 red          high
4          2.00              yellow          low

as it that contain numeric and text columns the following code gives me error

 #MinMaxScaler for Data
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)

it gives me error as the data frame contain string , how can i tell python to only scale columns contain numbers , and also scale numeric values in string columns .

Upvotes: 2

Views: 2945

Answers (2)

talatccan
talatccan

Reputation: 743

Another approach as following: (I've added new row the see scaled values in blood pressure)

       hour_measure urinecolor blood pressure  temp_column
0          1.00        red             40           40
1          1.15        red           high            0
2          2.00     yellow            low            0
3          3.00     yellow             20           20

df['temp_column'] = df['blood pressure'].values
df['temp_column'] = df['temp_column'].apply(lambda x: 0 if str(x).isalpha() == True else x)

This will create a new temp_column with the numeric values of blood pressure column.

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df['hour_measure'] = scaler.fit_transform(df['hour_measure'].values.reshape(-1, 1))
df['temp_column'] = scaler.fit_transform(df['temp_column'].values.reshape(-1 ,1))

I've applied MinMaxScaler to temp_column which contains numeric values of blood pressure. And i just put scaled numeric values to back in blood pressure column.

numeric_rows = pd.to_numeric(df['blood pressure'], errors='coerce').dropna().index.tolist()
print('Index of numeric values in blood pressure column: ', numeric_rows)
for i in numeric_rows:
    df['blood pressure'].iloc[i] = df['temp_column'].iloc[i]
df = df.drop(['temp_column'], axis=1)

Result:

   hour_measure urinecolor blood pressure
0         0.000        red              1
1         0.075        red           high
2         0.500     yellow            low
3         1.000     yellow            0.5

Upvotes: 0

jezrael
jezrael

Reputation: 862591

Convert non numeric values to missing values and then use alternative solution for scaling, last replace missing values back to original:

print (df)
   subject_id  hour_measure urinecolor blood pressure
0           3          1.00        red             40
1           3          1.15        red           high
2           4          2.00     yellow            low
3           5          5.00     yellow            100

df = df.set_index('subject_id')

df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2 = (df1 - df1.min()) / (df1.max() - df1.min())

df = df2.combine_first(df)
print (df)
            hour_measure urinecolor blood pressure
subject_id                                        
3                 0.0000        red              0
3                 0.0375        red           high
4                 0.2500     yellow            low
5                 1.0000     yellow              1

First solution:

I suggest replace text columns to numeric by dictionary like:

dbp = {'high': 150, 'low': 60}

df['blood pressure'] = df['blood pressure'].replace(dbp)

All together:

#if subject_id are numeric convert them to index
df = df.set_index('subject_id')

dbp = {'high': 150, 'low': 60}
#replace to numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)

print (df)
            hour_measure urinecolor  blood pressure
subject_id                                         
3                   1.00        red              40
3                   1.15        red             150
4                   2.00     yellow              60

print (df.dtypes)
hour_measure      float64
urinecolor         object
blood pressure      int32
dtype: object

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
#select only numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print (X)
[[0.         0.        ]
 [0.15       1.        ]
 [1.         0.18181818]]

Detail:

print (df.select_dtypes(np.number))
            hour_measure  blood pressure
subject_id                              
3                   1.00              40
3                   1.15             150
4                   2.00              60

Upvotes: 1

Related Questions