Reputation: 299
i am in python , i am try to make scaling to data frame
subject_id hour_measure urinecolor blood pressure
3 1.00 red 40
1.15 red high
4 2.00 yellow low
as it that contain numeric and text columns the following code gives me error
#MinMaxScaler for Data
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)
it gives me error as the data frame contain string , how can i tell python to only scale columns contain numbers , and also scale numeric values in string columns .
Upvotes: 2
Views: 2945
Reputation: 743
Another approach as following: (I've added new row the see scaled values in blood pressure)
hour_measure urinecolor blood pressure temp_column
0 1.00 red 40 40
1 1.15 red high 0
2 2.00 yellow low 0
3 3.00 yellow 20 20
df['temp_column'] = df['blood pressure'].values
df['temp_column'] = df['temp_column'].apply(lambda x: 0 if str(x).isalpha() == True else x)
This will create a new temp_column with the numeric values of blood pressure column.
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df['hour_measure'] = scaler.fit_transform(df['hour_measure'].values.reshape(-1, 1))
df['temp_column'] = scaler.fit_transform(df['temp_column'].values.reshape(-1 ,1))
I've applied MinMaxScaler to temp_column which contains numeric values of blood pressure. And i just put scaled numeric values to back in blood pressure column.
numeric_rows = pd.to_numeric(df['blood pressure'], errors='coerce').dropna().index.tolist()
print('Index of numeric values in blood pressure column: ', numeric_rows)
for i in numeric_rows:
df['blood pressure'].iloc[i] = df['temp_column'].iloc[i]
df = df.drop(['temp_column'], axis=1)
Result:
hour_measure urinecolor blood pressure
0 0.000 red 1
1 0.075 red high
2 0.500 yellow low
3 1.000 yellow 0.5
Upvotes: 0
Reputation: 862591
Convert non numeric values to missing values and then use alternative solution for scaling, last replace missing values back to original:
print (df)
subject_id hour_measure urinecolor blood pressure
0 3 1.00 red 40
1 3 1.15 red high
2 4 2.00 yellow low
3 5 5.00 yellow 100
df = df.set_index('subject_id')
df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2 = (df1 - df1.min()) / (df1.max() - df1.min())
df = df2.combine_first(df)
print (df)
hour_measure urinecolor blood pressure
subject_id
3 0.0000 red 0
3 0.0375 red high
4 0.2500 yellow low
5 1.0000 yellow 1
First solution:
I suggest replace text columns to numeric by dictionary like:
dbp = {'high': 150, 'low': 60}
df['blood pressure'] = df['blood pressure'].replace(dbp)
All together:
#if subject_id are numeric convert them to index
df = df.set_index('subject_id')
dbp = {'high': 150, 'low': 60}
#replace to numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)
print (df)
hour_measure urinecolor blood pressure
subject_id
3 1.00 red 40
3 1.15 red 150
4 2.00 yellow 60
print (df.dtypes)
hour_measure float64
urinecolor object
blood pressure int32
dtype: object
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
#select only numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print (X)
[[0. 0. ]
[0.15 1. ]
[1. 0.18181818]]
Detail:
print (df.select_dtypes(np.number))
hour_measure blood pressure
subject_id
3 1.00 40
3 1.15 150
4 2.00 60
Upvotes: 1