Reputation: 39
Code:
num_features = [feature for feature in x.columns if x[feature].dtypes != 'O']
x[num_features].replace('N',value=0)
from sklearn.preprocessing import StandardScaler
stds = StandardScaler()
x[num_features]= stds.fit(x)
Error:
ValueError: could not convert string to float: 'N'
Upvotes: 2
Views: 2412
Reputation: 15568
To select numeric columns:
numeric_features = df.select_dtypes(include=['numeric'])
You method right now will not exclude Booleans(True and False), Category values, or Date and Time.
Just having numeric values is not enough. We may also have missing values. We need to deal with them before normalisation:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# numeric transformer
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
# now before we transform we impute missing values with columns median.
normalise_numeric_features = numeric_transformer.fit_transform(numeric_feature)
This might not be enough, as you might have different data types in your dataset each needing their own transformer. You can do:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
# select features types
numeric_features = df.select_dtypes(include=['numeric'])
categorical_features = df.select_dtypes(include=['category'])
# create transformer for each feature types
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
# create a process
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# if you have a model you can add it in a pipeline to
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression(solver='lbfgs'))])
# now we can use train the model
clf.fit(df[train_features], df[train_target])
# the fit will take the features, do the preprocessing(impute, normalise etc) that is fit_transforms, and train your model.
clf.predict(df[test_features])
# predict will just transform your features, using the learned metrics from train_features.
Hope this helps get the most of scikit-learn.
Upvotes: 2