taga

Reputation: 3895

Make regression model with categorical data with Scikit-Learn

I have a CSV file with more than 10 columns. Some of those columns contain categorical data: some have only yes and no values, some have colors (green, blue, red...), and some have other string values.

Is there a way to make the regression model with all columns?

I know that yes and no values can be represented as 1 and 0, but I have read that it's not good to represent color names or city names with numbers. Is there a better / correct way to do this?

This is the simple code with dummy data:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'par1':[1,3,5,7,9, 11,13],
                   'par2':[0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3':['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4':['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output':[103, 310, 522, 711, 921, 1241, 1451]})

print(df)

features = df.iloc[:,:-1]
result = df.iloc[:,-1]

reg = LinearRegression()
model = reg.fit(features, result)

prediction = model.predict([[2, 0.33, 'no', 'red']])

reg_score = reg.score(features, result)

print(prediction, reg_score)

In the real dataset that I'm using, those string values are very important, so I can't just remove those columns.

Upvotes: 2

Views: 6568

Answers (2)

Joe Halliwell

Reputation: 1177

You would typically "one-hot encode" categorical variables. This is also called "adding dummy variables".

You will also want to "standardize" the numerical variables.

Scikit-learn makes this easy:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough') # Default is to drop untransformed columns

t.fit_transform(df)

Finally, you'll need to transform your input in the same way before running it through the model.

Bringing it all together you get:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


df = pd.DataFrame({'par1':[1,3,5,7,9, 11,13],
                   'par2':[0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3':['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4':['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output':[103, 310, 522, 711, 921, 1241, 1451]})

t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough')

# Transform the features
features = t.fit_transform(df.iloc[:,:-1])
result = df.iloc[:,-1]

# Train the linear regression model
reg = LinearRegression()
model = reg.fit(features, result)

# Generate a prediction
example = t.transform(pd.DataFrame([{
    'par1': 2, 'par2': 0.33, 'par3': 'no', 'par4': 'red'
}]))
prediction = model.predict(example)
reg_score = reg.score(features, result)
print(prediction, reg_score)

Upvotes: 2

Igor F.

Reputation: 2699

You are asking a general question about regression, not just about scikit-learn, so I'll try to answer in general terms.

You are right about yes/no variables: you can encode them as binary variables, 0 and 1. The same principle holds for colors and other categorical variables:

You create n-1 dummy binary variables, n being the number of categories. Each dummy variable basically says whether your observation falls into the corresponding category. You declare one of them, e.g. blue, to be the default category and encode it by setting all dummy variables to zero. I.e. if it is neither red, nor green, nor any other available color, it must be blue.

The other categories are encoded by setting the corresponding dummy variable to 1 and leaving all others at zero. So for red you could set dummy1 = 1, for green dummy2 = 1 etc.

Binary variables are just a special case of this encoding, where you have two categories, which you encode with 1 (= 2-1) variable.
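As a sketch of this encoding in pandas (an assumption on my part; you could equally build the dummies by hand), `pd.get_dummies` with `drop_first=True` produces exactly the n-1 dummy variables described above, with the dropped first category acting as the all-zero default:

```python
import pandas as pd

# Hypothetical color column, mirroring 'par4' from the question
colors = pd.DataFrame({'color': ['blue', 'red', 'red', 'blue', 'green']})

# drop_first=True drops the first (alphabetical) category, 'blue',
# which becomes the default: a row of all zeros means "blue".
dummies = pd.get_dummies(colors['color'], drop_first=True)
print(dummies)
#    green    red
# 0  False  False   <- blue (the default, all zeros)
# 1  False   True   <- red
# 2  False   True   <- red
# 3  False  False   <- blue
# 4   True  False   <- green
```

So 3 colors yield 2 dummy columns, matching the n-1 rule.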

Upvotes: 1
