taga

Reputation: 3895

Make regression model with categorical data with Scikit-Learn

I have a CSV file with more than 10 columns. Some of those columns contain categorical data: some have only yes and no values, some have colors (green, blue, red...), and some have other string values.

Is there a way to make the regression model with all columns?

I know that yes and no values can be represented as 1 and 0, but I have read that it's not good to represent color names or city names with numbers. Is there a better / correct way to do this?

This is the simple code with dummy data:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'par1':[1,3,5,7,9, 11,13],
                   'par2':[0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3':['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4':['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output':[103, 310, 522, 711, 921, 1241, 1451]})

print(df)

features = df.iloc[:,:-1]
result = df.iloc[:,-1]

reg = LinearRegression()
model = reg.fit(features, result)

prediction = model.predict([[2, 0.33, 'no', 'red']])

reg_score = reg.score(features, result)

print(prediction, reg_score)

In the real dataset that I'm using, those string values are very important, so I can't just remove those columns.

Upvotes: 2

Views: 6568

Answers (2)

Joe Halliwell

Reputation: 1177

You would typically "one-hot encode" categorical variables. This is also called "adding dummy variables".

You will also want to "standardize" the numerical variables.

Scikit-learn makes this easy:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough') # Default is to drop untransformed columns

t.fit_transform(df)

Finally, you'll need to transform your input in the same way before running it through the model.

Bringing it all together you get:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


df = pd.DataFrame({'par1':[1,3,5,7,9, 11,13],
                   'par2':[0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3':['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4':['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output':[103, 310, 522, 711, 921, 1241, 1451]})

t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough')

# Transform the features
features = t.fit_transform(df.iloc[:,:-1])
result = df.iloc[:,-1]

# Train the linear regression model
reg = LinearRegression()
model = reg.fit(features, result)

# Generate a prediction
example = t.transform(pd.DataFrame([{
    'par1': 2, 'par2': 0.33, 'par3': 'no', 'par4': 'red'
}]))
prediction = model.predict(example)
reg_score = reg.score(features, result)
print(prediction, reg_score)

Upvotes: 2

Igor F.

Reputation: 2699

You are asking a general question about regression, not just about scikit-learn, so I'll try to answer in general terms.

You are right about yes/no variables: you can encode them as binary variables, 0 and 1. The same principle holds for colors and other categorical variables:

You create n-1 dummy binary variables, n being the number of categories. Each dummy variable basically says whether your observation falls into the corresponding category. You declare one of them, e.g. blue, to be the default category and encode it by setting all dummy variables to zero. I.e. if it is neither red, nor green, nor any other available color, it must be blue.

The other categories are encoded by setting the corresponding dummy variable to 1 and leaving all others at zero. So for red you could set dummy1 = 1, for green dummy2 = 1 etc.

Binary variables are just a special case of this encoding, where you have two categories, which you encode with 1 (= 2-1) variable.
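As a sketch of this encoding in pandas (an assumption on my part; you could equally build the dummies by hand), `pd.get_dummies` with `drop_first=True` produces exactly the n-1 dummy variables described above, with the dropped first category acting as the all-zero default:

```python
import pandas as pd

# Hypothetical color column, mirroring 'par4' from the question
colors = pd.DataFrame({'color': ['blue', 'red', 'red', 'blue', 'green']})

# drop_first=True drops the first (alphabetical) category, 'blue',
# which becomes the default: a row of all zeros means "blue".
dummies = pd.get_dummies(colors['color'], drop_first=True)
print(dummies)
#    green    red
# 0  False  False   <- blue (the default, all zeros)
# 1  False   True   <- red
# 2  False   True   <- red
# 3  False  False   <- blue
# 4   True  False   <- green
```

So 3 colors yield 2 dummy columns, matching the n-1 rule.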

Upvotes: 1
