Reputation: 3895
I have a CSV file with more than 10 columns. Some of those columns hold categorical data: some have only yes and no values, some have colors (green, blue, red, ...), and some have other string values.
Is there a way to build a regression model using all of these columns?
I know that yes and no values can be represented as 1 and 0, but I have read that it's not a good idea to represent color names or city names with numbers.
Is there a better / correct way to do this?
This is a simple example with dummy data:
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'par1': [1, 3, 5, 7, 9, 11, 13],
                   'par2': [0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3': ['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4': ['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output': [103, 310, 522, 711, 921, 1241, 1451]})
print(df)
features = df.iloc[:,:-1]
result = df.iloc[:,-1]
reg = LinearRegression()
model = reg.fit(features, result)
prediction = model.predict([[2, 0.33, 'no', 'red']])
reg_score = reg.score(features, result)
print(prediction, reg_score)
In the real dataset that I'm using, those string values are very important to the dataset, so I can't just remove those columns.
Upvotes: 2
Views: 6568
Reputation: 1177
You would typically "one-hot encode" categorical variables. This is also called "adding dummy variables".
You will also want to "standardize" the numerical variables.
Scikit-learn makes this easy:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough')  # Default is to drop untransformed columns
t.fit_transform(df)
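If you want to see which columns the transformer produces, recent scikit-learn versions (1.0+) expose get_feature_names_out; a minimal sketch, assuming the t fitted above:
# Inspect the generated column names: the one-hot columns, the scaled
# columns, and the passed-through remainder (requires scikit-learn >= 1.0).
print(t.get_feature_names_out())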
Finally, you'll need to transform your input in the same way before running it through the model.
Bringing it all together you get:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
df = pd.DataFrame({'par1': [1, 3, 5, 7, 9, 11, 13],
                   'par2': [0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3': ['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4': ['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output': [103, 310, 522, 711, 921, 1241, 1451]})
t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough')
# Transform the features
features = t.fit_transform(df.iloc[:,:-1])
result = df.iloc[:,-1]
# Train the linear regression model
reg = LinearRegression()
model = reg.fit(features, result)
# Generate a prediction
example = t.transform(pd.DataFrame([{
    'par1': 2, 'par2': 0.33, 'par3': 'no', 'par4': 'red'
}]))
prediction = model.predict(example)
reg_score = reg.score(features, result)
print(prediction, reg_score)
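An alternative is to wrap the transformer and the regressor in a scikit-learn Pipeline, so the preprocessing is applied automatically at prediction time. A minimal sketch, reusing the df and t defined above:
from sklearn.pipeline import Pipeline

# The Pipeline fits the ColumnTransformer and the regressor together,
# so raw (untransformed) rows can be passed straight to predict().
pipe = Pipeline([('preprocess', t), ('regress', LinearRegression())])
pipe.fit(df.iloc[:, :-1], df.iloc[:, -1])
print(pipe.predict(pd.DataFrame([{'par1': 2, 'par2': 0.33, 'par3': 'no', 'par4': 'red'}])))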
Upvotes: 2
Reputation: 2699
You are asking a general question about regression, not just regarding SciKit, so I'll try to answer in general terms.
You are right about yes/no variables: you can encode them as binary variables, 0 and 1. But the same principle holds for colors and other categorical variables:
You create n-1 dummy binary variables, n being the number of categories. Each dummy variable basically says whether your observation falls into the corresponding category. You declare one of them, e.g. blue, to be the default category, and encode it by setting all dummy variables to zero, i.e. if it is neither red, nor green, nor any other available color, it must be blue.
The other categories are encoded by setting the corresponding dummy variable to 1 and leaving all others at zero. So for red you could set dummy1 = 1, for green dummy2 = 1, etc.
Binary variables are just a special case of this encoding, where you have two categories, which you encode with 1 (= 2-1) variable.
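In pandas, this n-1 encoding is what pd.get_dummies produces with drop_first=True; a minimal sketch using the color column from the question:
import pandas as pd

colors = pd.Series(['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'], name='par4')
# drop_first=True drops the first category (alphabetically 'blue'),
# which becomes the default: a row of all zeros means 'blue'.
print(pd.get_dummies(colors, drop_first=True))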
Upvotes: 1