Reputation: 2068
I am using a data set to make some predictions using multivariable regression techniques. I have to predict the salary of employees based on some independent variables such as gender, percentage, date of birth, marks in different subjects, degree, specialization, etc.
Numeric parameters (e.g. marks and percentage in different subjects) are fine to use with the regression model. But how do we normalize the non-numeric parameters (gender, date of birth, degree, specialization) here?
P.S. : I am using the scikit-learn : machine learning in python package.
Upvotes: 3
Views: 1240
Reputation: 680
You want to encode your categorical parameters.
Note that date is not a categorical parameter! Convert it into a unix timestamp (seconds since epoch) and you have a nice parameter on which you can regress.
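For example, turning a date-of-birth string into seconds since the epoch might look like this (the date format and sample values are just illustrative; adapt them to however your data set stores dates):

```python
from datetime import datetime, timezone

# Illustrative date-of-birth strings in ISO format
dobs = ["1990-05-17", "1985-11-02"]

# Parse each date and convert it to seconds since the Unix epoch (UTC),
# giving a continuous numeric feature suitable for regression
timestamps = [
    datetime.strptime(d, "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp()
    for d in dobs
]
print(timestamps)
```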
Upvotes: 1
Reputation: 10833
"Normaliz[ing] non-numeric parameters" is actually a huge area of regression. The most common treatment is to turn each categorical into a set of binary variables called dummy variables.
Each categorical with n values should be converted into n-1 dummy variables. So for example, for gender, you might have one variable, "female", that would be either 0 or 1 at each observation. Why n-1 and not n? Because you want to avoid the dummy variable trap, where basically the intercept column of all 1's can be reconstructed from a linear combination of your dummy columns. In relatively non-technical terms, that's bad because it messes up the linear algebra needed to do the regression.
I am not so familiar with the scikit-learn library, but I urge you to make sure that whatever methods you do use, each categorical becomes n-1 new columns, and not n.
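As a sketch of the n-1 encoding described above, pandas' get_dummies with drop_first=True does exactly this (the column names and values here are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the employee data (names are illustrative)
df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "degree": ["BSc", "MSc", "PhD"],
    "marks": [78, 85, 91],
})

# drop_first=True keeps n-1 dummy columns per categorical,
# avoiding the dummy variable trap
encoded = pd.get_dummies(df, columns=["gender", "degree"], drop_first=True)
print(encoded.columns.tolist())
```

Here "gender" (2 values) becomes a single gender_male column, and "degree" (3 values) becomes two columns, with the dropped category absorbed into the intercept.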
Upvotes: 0
Reputation: 1366
I hope this helps. The full description of how to use that function is available at this link:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
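For instance, a minimal sketch of what that function does (note it rescales rows of numeric data to unit norm; it does not encode categorical variables):

```python
import numpy as np
from sklearn.preprocessing import normalize

# A single sample with three numeric features
X = np.array([[1.0, 2.0, 2.0]])

# By default, normalize scales each row to unit L2 norm
X_norm = normalize(X)
print(X_norm)
```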
Upvotes: 0