Is one hot encoding required for this data set?

Below is the data set from the UCI data repository. I want to build a regression model taking platelets count as the dependent variable (y) and the rest as features/inputs.

However, there are a few categorical variables, such as anemia, sex, smoking, and DEATH_EVENT, that are stored in numeric form in the data set.

My questions are:

Should I perform 'one-hot encoding' on these variables before building a regression model?
Also, I observe that the values fall in very different ranges, so should I scale the data set before applying the regression model?

Upvotes: 1

Answers (3)

DejaVuSansMono

Reputation: 797

If those are truly binary categories, you don't have to one-hot encode them. They are already encoded as 0/1.
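To illustrate the point, here is a minimal sketch using pandas. The column names come from the question, but the values are made up for the example and are not taken from the actual data set. It shows that one-hot encoding a 0/1 column only produces two complementary indicator columns, one of which duplicates the original.

```python
import pandas as pd

# Toy data with made-up values; only the column names come from the question.
df = pd.DataFrame({"sex": [0, 1, 1, 0], "smoking": [1, 0, 0, 1]})

# One-hot encoding a 0/1 column yields two complementary indicator columns.
dummies = pd.get_dummies(df, columns=["sex"], dtype=int)

# The two indicators always sum to 1, so one of them carries no new information.
redundant = (dummies["sex_0"] + dummies["sex_1"] == 1).all()

# Dropping the first level leaves a column identical to the original 0/1 column.
reduced = pd.get_dummies(df, columns=["sex"], drop_first=True, dtype=int)
same_as_original = (reduced["sex_1"] == df["sex"]).all()

print(redundant, same_as_original)
```

For a linear regression with an intercept, keeping both indicator columns would also make the features perfectly collinear, which is one more reason to leave a binary column as it is.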

Upvotes: 1

ajay sagar

Reputation: 199

You don't have to use one-hot encoding, since those columns already hold numerical values. If those values are actually stored as strings rather than int or float, however, you should one-hot encode them. As for scaling the data, the ranges vary considerably, so you should scale it to keep your regression model from being biased towards features with large values.

Upvotes: 0

Ricky

Reputation: 2750

1. Should I perform 'one-hot encoding' on these variables before building a regression model?

Yup, you should one-hot encode the categorical variables. You can do it like this:

import pandas as pd

# df is assumed to be the DataFrame already loaded from the data set
columns_to_category = ['sex', 'smoking', 'DEATH_EVENT']
df[columns_to_category] = df[columns_to_category].astype('category')  # change the dtypes to category
df = pd.get_dummies(df, columns=columns_to_category)  # one-hot encode the categories

2. If so, is one-hot encoding alone sufficient, or should I also perform label encoding?

One-hot encoding should be sufficient, I guess.

3. Also, I observe that the values fall in very different ranges, so should I scale the data set before applying the regression model?

Yes, you can use either StandardScaler() or MinMaxScaler() to get better results, and then inverse-scale the predictions. Also, make sure you do not scale the train and test sets combined: fit the scaler on the training set only and then apply that same fitted scaler to the test set, because in real life the test data is not available when you build the model, so using it to compute the scaling would leak information.
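A rough sketch of that workflow with scikit-learn follows. The data here is synthetic and purely hypothetical, and LinearRegression is just a stand-in for whatever regression model you pick; the names x_scaler and y_scaler are mine. The part that matters is that both scalers are fitted on the training split only and that the predictions are mapped back to the original units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); replace with the real features and target.
rng = np.random.default_rng(0)
X = rng.normal(loc=[60, 250, 1.4], scale=[12, 90, 1.0], size=(200, 3))
y = 260000 + 500 * X[:, 0] + rng.normal(scale=10000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scalers on the training split only, then reuse them on the test split.
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

X_train_s = x_scaler.transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X_train_s, y_train_s)

# Predictions come out in scaled units; map them back to the original units.
pred_scaled = model.predict(X_test_s)
pred = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
```

Scaling the target is optional for plain linear regression; it is included here only because the inverse scaling step applies to a scaled target.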

Upvotes: 3
