Reputation: 39
I have already created dummy variables for all my categorical columns, but I need to split my data into train and test sets, with "Loan_Status" as my target. I am confused because creating dummy variables produces two new columns for "Loan_Status". When and how should I split my data and create the target?
# Convert the categorical features into dummy variables.
df_dummies = pd.get_dummies(df1, columns=['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Loan_Status'])
df_dummies.head()
It looked like this before. How would I make Loan_Status the target? Wouldn't splitting the data before creating the dummies cause issues?
Upvotes: 0
Views: 288
Reputation: 2851
As a rule of thumb, you should stick to pd.get_dummies(..., drop_first=True)
to avoid creating redundant columns, since N-1 columns carry full information about N possible values.
However, one-hot encoding is a bit excessive for binary values; you're probably better off just using something like .map().
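To make that concrete, here is a minimal sketch of how the two suggestions could fit together. It assumes Loan_Status holds the labels "Y" and "N" (not confirmed in the question), and it uses a small made-up frame in place of df1; the ApplicantIncome column and all values are purely illustrative. The target is mapped to a single 0/1 column and kept out of get_dummies, so only the features are one-hot encoded.

```python
import pandas as pd

# Hypothetical toy data standing in for df1; column values are made up.
df1 = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes", "Yes", "No"],
    "ApplicantIncome": [5000, 3000, 4000, 6000, 2500],  # illustrative numeric feature
    "Loan_Status": ["Y", "N", "Y", "Y", "N"],
})

# Encode the binary target as one 0/1 column (assumes the labels are "Y"/"N").
y = df1["Loan_Status"].map({"Y": 1, "N": 0})

# One-hot encode the feature columns only; the target is dropped first.
X = pd.get_dummies(
    df1.drop(columns="Loan_Status"),
    columns=["Gender", "Married"],
    drop_first=True,
)

# Simple 80/20 split using pandas alone: sample the training rows,
# then take whatever is left as the test rows.
X_train = X.sample(frac=0.8, random_state=0)
X_test = X.drop(X_train.index)
y_train, y_test = y.loc[X_train.index], y.loc[X_test.index]
```

Encoding before splitting, as here, keeps the train and test frames on the same set of dummy columns. If scikit-learn is available, train_test_split(X, y, test_size=0.2) would likely be the more usual way to do the split itself.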
Upvotes: 0