Coder

Reputation: 455

How to build a classifier from one-hot encoded categorical features?

I am working with a data frame, part of which is shown below:

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
25, Private,226802, 11th,7, Never-married, Machine-op-inspct, Own-child, Black, Male,0,0,40, United-States, <=50K
38, Private,89814, HS-grad,9, Married-civ-spouse, Farming-fishing, Husband, White, Male,0,0,50, United-States, <=50K
28, Local-gov,336951, Assoc-acdm,12, Married-civ-spouse, Protective-serv, Husband, White, Male,0,0,40, United-States, >50K
44, Private,160323, Some-college,10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male,7688,0,40, United-States, >50K
18, ?,103497, Some-college,10, Never-married, ?, Own-child, White, Female,0,0,30, United-States, <=50K
34, Private,198693, 10th,6, Never-married, Other-service, Not-in-family, White, Male,0,0,30, United-States, <=50K
29, ?,227026, HS-grad,9, Never-married, ?, Unmarried, Black, Male,0,0,40, United-States, <=50K

After removing the rows with ' ?' values from the data frame, I encode the categorical columns:

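import numpy as np
import pandas as pd

# Categorical columns to one-hot encode
cat = [
    'workclass', 'education', 'marital-status', 'occupation', 'relationship',
    'race', 'sex', 'native-country', 'class'
]

# Encode sex column as 0/1 (Female -> 0, otherwise 1).
# The raw values carry a leading space (' Female'), so strip before comparing.
df["Value"] = np.where(df["sex"].str.strip() == 'Female', 0, 1)

# Encode categorical columns; each dummy column is named '<column>_<value>'
data = pd.get_dummies(df, columns=cat)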

Now I have a data frame ready for logistic regression, which I want to use to classify sex from the other features. I would like to do this step by step: first, build a classifier for 'sex' based only on 'workclass'. However, 'workclass' has been encoded into several new columns, and I don't know all of their names. How can I fit a logistic regression model that classifies sex using only the encoded 'workclass' columns, and then using combinations of the other features? Also, how do I find the best classifier?

Thanks

Upvotes: 1

Views: 54

Answers (1)

Vivek Kalyanarangan

Reputation: 9081

pandas prefixes every dummy column with the name of the original column. You can use that prefix to build X and y, changing the column name at each step:

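# Match on the 'workclass_' prefix; a plain substring test such as 'class' in i would also pick up the workclass columns
X = data[[c for c in data.columns if c.startswith('workclass_')]]  # change 'workclass' here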
y = data['sex_ Male']
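
To extend this to combinations of features and to compare them, one option is to score each candidate set of columns with cross-validation. The following is only a minimal sketch under stated assumptions: the small random frame stands in for the real data, and the helper `dummy_columns`, the `candidates` list, and the choice of 5-fold accuracy are illustrative rather than part of the original question or answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical toy frame standing in for the real data
# (values keep the leading space seen in the CSV).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "workclass": rng.choice([" Private", " Local-gov", " Self-emp-inc"], size=n),
    "education": rng.choice([" HS-grad", " 11th", " Some-college"], size=n),
    "education-num": rng.integers(5, 13, size=n),
    "class": rng.choice([" <=50K", " >50K"], size=n),
    "sex": rng.choice([" Male", " Female"], size=n),
})

features = ["workclass", "education", "class"]
data = pd.get_dummies(df, columns=features + ["sex"])
y = data["sex_ Male"].astype(int)


def dummy_columns(frame, feature):
    """Return the dummy columns that get_dummies created for one feature."""
    prefix = feature + "_"
    return [c for c in frame.columns if c.startswith(prefix)]


# Candidate feature sets: single features first, then combinations.
candidates = [
    ("workclass",),
    ("education",),
    ("workclass", "education"),
    ("workclass", "education", "class"),
]

scores = {}
for combo in candidates:
    cols = [c for feature in combo for c in dummy_columns(data, feature)]
    X = data[cols].astype(float)
    model = LogisticRegression(max_iter=1000)
    # Mean accuracy over 5 stratified folds
    scores[combo] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# The feature set with the highest mean cross-validated accuracy
best = max(scores, key=scores.get)
print(best, scores[best])
```

On real data, accuracy may not be the most informative metric if the classes are imbalanced, so the `scoring` argument is worth adjusting to whatever "best" should mean for the task.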

Upvotes: 1
