Tags: python, pandas, dataframe, machine-learning, feature-selection

Reputation: 11192

How to handle categorical data for preprocessing in Machine Learning

This may be a basic question. I have categorical data that I want to feed into my machine learning model, but the model accepts only numerical data. What is the correct way to convert categorical data into numerical data?

My Sample DF:

  T-size Gender  Label
0      L      M      1
1      L      M      1
2      M      F      1
3      S      F      0
4      M      M      1
5      L      M      0
6      S      F      1
7      S      F      0
8      M      M      1

I know the following code converts my categorical data into numerical data:

Type-1:

df['T-size'] = df['T-size'].cat.codes

The line above simply maps the categories to 0 through N-1. It does not preserve any relationship between them.

For this example I know that S < M < L. What should I do when I want to convert data like this?

Type-2:

In this type there is no ordering between M and F, but I can tell that M has a higher probability than F of having Label 1, i.e. (number of samples with Label 1) / (total number of samples).

for Male,

(4/5)

for Female,

(2/4)

We know that,

(4/5) > (2/4)

How should I encode this kind of column?

Can I replace M with (4/5) and F with (2/4) for this problem?

What is the proper way to deal with such a column?

Please help me understand this better.

Upvotes: 1

Answers (4)

Dan

Reputation: 45762

There are many ways to encode categorical data, and some of them depend on exactly what you plan to do with it. For example, one-hot encoding, which is easily the most popular choice, is an extremely poor choice if you're planning on using a decision tree, random forest, or GBM.

Regarding your t-shirts above, you can give a pandas categorical type an order:

df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S', 'M', 'L'], ordered=True))

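If you set up your T-size categorical like that, then your .cat.codes method works perfectly (S -> 0, M -> 1, L -> 2). Within scikit-learn pipelines, OrdinalEncoder with an explicit categories order gives the same result. Note that LabelEncoder is meant for target labels and sorts its classes alphabetically, so it would not preserve S < M < L.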
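As a minimal sketch of the ordered-categorical approach, the snippet below rebuilds the sample DataFrame from the question and checks the resulting codes. The column name T-size_code is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "T-size": ["L", "L", "M", "S", "M", "L", "S", "S", "M"],
    "Gender": ["M", "M", "F", "F", "M", "M", "F", "F", "M"],
    "Label":  [1, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Declare the order S < M < L explicitly.
size_dtype = pd.api.types.CategoricalDtype(["S", "M", "L"], ordered=True)
df["T-size"] = df["T-size"].astype(size_dtype)

# The codes follow the declared order: S -> 0, M -> 1, L -> 2.
# "T-size_code" is a hypothetical column name used for illustration.
df["T-size_code"] = df["T-size"].cat.codes
print(df[["T-size", "T-size_code"]])
```

Without the explicit category list, pandas would order the categories alphabetically (L, M, S), which is why the plain .cat.codes call in the question loses the size order.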

Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's Category Encoders, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
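To make the leakage point concrete, here is a simplified sketch in plain pandas that learns the per-gender mean of Label from the training rows only and then applies it to the held-out rows. It is not how TargetEncoder is implemented (library implementations typically add smoothing); the split, the Gender_te column name, and the global-mean fallback are assumptions made for illustration.

```python
import pandas as pd

# Gender and Label columns copied from the question.
df = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M", "M", "F", "F", "M"],
    "Label":  [1, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Hypothetical split: first six rows for training, last three held out.
train, test = df.iloc[:6].copy(), df.iloc[6:].copy()

# Learn the encoding from the training rows only.
gender_means = train.groupby("Gender")["Label"].mean()

# Assumed fallback for categories that never appear in training.
global_mean = train["Label"].mean()

# Apply the same learned mapping to both parts.
train["Gender_te"] = train["Gender"].map(gender_means)
test["Gender_te"] = test["Gender"].map(gender_means).fillna(global_mean)

print(gender_means)
print(test)
```

Note that the values differ from the 4/5 and 2/4 in the question, because those were computed on all nine rows, including the ones held out here.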

Upvotes: 2

Thomas Kimber

Reputation: 11107

It might be overkill for the M/F example, since it's binary, but if you are ever concerned about mapping a categorical into a numerical form, then consider one-hot encoding. It basically stretches your single column containing n categories into n binary columns.

So a dataset of:

Gender
M
F
M
M
F

Would become

Gender_M    Gender_F
1           0
0           1
1           0
1           0
0           1

This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
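One way to produce such columns, sketched here with pandas' get_dummies on the five-row example above (the variable name one_hot is just for illustration):

```python
import pandas as pd

# The five-row Gender example from above.
df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)
print(one_hot)
```

get_dummies orders the new columns alphabetically, so Gender_F comes before Gender_M here; the values are otherwise the same as in the table above.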

Upvotes: 1

Joe

Reputation: 12417

For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:

d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)

Output:

   T-size Gender  Label
0       2      M      1
1       2      M      1
2       1      F      1
3       0      F      0 
4       1      M      1
5       2      M      0
6       0      F      1
7       0      F      0
8       1      M      1

For the second question, you can use the same method, but I would simply map males and females to the two values 0 and 1. If you only need the category and don't have to do arithmetic on the values, one value is as good as another.

Upvotes: 1

SantiStSupery

Reputation: 212

If you want to have a hierarchy in your size parameter, you may consider using a linear mapping for it. This would be:

size_mapping = {"S": 1, "M":2 , "L":3}

#mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)

This allows you to treat the input as numerical data while preserving the hierarchy.

As for gender, you are conflating the class distribution with the preprocessing. If you feed the distribution in as an input, you will introduce a bias into your data. You must treat male and female as two distinct categories regardless of their observed distribution. You should map them to two different numbers, without introducing proportions.

df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})

For a more detailed explanation that covers more cases than your question, I suggest reading this article regarding categorical data in Machine Learning.

Upvotes: 1
