Tyr

Reputation: 610

Categorical variables' dimensionality reduction

I have a manufacturing dataset which contains only 3 columns.

Column 1. WorkStationID
Column 2. ProductID
Column 3. Error(1 or 0)

I'm trying to predict Error (1 or 0) as a classification problem. But there are 50 unique workstations and 130 unique product IDs, so when I transform them into dummy variables, the dataframe becomes huge.

So, my question is: are dimensionality reduction techniques suitable for dummy variables? In reality I have only 2 variables (workstation and product), so it sounds like there is no need for any reduction. Or are feature importance techniques suitable instead? And what would it mean if such a technique indicated that 5 different workstations are useless?

Thanks in advance

Upvotes: 0

Views: 2768

Answers (1)

Ankur Sinha

Reputation: 6639

If you do not want too many dummy variables, one thing to consider is binary encoding. When I have had this problem, I opted for binary encoding and it worked out fine most of the time, so it may be worth a shot for you.

Imagine you have 9 categories. Number them from 1 to 9 and write each number in binary, and you get:

cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0 
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1

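In your case, with 50 workstations, you can go from 49 features (one-hot, with one level dropped) to 6 features (binary encoded, since 2^6 = 64 is the smallest power of 2 that covers 50). The same reasoning takes the 130 product IDs down to 8 features (2^8 = 256).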
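To make the arithmetic concrete, here is a minimal pure-Python sketch of the encoding described above. The function name `binary_encode` and the labels `WS01`..`WS50` and `P001`..`P130` are made up for illustration; they are not from the question's dataset. Codes start at 1 to match the table above, so the all-zero pattern is never used.

```python
def binary_encode(values):
    """Map each distinct category to a fixed-width list of 0/1 bits."""
    # Number the categories 1..k in sorted order (illustrative choice;
    # any fixed ordering works as long as it is reused at prediction time).
    categories = sorted(set(values))
    codes = {cat: i for i, cat in enumerate(categories, start=1)}
    # Number of bits needed to write the largest code, k.
    width = len(categories).bit_length()
    mapping = {
        cat: [int(bit) for bit in format(code, f"0{width}b")]
        for cat, code in codes.items()
    }
    return mapping, width


# Hypothetical labels standing in for the real IDs.
workstations = [f"WS{i:02d}" for i in range(1, 51)]   # 50 workstations
products = [f"P{i:03d}" for i in range(1, 131)]       # 130 products

ws_map, ws_width = binary_encode(workstations)
pr_map, pr_width = binary_encode(products)

# One input row becomes the two bit lists joined together.
row = ws_map["WS50"] + pr_map["P130"]
print(ws_width, pr_width, len(row))
```

Under these assumptions each row ends up with 6 + 8 = 14 feature columns instead of roughly 180 dummy columns. Note that the bit columns are an arbitrary relabeling of the categories, so whether a model copes well with them is something to check on your own data.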

After doing this, you can also try out the feature-selector library by Will Koehrsen. You can plot a feature importance graph to see whether you can drop further features that do not add value to your prediction. Maybe you can come down from 6 to an even smaller number of variables.

It usually gives a clear bar chart that helps visualize the importance of the different features and lets you experiment further with them.


PS: This is an open-ended question, and my answer is based on my own experience. There is no particular "right or wrong" here; you can only try it and see whether it works for your use case.

Upvotes: 1
