Reputation: 610
I have a manufacturing dataset which contains only 3 columns.
Column 1. WorkStationID
Column 2. ProductID
Column 3. Error(1 or 0)
I'm trying to predict Error (1 or 0) as a classification problem. But there are 50 unique WorkStationIDs and 130 unique ProductIDs, so when I transform them to dummy variables the dataframe becomes huge.
So my question is: are dimensionality reduction techniques suitable for dummy variables? Since I really have only 2 variables (workstation and product), it sounds like there should be no need for any reduction. Or are feature importance techniques suitable instead? And what would it mean if such a technique indicated that 5 different workstations are useless?
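To make the setup concrete, here is a minimal sketch of the dummy-variable step that blows up (the sample values are made up; only the three column names match my data):

    import pandas as pd

    # Toy rows standing in for the real data (same three column names).
    df = pd.DataFrame({
        "WorkStationID": ["WS01", "WS02", "WS01"],
        "ProductID": ["P001", "P002", "P003"],
        "Error": [0, 1, 0],
    })

    # One column per unique category: with 50 workstations and 130 products,
    # this produces roughly 180 dummy columns on the full dataset.
    X = pd.get_dummies(df[["WorkStationID", "ProductID"]])
    y = df["Error"]
    print(X.shape)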
Thanks in advance
Upvotes: 0
Views: 2768
Reputation: 6639
If you do not want too many dummy variables, one thing to consider is binary encoding. In many cases where I had this problem, binary encoding worked out fine, so it is worth a shot for you.
Imagine you have 9 categories, numbered 1 to 9. Binary encoding them gives you:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
In your case, with 50 workstations, you can reduce from 49 features (one-hot encoded, dropping one level) to 6 features (binary encoded, since 2^6 = 64 >= 50).
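A minimal sketch of how this could look with the category_encoders package (the sample values are placeholders; only the column names are assumed to match your data):

    import pandas as pd
    import category_encoders as ce

    # Toy frame standing in for the real data (same three column names).
    df = pd.DataFrame({
        "WorkStationID": ["WS01", "WS02", "WS03"],
        "ProductID": ["P001", "P002", "P001"],
        "Error": [0, 1, 0],
    })

    # BinaryEncoder replaces each categorical column with ceil(log2(n_categories))
    # 0/1 columns instead of one column per category.
    encoder = ce.BinaryEncoder(cols=["WorkStationID", "ProductID"])
    X = encoder.fit_transform(df[["WorkStationID", "ProductID"]])
    y = df["Error"]
    print(X.head())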
After doing this, you can also try out the feature-selector library from Will Koehrsen. You can plot a feature importance graph to see whether you can further get rid of features that do not add value to your prediction. Maybe you can come down from 6 to an even smaller number of variables.
It usually gives a nice bar chart that helps visualize the importance of the different features and lets you experiment further with them.
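If you would rather not pull in that library, here is a rough sketch of the same idea using scikit-learn's RandomForestClassifier importances (the data and column names below are random stand-ins for whatever the encoder produced, not your real features):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier

    # X, y as produced by the binary-encoding step above; random stand-ins here.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.integers(0, 2, size=(500, 6)),
                     columns=[f"WorkStationID_{i}" for i in range(6)])
    y = rng.integers(0, 2, size=500)

    # Fit a forest and rank the encoded columns by impurity-based importance.
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    importances.sort_values().plot.barh()
    plt.xlabel("Feature importance")
    plt.tight_layout()
    plt.show()

Columns with importance near zero are candidates to drop, though with binary-encoded bits you would usually drop or keep a whole original variable rather than individual bits.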
Upvotes: 1