Reputation: 85
I'm trying to make a small data science tool (kind of a mini version of WEKA). I have datasets with large numbers of features (70-100+), and they're mostly categorical. I'm using Python sklearn for the machine learning logic, and judging by the errors sklearn gives me, I need to convert these categories into numeric values.
Given this, one-hot encoding isn't an option because it would increase the dimensionality too much.
I've researched other approaches that might work, like frequency encoding and label encoding, but I'm not really sure which to choose in my case.
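To make it concrete, here's a rough sketch (with pandas, on a toy column; the column name and values are just placeholders, not my real data) of the two encodings I'm considering:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red"], name="color")

# Label/ordinal encoding: an arbitrary integer per category.
label_encoded = s.astype("category").cat.codes

# Frequency encoding: replace each category by how often it occurs.
freq_encoded = s.map(s.value_counts(normalize=True))

print(label_encoded.tolist())   # [2, 0, 2, 1, 2]
print(freq_encoded.tolist())    # [0.6, 0.2, 0.6, 0.2, 0.6]
```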
Also, how does WEKA actually handle these? I loaded my datasets into WEKA and they worked fine and gave me good results!
Upvotes: -1
Views: 459
Reputation: 2608
That depends on the algorithm: some handle categorical attributes natively, like J48 (Weka's C4.5 implementation), which performs multi-way splits on categorical attributes; others have to convert the data, like SMO (Weka's support vector machine), which binarizes nominal attributes and therefore increases the number of attributes to learn from.
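For an sklearn-side analogue (this is not what Weka does internally, just a hedged sketch with made-up column names and a toy DataFrame): you can ordinal-encode categories for tree-based models, which is the closest thing to handing J48 nominal attributes, and one-hot/binarize them for an SVM the way SMO does; OneHotEncoder produces a sparse matrix by default, so the extra columns stay cheap in memory.

```python
# Hedged sketch of the sklearn-side analogue (not Weka internals).
# The column names and the tiny DataFrame are made up for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size":  ["S", "M", "L", "M"],
    "label": [0, 1, 0, 1],
})
cat_cols = ["color", "size"]
X, y = df[cat_cols], df["label"]

# Tree-based models: integer codes are usually tolerable because splits
# are threshold-based (closest sklearn analogue to giving J48 nominals).
tree_model = make_pipeline(
    ColumnTransformer([("ord", OrdinalEncoder(), cat_cols)]),
    RandomForestClassifier(random_state=0),
).fit(X, y)

# SVM: binarize the nominal attributes, as SMO does; the one-hot output
# is sparse, so the added columns don't blow up memory.
svm_model = make_pipeline(
    ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)]),
    SVC(),
).fit(X, y)
```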
Upvotes: 0