CAVP33

Reputation: 85

What is the best way to encode a large number of categorical features?

I'm trying to make a small data science tool (kind of like a mini version of WEKA). My datasets have a large number of features (70-100+), and most of them are categorical. I'm using Python's sklearn for the machine learning logic, and judging by the errors sklearn has given me, I need to convert these categories into numeric values first.

Given this, One Hot Encoding isn't an option because it will enlarge the dimensionality too much.

I've researched other approaches that might work, such as frequency encoding and label encoding, but I'm not really sure which one to choose in my case.
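For reference, here is a minimal sketch of what those two encodings compute, written in plain Python so it does not depend on any particular library version. The helper names `label_encode` and `frequency_encode` and the `colors` column are made up for illustration; they are not sklearn functions.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


# Hypothetical categorical column, for illustration only.
colors = ["red", "blue", "red", "green", "red", "blue"]

codes, mapping = label_encode(colors)
freqs = frequency_encode(colors)

print(codes)    # one integer per row
print(mapping)  # category -> integer
print(freqs)    # one float per row
```

Both keep one numeric column per original feature, so 100 categorical features stay 100 columns. The trade-offs as I understand them: label encoding imposes an arbitrary order on the categories, and frequency encoding gives the same value to any two categories that happen to occur equally often.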

Also, how does WEKA actually handle these? I loaded my datasets into WEKA and they worked fine, and they gave me good results!

Upvotes: -1

Views: 459

Answers (1)

fracpete

Reputation: 2608

That depends on the algorithm. Some handle categorical attributes natively, such as J48 (Weka's C4.5 implementation), which performs multi-way splits on categorical attributes. Others have to convert the data first, such as SMO (support vector machine), which binarizes nominal attributes and thereby increases the number of attributes to learn from.
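To illustrate what binarizing nominal attributes does to the attribute count, here is a minimal plain-Python sketch. It is not Weka's code, and the `binarize` helper and the toy rows are hypothetical; it only shows the general idea of replacing each nominal attribute with one 0/1 attribute per distinct value.

```python
def binarize(rows, nominal_columns):
    """One-hot encode the named nominal columns.

    Returns (header, encoded_rows), with one 0/1 column per distinct value.
    """
    # Distinct values per column, sorted so the column order is deterministic.
    levels = {c: sorted({row[c] for row in rows}) for c in nominal_columns}
    header = [f"{c}={v}" for c in nominal_columns for v in levels[c]]
    encoded = [
        [1 if row[c] == v else 0 for c in nominal_columns for v in levels[c]]
        for row in rows
    ]
    return header, encoded


# Hypothetical toy data: 2 nominal attributes.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
]

header, encoded = binarize(rows, ["outlook", "windy"])
print(header)
for r in encoded:
    print(r)
```

Here 2 nominal attributes become 5 binary ones, which is the growth the question is worried about. A tree learner that splits on the nominal attribute directly never needs this step, which is why the same dataset can work unchanged with one algorithm and need conversion for another.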

Upvotes: 0
