Reputation: 85
I'm trying to make a small data science tool (kind of a mini version of WEKA). I have datasets with large numbers of features (70-100+), and they're mostly categorical. I'm using Python sklearn for the machine learning logic, and judging by the errors sklearn gives me, I need to convert these categories into numeric values.
Given this, one-hot encoding isn't an option because it would increase the dimensionality too much.
I've researched other approaches that might work, like frequency encoding and label encoding, but I'm not really sure which to choose in my case.
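To make it concrete, here's a rough sketch (with pandas, on a toy column; the column name and values are just placeholders, not my real data) of the two encodings I'm considering:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red"], name="color")

# Label/ordinal encoding: an arbitrary integer per category.
label_encoded = s.astype("category").cat.codes

# Frequency encoding: replace each category by how often it occurs.
freq_encoded = s.map(s.value_counts(normalize=True))

print(label_encoded.tolist())   # [2, 0, 2, 1, 2]
print(freq_encoded.tolist())    # [0.6, 0.2, 0.6, 0.2, 0.6]
```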
Also, how does WEKA actually handle these? I loaded my datasets into WEKA and they worked fine and gave me good results!
Upvotes: -1
Views: 459
Reputation: 2608
That depends on the algorithm: some handle categorical attributes natively, like J48 (Weka's C4.5 implementation), which performs multi-way splits on categorical attributes; others have to convert the data, like SMO (Weka's support vector machine), which binarizes nominal attributes and therefore increases the number of attributes to learn from.
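For an sklearn-side analogue (this is not what Weka does internally, just a hedged sketch with made-up column names and a toy DataFrame): you can ordinal-encode categories for tree-based models, which is the closest thing to handing J48 nominal attributes, and one-hot/binarize them for an SVM the way SMO does; OneHotEncoder produces a sparse matrix by default, so the extra columns stay cheap in memory.

```python
# Hedged sketch of the sklearn-side analogue (not Weka internals).
# The column names and the tiny DataFrame are made up for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size":  ["S", "M", "L", "M"],
    "label": [0, 1, 0, 1],
})
cat_cols = ["color", "size"]
X, y = df[cat_cols], df["label"]

# Tree-based models: integer codes are usually tolerable because splits
# are threshold-based (closest sklearn analogue to giving J48 nominals).
tree_model = make_pipeline(
    ColumnTransformer([("ord", OrdinalEncoder(), cat_cols)]),
    RandomForestClassifier(random_state=0),
).fit(X, y)

# SVM: binarize the nominal attributes, as SMO does; the one-hot output
# is sparse, so the added columns don't blow up memory.
svm_model = make_pipeline(
    ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)]),
    SVC(),
).fit(X, y)
```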
Upvotes: 0