m.hoko

Reputation: 22

How to handle large sets of categorical data

I'm a beginner in machine learning. I have a large data set with lots of categorical data. The data is nominal. I want to apply algorithms like SVM and decision trees with Python and scikit-learn to find patterns in the data.

My problem is that I don't know how best to handle that kind of data. I have read a lot about one-hot encoding, but the examples are all quite simple, like a feature with three different colors. My data has around 30 categorical features, and those features contain around 200 different "values". If I use simple one-hot encoding, the data frame gets really big and I can hardly run any algorithm on it because I run out of RAM.

So what's the best approach here? Use an SQL database for the encoded tables? How is this done in the "real" world?

Thanks in advance for your answers!

Upvotes: 0

Views: 1850

Answers (2)

Kubra Altun

Reputation: 405

To be honest, this problem has brought a huge new realization to me.

First of all, it is important to differentiate your categorical data based on its content, i.e. whether it is nominal or ordinal.

Nominal data is categorical data with no ranking or sequential relationship between its values, e.g. milk, egg, bread, and so on.

Ordinal data, on the other hand, has values with a sequential relationship, e.g. primary school, high school, college, master, and so on (in other words, if you assign 1 to primary school, you assign 2 to high school, since there is a ranking).
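To make the distinction concrete, here is a minimal sketch using pandas. The column names (`education`, `item`) and the toy rows are made up for illustration: the ordinal column is mapped to ranks by hand, while the nominal column is one-hot encoded with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy data: one ordinal column and one nominal column.
df = pd.DataFrame({
    "education": ["primary school", "college", "high school", "master"],
    "item": ["milk", "egg", "bread", "milk"],
})

# Ordinal: the order is meaningful, so map each level to its rank (1, 2, 3, ...).
education_order = ["primary school", "high school", "college", "master"]
education_rank = {level: rank for rank, level in enumerate(education_order, start=1)}
df["education_encoded"] = df["education"].map(education_rank)

# Nominal: no order exists, so create one indicator column per category.
item_dummies = pd.get_dummies(df["item"], prefix="item")

encoded = pd.concat([df[["education_encoded"]], item_dummies], axis=1)
print(encoded)
```

With only three items the one-hot part stays small; with hundreds of values per feature this is exactly the step that blows up the column count.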

After that, you can choose from many encoding approaches. For a detailed explanation, see: Smarter ways for encoding categorical data

Upvotes: 0

Rocky Li

Reputation: 5958

Scikit-learn's decision trees and random forests do not handle categorical features directly; the features have to be converted to numeric columns first, typically by one-hot encoding. Realistically, though, there is a slightly better alternative:


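This is called binary encoding: each category gets an integer code, and the bits of that code become separate columns, so a feature with 200 distinct values needs only 8 columns instead of 200. Every category still gets its own distinct pattern, which works much better than plain numerical encoding for categorical columns.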
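A hand-rolled sketch of the idea in plain Python, just to show the mechanics. The helper names `fit_binary_encoding` and `binary_encode` are hypothetical, and reserving code 0 for unseen values is my own assumption rather than part of any standard definition. In practice a ready-made implementation (for example the `BinaryEncoder` in the `category_encoders` package) is probably the more convenient choice.

```python
def fit_binary_encoding(values):
    """Assign each distinct category an integer code, starting at 1.

    Code 0 is reserved for values not seen during fitting (an assumption
    of this sketch).
    """
    categories = sorted(set(values))
    return {category: code for code, category in enumerate(categories, start=1)}


def binary_encode(values, codes):
    """Turn each value into a list of bits (most significant bit first)."""
    # Number of bits needed to represent the largest code.
    n_bits = max(1, len(codes).bit_length())
    rows = []
    for value in values:
        code = codes.get(value, 0)  # unseen values fall back to all zeros
        rows.append([(code >> bit) & 1 for bit in reversed(range(n_bits))])
    return rows


codes = fit_binary_encoding(["milk", "egg", "bread", "milk"])
bits = binary_encode(["milk", "egg", "bread", "tea"], codes)
print(codes)
print(bits)
```

Each inner list is one row of the new bit columns. The number of columns grows with the logarithm of the number of categories rather than linearly, which is where the memory saving comes from.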

Another way to approach this problem is clipping. The idea is to keep only the largest categories, e.g. all categories that each account for 5% or more of the values, and to encode everything else as a single 'tail' category. This is another way to reduce dimensionality.
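A minimal sketch of clipping with pandas, assuming the feature is a pandas Series. The function name `clip_rare_categories`, the 5% default and the `'tail'` label are illustrative choices, not a fixed API.

```python
import pandas as pd


def clip_rare_categories(column, min_share=0.05, tail_label="tail"):
    """Replace categories below `min_share` of all values with `tail_label`."""
    # Relative frequency of every category in the column.
    shares = column.value_counts(normalize=True)
    keep = shares[shares >= min_share].index
    # Keep frequent categories as they are, relabel everything else.
    return column.where(column.isin(keep), tail_label)


# Hypothetical toy column: two large categories and two rare ones.
colors = pd.Series(["red"] * 50 + ["blue"] * 46 + ["green"] * 2 + ["teal"] * 2)
clipped = clip_rare_categories(colors)
print(clipped.value_counts())
```

The clipped column can then be one-hot encoded as usual, with at most one column per frequent category plus one for the tail.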

Upvotes: 1
