Eklil Khan

Reputation: 63

How to use categorical data in a decision tree in Python

I have a dataset from which I have extracted 12 features for the task of coreference resolution using decision trees. Some examples of these features are:

After creating all these features from the dataset, I don't know how to select the root node or how to use the scikit-learn decision tree algorithm, because the data is unstructured and categorical. A paper I read mentioned entropy and information gain, but all the examples of these two measures are based on a structured dataset.

Upvotes: 0

Views: 7358

Answers (2)

Isbister

Reputation: 946

If you have diverse categorical features and don't want to spend time encoding them yourself, I would recommend the CatBoost framework, which accepts categorical features directly and can also be faster than the standard scikit-learn tree implementations.

Check this kaggle for implementation!

Upvotes: 1

Rocky Li

Reputation: 5958

Use one-hot encoding.

import pandas as pd
df = pd.get_dummies(df, columns=[categorical_columns_you_want_to_encode])
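As a minimal sketch of how the encoded frame could then be passed to scikit-learn: the toy columns (`gender_match`, `mention_type`, `coreferent`) and their values are invented for illustration and are not taken from the question's dataset. With `criterion="entropy"` the tree chooses each split, including the root, by information gain, so the root node does not have to be selected by hand.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "gender_match": ["yes", "no", "yes", "no", "yes", "no"],
    "mention_type": ["pronoun", "name", "name", "pronoun", "nominal", "nominal"],
    "coreferent": [1, 0, 1, 0, 1, 0],
})

feature_columns = ["gender_match", "mention_type"]

# One-hot encode the categorical columns; note the explicit `columns=` keyword.
X = pd.get_dummies(df[feature_columns], columns=feature_columns)
y = df["coreferent"]

# criterion="entropy" makes the tree choose splits by information gain.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Name of the encoded column used at the root node (node 0 of the fitted tree).
root_feature = X.columns[clf.tree_.feature[0]]
predictions = clf.predict(X)
```

Each category becomes its own 0/1 column (for example `gender_match_yes`), so a split in the tree reads as "is this value present or not".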

If you end up with too many columns, you can preprocess each column before encoding to drop values that are uncommon, e.g. those that make up less than 1% of the rows.
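One way this could look, as a sketch rather than a definitive recipe: values rarer than a chosen threshold are replaced with a single placeholder before encoding. The helper `group_rare`, the `"other"` label, and the sample column `head_word` are all hypothetical. Note that this variant groups the rare values into one column instead of dropping them outright.

```python
import pandas as pd

def group_rare(series, min_frequency=0.01, other_label="other"):
    """Replace values whose relative frequency is below min_frequency."""
    frequencies = series.value_counts(normalize=True)
    rare_values = frequencies[frequencies < min_frequency].index
    # Keep common values; every rare value becomes `other_label`.
    return series.where(~series.isin(rare_values), other_label)

# Hypothetical column: 'a' and 'b' are common, 'c' and 'd' appear once each.
df = pd.DataFrame({"head_word": ["a"] * 60 + ["b"] * 38 + ["c", "d"]})

df["head_word"] = group_rare(df["head_word"], min_frequency=0.02)
encoded = pd.get_dummies(df, columns=["head_word"])
```

Here four distinct values collapse into three encoded columns; on a real column with hundreds of rare values the reduction would be much larger.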

Upvotes: 0
