Reputation: 63
I have a dataset from which I have extracted 12 features for the task of coreference resolution using decision trees. Some examples of these features are:
- distance_feature(): the distance between mentions i and j, measured in sentences. Output: 0 or 1.
- Ispronoun_feature(): set to true if the noun phrase is a pronoun.
- appositive_feature(): checks whether j is in apposition to i.
After creating all these features, I don't know how to select the root node or how to apply scikit-learn's decision tree algorithm, because my data is unstructured and categorical. A paper I read mentioned entropy and information gain, but all the examples of these two measures are based on structured datasets.
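For reference, entropy and information gain can be computed directly over a categorical feature, without any special data structure. A minimal sketch (the helper names are my own, not part of scikit-learn):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after
    splitting on each distinct value of the feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# e.g. a binary feature that perfectly separates the classes has gain 1.0
gain = information_gain([1, 1, 0, 0], ["coref", "coref", "no", "no"])
```

An ID3-style tree would pick as the root the feature with the highest information gain over the full training set, then recurse on each split.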
Upvotes: 0
Views: 7358
Reputation: 946
If you have diverse features of different categories and don't want to spend time encoding them yourself, I would recommend the CatBoost framework, which handles categorical features natively and is also faster than the standard scikit-learn tree implementations.
Check this Kaggle notebook for an implementation!
Upvotes: 1
Reputation: 5958
Use one-hot encoding.
df = pd.get_dummies(df, columns=[categorical_columns_you_want_to_encode])
If that produces too many columns, you can preprocess the column first and group rare values (e.g. those occurring in less than 1% of rows) into a single category to avoid an explosion of columns.
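A short sketch of both steps, grouping rare categories before encoding (column names are made up for the example; the 1% threshold is the one suggested above):

```python
import pandas as pd

df = pd.DataFrame({
    "appositive": ["yes", "no", "no", "yes"],  # categorical feature
    "distance":   [0, 1, 1, 0],                # already numeric
})

# group values rarer than 1% into a single bucket before encoding
freq = df["appositive"].value_counts(normalize=True)
rare = freq[freq < 0.01].index
df["appositive"] = df["appositive"].where(~df["appositive"].isin(rare), "other")

# one-hot encode only the categorical column
encoded = pd.get_dummies(df, columns=["appositive"])
```

The resulting frame has indicator columns like `appositive_yes` and `appositive_no` alongside the untouched numeric columns, which scikit-learn's `DecisionTreeClassifier` can consume directly.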
Upvotes: 0