Niv

Reputation: 940

What's the proper way to represent a numerical categorical variable (specifically hour of day) in XGBoost?

Is it better to one-hot encode it or just leave it as a single numeric variable? I'm reading mixed conclusions on the net:

"Avoid OneHot for high cardinality columns and decision tree-based algorithms." https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

as opposed to

"(onehotencoded) This is the proper representation of a categorical variable for xgboost or any other machine learning tool." XGBoost Categorical Variables: Dummification vs encoding

Upvotes: 1

Views: 1752

Answers (1)

Mischa Lisovyi

Reputation: 3223

There are more than 2 schools of thought :). In practice, there are pros and cons to everything, and the optimal approach will depend on your data. So the usual path forward is to try all feasible options and choose the one that suits your use case best (not only in terms of metrics, but also in terms of CPU/RAM, if the data are not tiny).

For example, OHE adds one column per category, which can lead to a large memory footprint for long tables. OHE also loses ordinal information (if the feature was ordinal), although this might not be a problem, as trees often pick up the relevant dependencies of the target on the fly. On the other hand, a simple ordered numeric representation of the hour keeps memory low and preserves the order of the values. Its drawbacks are that it loses the information that hour 1 follows hour 24; it works with the tree boosters in xgboost, but not with the linear booster in xgboost or with other model families outside of xgboost (linear, SVM, etc.); and it is not theoretically sound for non-ordinal features (your question seemed general).
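To make the trade-off above concrete, here is a minimal sketch in plain Python (no xgboost involved). The sample hours and the helper `one_hot` are made up for illustration, and hours are assumed to be numbered 0 to 23. It only shows the shape of each representation and the gap that the ordered numeric form puts between adjacent hours across midnight.

```python
# Hypothetical sample of hour-of-day values (assumed to range over 0..23).
hours = [0, 1, 12, 23]

def one_hot(hour, n_categories=24):
    """Return a list with a single 1 at position `hour` (illustrative helper)."""
    row = [0] * n_categories
    row[hour] = 1
    return row

# One-hot encoding: one column per possible hour, regardless of the sample.
ohe_rows = [one_hot(h) for h in hours]

# Ordered numeric representation: a single column holding the hour itself.
ordinal_rows = [[h] for h in hours]

n_ohe_columns = len(ohe_rows[0])
n_ordinal_columns = len(ordinal_rows[0])

# In the numeric form, 23:00 and 00:00 look far apart although they are
# adjacent on the clock, while 00:00 and 01:00 look close.
gap_23_to_0 = abs(23 - 0)
gap_0_to_1 = abs(0 - 1)

print(n_ohe_columns, n_ordinal_columns, gap_23_to_0, gap_0_to_1)
```

The one-hot form costs 24 columns and treats every pair of hours as equally different; the numeric form costs one column but misrepresents the midnight boundary.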

Let me add a third school of thought that is applicable in this particular case: you can use cyclic encoding for features that have repetitive cycles (month of the year, hour of the day, etc.). The concept is to use sin and cos functions to encode each value with a fixed period (24 in the case of hour of the day). This keeps continuity at the edges and keeps memory under control: only 2 features replace the original ordered numeric representation, and the number of encoded features does not depend on cardinality. There are many discussions that one can find by googling, for example this question: https://datascience.stackexchange.com/q/5990/53060. I'm sure there are many implementations of it on the web; I personally use this one in Python: https://github.com/MaxHalford/xam/blob/master/docs/feature-extraction.md#cyclic-features. Of course, this does not apply to numerical categorical data in general, but to hour of the day specifically.
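The sin/cos idea can be sketched in a few lines of standard-library Python. This is not the implementation from the library linked above; `encode_cyclic` and `distance` are hypothetical helper names, and hours are assumed to be numbered 0 to 23.

```python
import math

def encode_cyclic(value, period):
    """Map a cyclic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two (sin, cos) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

PERIOD = 24  # hours in a day

h0 = encode_cyclic(0, PERIOD)
h1 = encode_cyclic(1, PERIOD)
h12 = encode_cyclic(12, PERIOD)
h23 = encode_cyclic(23, PERIOD)

# Neighbouring hours are equally close, including across midnight,
# and opposite hours (0 and 12) are the farthest apart.
d_midnight = distance(h23, h0)
d_regular = distance(h0, h1)
d_opposite = distance(h0, h12)

print(d_midnight, d_regular, d_opposite)
```

Both components are needed: sin alone (or cos alone) maps two different hours to the same value, whereas the pair identifies each hour uniquely.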

But as I said at the beginning, I personally would try all of them and see which best fits the problem at hand. Cyclic encoding may be the most conceptually sound for the hour of the day, but it might perform worse than other approaches, and it would be meaningless for a feature like "age group".

Upvotes: 2
