Reputation: 127
I have a data set with several features, one of those features are categorical but have tree structure on its value. For example, if this categorical features have value a, b, c, d, e, f, g, h, I, j, k. then following image reveal the tree relationship of the values:
the raw feature do not incorporate this relationship (so that feature only take one column). Now, I want to incorporate this relationship, but I still want the feature be vector form.
my solution for this is: create a binary value column for each node. so in this example, the feature can present by binary vector of length 11. And a feature value equal e can be represented as <1, 1, 0,1,0,0, 0, 0,0,0,0> (shows below)
where the 1st element indicate the first level b; 2nd element indicate the second level a; 3rd, 4th, 5th, and 6th element indicate third level d,e,g and j respectively; 7th element indicate second level c; 8th, 9th 10th and 11th element indicate third level f,h,i and k respectively.
The reason I think this would work is you can recover the tree from this vector representation, so I think the information is not lost during this transformation.
The main purpose for this transformation is I want to use some machine learning algorithm on this dataset, so I want the data set be more informative.
I want to know whether this transformation is valid, if not valid, why? And whether there is better way to do this.
Upvotes: 1
Views: 232