Reputation: 979
So I have a training data set like this (but much larger):
   Group  PID  Var1  Var2  Best
0    111    1     1     1     1
1    111    2     2     1     2
2    111    3     1     2     2
3    112    1     1     2     2
4    112    2     2     1     1
5    113    1     1     2     2
6    113    2     1     1     2
7    113    3     2     1     1
8    113    4     3     2     2
Each group (rows that share a Group number) contains a list of people (each with a unique PID within the group). Exactly one person in each group has Best = 1, and the rest have Best = 2. My goal is to use this training data to predict which person in each group is the best (Best = 1) based on Var1 and Var2.
I have played around with scikit-learn and have tried to use a random forest model to predict Best for the test data, but it does not account for the groups and can assign Best = 1 to more than one PID per group.
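To make the symptom concrete, here is a minimal sketch using made-up per-row scores for P(Best = 1). These numbers are purely hypothetical and are not output from my actual model; the 0.5 cutoff is also just an assumption for illustration. Because each row is decided on its own, one group can end up with two winners and another with none.

```python
from collections import defaultdict

# Hypothetical per-row scores for P(Best = 1); made-up numbers for
# illustration only, not output from a real model.
rows = [
    # (Group, PID, score)
    (111, 1, 0.70),
    (111, 2, 0.20),
    (111, 3, 0.65),
    (112, 1, 0.30),
    (112, 2, 0.40),
]

# Row-wise decision: each row is thresholded independently, so the
# grouping plays no part in who gets labelled Best = 1.
winners_per_group = defaultdict(list)
for group, pid, score in rows:
    if score >= 0.5:  # assumed cutoff
        winners_per_group[group].append(pid)

# Print the PIDs labelled Best = 1 in each group.
for group in sorted({g for g, _, _ in rows}):
    print(group, winners_per_group[group])
```

With these scores, group 111 gets two PIDs labelled Best = 1 and group 112 gets none, which is exactly the behaviour I want to avoid.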
I was wondering how to train/run the model so that it learns to assign a single Best = 1 per group instead of assigning it across all rows and groups. Pointing me in the direction of helpful resources would be just as good as I'm not exactly sure where to go for help on this.
Upvotes: 0
Views: 1068
Reputation: 77910
When a feature is not a well-ordered metric -- such as a discrete class label like your Group column -- we use one-hot encoding. For N classes (distinct values) of the original feature, we create a family of N features; in every row exactly one of them is "good" (usually 1), while the others are "bad" (typically 0). You can read these as a set of Boolean functions: isGroup111(), isGroup112(), ... Applied to your data:
   Group111  Group112  Group113  PID  Var1  Var2  Best
0         1         0         0    1     1     1     1
1         1         0         0    2     2     1     2
2         1         0         0    3     1     2     2
3         0         1         0    1     1     2     2
4         0         1         0    2     2     1     1
5         0         0         1    1     1     2     2
6         0         0         1    2     1     1     2
7         0         0         1    3     2     1     1
8         0         0         1    4     3     2     2
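As a rough sketch of how the table above can be produced, the following plain-Python snippet expands Group into one indicator column per distinct value. The helper name one_hot is made up for this example; in practice you would more likely reach for pandas.get_dummies or scikit-learn's OneHotEncoder, which do the same job.

```python
# Rows from the question's table: (Group, PID, Var1, Var2, Best)
rows = [
    (111, 1, 1, 1, 1),
    (111, 2, 2, 1, 2),
    (111, 3, 1, 2, 2),
    (112, 1, 1, 2, 2),
    (112, 2, 2, 1, 1),
    (113, 1, 1, 2, 2),
    (113, 2, 1, 1, 2),
    (113, 3, 2, 1, 1),
    (113, 4, 3, 2, 2),
]

# Distinct Group values, in a fixed order, define the new columns.
groups = sorted({row[0] for row in rows})


def one_hot(group):
    """Return a 0/1 list with a single 1 at the position of `group`.

    Hypothetical helper written for this sketch only.
    """
    return [1 if group == g else 0 for g in groups]


# Replace the Group value with its indicator columns; keep the rest as-is.
encoded = [one_hot(group) + list(rest) for group, *rest in rows]

header = [f"Group{g}" for g in groups] + ["PID", "Var1", "Var2", "Best"]
print(header)
for row in encoded:
    print(row)
```

Each encoded row carries exactly one 1 among the Group111/Group112/Group113 columns, so the model sees group membership as separate Boolean features rather than as a number with an implied ordering.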
Upvotes: 2