Reputation: 43
Essentially, this is what I have for xgboost:
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
where X_train and y_train are NumPy arrays.
My problem is that X_train seems to have to be a numeric matrix where each row is a set of numbers such as:
[1, 5, 3, 6]
However, the data I have is a set of vectors. Each vector consists of a class number from 1 to 3 and a confidence value between 0 and 1. So a row of my X_train would look like:
[[1, .84], [2, .5], [3, .44], [2, .76]]
However, I can't figure out how to pass data in this format to xgboost. I'm fairly new to xgboost, so I've been reading the documentation, but I can't seem to find what I'm looking for. Thanks for any help.
Upvotes: 2
Views: 6927
Reputation: 36619
Approach 1:
I would recommend making 3 columns for each system, holding the probabilities that system assigns to each of the 3 classes, and then combining these columns across all systems.
Something like this:
Index Sys1_Cls1 Sys1_Cls2 Sys1_Cls3 Sys2_Cls1 Sys2_Cls2 Sys2_Cls3 \
0 0.310903 0.521839 0.167258 0.034925 0.509087 0.455988
1 0.402701 0.315302 0.281997 0.044981 0.137326 0.817693
2 0.272443 0.409210 0.318347 0.591514 0.170707 0.237778
3 0.272599 0.304014 0.423388 0.175838 0.324275 0.499887
4 0.339352 0.341860 0.318788 0.574995 0.169180 0.255824
Index Sys3_Cls1 Sys3_Cls2 Sys3_Cls3 Sys4_Cls1 Sys4_Cls2 Sys4_Cls3
0 0.173293 0.279590 0.547117 0.441913 0.251394 0.306692
1 0.224656 0.425100 0.350244 0.430451 0.382072 0.187476
2 0.198573 0.603826 0.197600 0.412734 0.185472 0.401795
3 0.011399 0.598892 0.389709 0.057813 0.651510 0.290677
4 0.025087 0.478595 0.496317 0.539963 0.288596 0.171440
Here 'Sys1_Cls1' means the probability that System 1 assigns to class 1, and so on.
This can be your X. For y, assign the actual class you have for that sample.
So the shape of your X will be (n_samples, 12) and y will be (n_samples,).
If you have more systems, these can be similarly appended as features (Columns).
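As a rough sketch of Approach 1 applied to the format in the question: the question only gives each system's top class and its confidence, not a full probability for every class. So this sketch assumes the confidence goes into the column of the predicted class and the other two columns stay at 0. The helper name expand_row and the second sample are made up for illustration.

```python
import numpy as np

N_CLASSES = 3  # classes are numbered 1..3 in the question


def expand_row(row, n_classes=N_CLASSES):
    """Turn [[label, confidence], ...] into one flat feature row.

    Each system contributes n_classes columns. The reported confidence is
    written into the column of the predicted class; the remaining columns
    stay 0 (an assumption, since only the top class is reported).
    """
    features = np.zeros(len(row) * n_classes)
    for sys_idx, (label, confidence) in enumerate(row):
        features[sys_idx * n_classes + int(label) - 1] = confidence
    return features


raw = [
    [[1, .84], [2, .5], [3, .44], [2, .76]],  # row from the question
    [[3, .61], [3, .9], [1, .55], [2, .7]],   # made-up second sample
]

# One row per sample, 4 systems x 3 classes = 12 columns
X = np.vstack([expand_row(r) for r in raw])
print(X.shape)
```

The resulting X is an ordinary 2-D numeric array, so it can be passed to model.fit(X, y) like any other feature matrix. As far as I know, recent xgboost releases expect the labels in y to be encoded as 0, 1, 2 rather than 1, 2, 3, so y may need shifting.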
Approach 2: Alternatively, take the sum, average, or weighted average over all systems for each class. This gives you only 3 columns, regardless of the number of systems.
Index Cls1 Cls2 Cls3
0 0.187362 0.151723 0.660914
1 0.378118 0.293932 0.327950
2 0.424903 0.278271 0.296825
3 0.342273 0.274003 0.383723
4 0.405926 0.104094 0.489981
Where the particular values in a column can be:
cls1_val = Sys1_Cls1 + Sys2_Cls1 + Sys3_Cls1 + ...
OR
cls1_val = (Sys1_Cls1 + Sys2_Cls1 + Sys3_Cls1 + ...)/number_systems
OR
cls1_val = (weight1 * Sys1_Cls1 + weight2 * Sys2_Cls1 + weight3 * Sys3_Cls1 + ...)/(weight1 + weight2 + weight3 + ...)
Here the shape of X will be (n_samples, 3).
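A minimal NumPy sketch of the three aggregations, assuming the per-system probabilities are stored in an array of shape (n_samples, n_systems, 3). The single sample is the first row of the Approach 1 table, and the weights are hypothetical. Note that np.average divides by the sum of the weights.

```python
import numpy as np

# probs[i, s, c] = probability that system s assigns to class c for sample i
probs = np.array([
    [[0.310903, 0.521839, 0.167258],   # Sys1
     [0.034925, 0.509087, 0.455988],   # Sys2
     [0.173293, 0.279590, 0.547117],   # Sys3
     [0.441913, 0.251394, 0.306692]],  # Sys4
])  # shape (1, 4, 3)

# Collapse the systems axis (axis=1) so each variant has shape (n_samples, 3)
X_sum = probs.sum(axis=1)
X_mean = probs.mean(axis=1)

weights = np.array([2.0, 1.0, 1.0, 1.0])  # hypothetical per-system weights
X_weighted = np.average(probs, axis=1, weights=weights)

print(X_sum.shape, X_mean.shape, X_weighted.shape)
```

Any of the three arrays can be used as X. Which one works best is something to check by validation.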
Try the different approaches and keep whichever works best.
Note: By the way, what you are trying to achieve here, combining the probabilities from different systems and then predicting a final class, is called stacking. See these resources for more info:
Upvotes: 5