How to pass in multidimensional data to xgboost model

Question

Essentially, this is what I have for xgboost:

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)

Here X_train and y_train are numpy arrays. My problem is that X_train apparently has to be a numeric matrix in which each row is a flat list of numbers, such as:

[1, 5, 3, 6]

However, the data I have is in the format of a set of vectors. Each vector consists of a number between 1 and 3 and a confidence value between 0 and 1. So a row of my X_train would look like:

[[1, .84], [2, .5], [3, .44], [2, .76]]

However, I can't figure out how to pass data in this format to xgboost. I'm fairly new to xgboost, so I've been reading the documentation, but I can't seem to find what I'm looking for. Thanks for any help.

Vivek Kumar · Accepted Answer

Approach 1:

I would recommend making 3 columns for each system, holding that system's probabilities for each of the 3 classes, and then combining these columns across all systems.

Something like this:

Index   Sys1_Cls1  Sys1_Cls2  Sys1_Cls3  Sys2_Cls1  Sys2_Cls2  Sys2_Cls3  \
    0   0.310903   0.521839   0.167258   0.034925   0.509087   0.455988   
    1   0.402701   0.315302   0.281997   0.044981   0.137326   0.817693   
    2   0.272443   0.409210   0.318347   0.591514   0.170707   0.237778   
    3   0.272599   0.304014   0.423388   0.175838   0.324275   0.499887   
    4   0.339352   0.341860   0.318788   0.574995   0.169180   0.255824   

       Sys3_Cls1  Sys3_Cls2  Sys3_Cls3  Sys4_Cls1  Sys4_Cls2  Sys4_Cls3  
       0.173293   0.279590   0.547117   0.441913   0.251394   0.306692  
       0.224656   0.425100   0.350244   0.430451   0.382072   0.187476  
       0.198573   0.603826   0.197600   0.412734   0.185472   0.401795  
       0.011399   0.598892   0.389709   0.057813   0.651510   0.290677  
       0.025087   0.478595   0.496317   0.539963   0.288596   0.171440

Here 'Sys1_Cls1' means the probability that System 1 assigns to class 1, and so on.

This can be your X. For y, assign the actual class you have for that sample.

So the shape of your X will be (n_samples, 12) and the shape of y will be (n_samples,).

If you have more systems, these can be similarly appended as features (Columns).
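A minimal sketch of how the X for Approach 1 could be built with numpy, assuming every system's full probability vector over the 3 classes is available as a 3D array. The probabilities and labels below are synthetic placeholders, and the names probs and columns are made up for illustration:

```python
import numpy as np

n_samples, n_systems, n_classes = 5, 4, 3

# Synthetic stand-in data (an assumption, not real system output):
# probs[i, s, c] is the probability system s gives to class c for sample i.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(n_classes), size=(n_samples, n_systems))

# Merge the (system, class) axes into a single feature axis. The resulting
# column order is Sys1_Cls1, Sys1_Cls2, Sys1_Cls3, Sys2_Cls1, ...
X = probs.reshape(n_samples, n_systems * n_classes)
columns = [f"Sys{s + 1}_Cls{c + 1}"
           for s in range(n_systems) for c in range(n_classes)]

# Synthetic true classes, one per sample, encoded here as 0, 1, 2.
y = rng.integers(0, n_classes, size=n_samples)

print(X.shape, y.shape)  # (5, 12) (5,)
```

An X and y built this way are plain 2D and 1D numpy arrays, so they can go into model.fit(X, y) the same way as in the question.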

Approach 2:

Alternatively, take the sum, average, or weighted average over all systems for each class. With this approach you will have only 3 columns, regardless of the number of systems.

  Index       Cls1      Cls2      Cls3
    0       0.187362  0.151723  0.660914
    1       0.378118  0.293932  0.327950
    2       0.424903  0.278271  0.296825
    3       0.342273  0.274003  0.383723
    4       0.405926  0.104094  0.489981

The values in a column can be computed in any of these ways:

cls1_val = Sys1_Cls1 + Sys2_Cls1 + Sys3_Cls1 + ...

OR

cls1_val = (Sys1_Cls1 + Sys2_Cls1 + Sys3_Cls1 + ...)/number_systems

OR

cls1_val = (weight1 * Sys1_Cls1 + weight2 * Sys2_Cls1 + weight3 * Sys3_Cls1 + ...)/(weight1 + weight2 + weight3 + ...)

Here the shape of X will be (n_samples, 3).
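The three variants of Approach 2 can be sketched with numpy roughly as follows, again assuming a 3D array of per-system class probabilities. The data is synthetic and the weights are hypothetical values picked only for illustration. Note that np.average divides by the sum of the weights:

```python
import numpy as np

n_samples, n_systems, n_classes = 5, 4, 3

# Synthetic stand-in data (an assumption, not real system output):
# probs[i, s, c] is the probability system s gives to class c for sample i.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(n_classes), size=(n_samples, n_systems))

# Collapse the system axis (axis 1) so only one column per class remains.
X_sum = probs.sum(axis=1)    # plain sum over systems
X_mean = probs.mean(axis=1)  # simple average over systems

# Hypothetical per-system weights, e.g. reflecting how much each is trusted.
weights = np.array([0.4, 0.3, 0.2, 0.1])
X_weighted = np.average(probs, axis=1, weights=weights)  # normalized by weights.sum()

print(X_sum.shape, X_mean.shape, X_weighted.shape)  # (5, 3) (5, 3) (5, 3)
```

Since tree-based models such as xgboost split on feature thresholds, the sum and the simple average differ only by a constant factor and should behave the same; the weighted variant is the one that can actually change the result.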

Try with different approaches and keep what works best.

Note: By the way, what you are trying to achieve here, combining the probabilities of different systems and then predicting a final class, is called stacking. See these resources for more info:
