Abhishek Bhatia

Reputation: 9806

RandomForestClassifier not predicting probability for all classes

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(min_samples_leaf=20)
clf.fit(X_train, y)
prob_pos = clf.predict_proba(X_test)

Dimensions:

 (Pdb) print X_train.shape,X_test.shape,y.shape
    (1422392L, 14L) (233081L, 14L) (1422392L, 6L)

Output:

(Pdb) prob_pos
[array([[ 0.96133658,  0.03866342],
       [ 0.93514554,  0.06485446],
       [ 0.91520408,  0.08479592],
       ...,
       [ 0.95826389,  0.04173611],
       [ 0.97130832,  0.02869168],
       [ 0.93223876,  0.06776124]]), array([[ 0.9907225 ,  0.0092775 ],
       [ 0.94489664,  0.05510336],
       [ 0.98428571,  0.01571429],
       ...,
       [ 0.96415476,  0.03584524],
       [ 0.99193939,  0.00806061],
       [ 0.98918919,  0.01081081]]), array([[ 0.9907225 ,  0.0092775 ],
       [ 0.98253968,  0.01746032],
       [ 0.98166667,  0.01833333],
       ...,
       [ 0.96415476,  0.03584524],
       [ 0.99444444,  0.00555556],
       [ 0.99004914,  0.00995086]]), array([[ 1.        ,  0.        ],
       [ 0.99642857,  0.00357143],
       [ 0.98082011,  0.01917989],
       ...,
       [ 0.96978897,  0.03021103],
       [ 0.97467974,  0.02532026],
       [ 1.        ,  0.        ]]), array([[ 1.        ,  0.        ],
       [ 1.        ,  0.        ],
       [ 0.98238095,  0.01761905],
       ...,
       [ 1.        ,  0.        ],
       [ 0.99661017,  0.00338983],
       [ 0.99428571,  0.00571429]]), array([[ 1.        ,  0.        ],
       [ 1.        ,  0.        ],
       [ 0.99285714,  0.00714286],
       ...,
       [ 0.99705882,  0.00294118],
       [ 0.97885167,  0.02114833],
       [ 0.98688312,  0.01311688]])]

I don't understand why the probability output does not have shape (number of X_test samples) x 6.

Upvotes: 0

Views: 681

Answers (1)

lanenok

Reputation: 2749

Since y.shape is (1422392L, 6L), you have 6 separate outputs. Therefore, you get a list of 6 arrays as the probability output. Since each of the arrays has 2 columns, I conclude that you have 2 classes for each output. Are there indeed 2 classes? Then everything looks fine to me.

If the 6 classes are one-hot-encoded like [1,0,0,0,0,0], this is effectively 2 classes for each of 6 outputs. Then the first array in the list gives you the "0" and "1" probabilities of the first output, the second array the "0" and "1" probabilities of the second output, and so on.

You are effectively solving a multi-output problem, as described in the scikit-learn documentation; see "1.10.3. Multi-output problems".
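To make the list-of-arrays behaviour concrete, here is a minimal sketch on synthetic data. The array sizes, the 3-column target and the n_estimators value are assumptions chosen for illustration, not taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                      # hypothetical features
# Hypothetical 2D binary target: each of the 3 columns is treated as its own output.
Y = (rng.rand(200, 3) > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)
proba = clf.predict_proba(X[:5])

# One array per output column; each array is (n_samples, n_classes of that output).
print(type(proba), len(proba))
print([p.shape for p in proba])
```

With a 2D y, the result is a list with one entry per column of y rather than a single 2D array, which matches what the question shows.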

The simplest way to get probabilities of 6 classes would be to encode your classes as 1,2,3,4,5,6 and get y with 1 column. Then you will get one array with 6 columns as probabilities.
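A minimal sketch of that re-encoding, assuming every row of y is strictly one-hot (exactly one 1 per row). The synthetic data and the names Y_onehot and y_single are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 4)                       # hypothetical features
labels = np.arange(300) % 6                # hypothetical class ids 0..5, all present
Y_onehot = np.eye(6, dtype=int)[labels]    # shape (300, 6), one 1 per row

# Collapse the one-hot rows to a single label column with values 1..6.
# This is only valid if each row contains exactly one 1.
y_single = Y_onehot.argmax(axis=1) + 1

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y_single)
proba = clf.predict_proba(X[:5])

print(proba.shape)     # one array, one column per class
print(clf.classes_)    # column order of proba follows clf.classes_
```

The columns of the resulting array are ordered as in clf.classes_, so column 0 corresponds to class 1 here.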

If several classes can be present in the same row, like [1,0,1,0,0,1], then your problem is intrinsically multi-output (in my comment it says 'multi-class', which is a misprint). To get probabilities of the 6 classes you need to collect the second column of each array in the list. The code is

prob_nx6 = np.array([arr[:,1] for arr in prob_pos]).T

While editing this answer I came up with simpler code:

prob_nx6 = np.hstack(prob_pos)[:,1::2] 

This will give you a 2D array of shape (n,6), where n is the number of rows passed to predict_proba (n=233081 for your X_test). If you want a list of n arrays each of length 6, the simple code is

prob_nx6_liofarr = list(np.hstack(prob_pos)[:,1::2]) 

If each element of this list must itself be a list and not an array (that is, a list of lists), the code is

prob_nx6_liofli = np.hstack(prob_pos)[:,1::2].tolist() 
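As a quick sanity check, the sketch below builds a stand-in for prob_pos (6 arrays of shape (n, 2), filled with made-up numbers) and confirms that the list-comprehension version and the hstack version give the same (n, 6) array. It assumes every array has exactly two columns; an output whose training column contained only one class would yield a single-column array, and neither expression would work as intended.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 4

# Stand-in for prob_pos: 6 arrays of shape (n, 2) whose rows sum to 1.
prob_pos = []
for _ in range(6):
    p1 = rng.rand(n)
    prob_pos.append(np.column_stack([1 - p1, p1]))

# Version 1: take the second column ("1" probability) of each array.
a = np.array([arr[:, 1] for arr in prob_pos]).T
# Version 2: stack side by side into (n, 12), then keep every odd column.
b = np.hstack(prob_pos)[:, 1::2]

print(a.shape, np.array_equal(a, b))
```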

Upvotes: 2
