Reputation: 9806
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(min_samples_leaf=20)
clf.fit(X_train, y)
prob_pos = clf.predict_proba(X_test)
Dimensions:
(Pdb) print X_train.shape,X_test.shape,y.shape
(1422392L, 14L) (233081L, 14L) (1422392L, 6L)
Output:
(Pdb) prob_pos
[array([[ 0.96133658, 0.03866342],
[ 0.93514554, 0.06485446],
[ 0.91520408, 0.08479592],
...,
[ 0.95826389, 0.04173611],
[ 0.97130832, 0.02869168],
[ 0.93223876, 0.06776124]]), array([[ 0.9907225 , 0.0092775 ],
[ 0.94489664, 0.05510336],
[ 0.98428571, 0.01571429],
...,
[ 0.96415476, 0.03584524],
[ 0.99193939, 0.00806061],
[ 0.98918919, 0.01081081]]), array([[ 0.9907225 , 0.0092775 ],
[ 0.98253968, 0.01746032],
[ 0.98166667, 0.01833333],
...,
[ 0.96415476, 0.03584524],
[ 0.99444444, 0.00555556],
[ 0.99004914, 0.00995086]]), array([[ 1. , 0. ],
[ 0.99642857, 0.00357143],
[ 0.98082011, 0.01917989],
...,
[ 0.96978897, 0.03021103],
[ 0.97467974, 0.02532026],
[ 1. , 0. ]]), array([[ 1. , 0. ],
[ 1. , 0. ],
[ 0.98238095, 0.01761905],
...,
[ 1. , 0. ],
[ 0.99661017, 0.00338983],
[ 0.99428571, 0.00571429]]), array([[ 1. , 0. ],
[ 1. , 0. ],
[ 0.99285714, 0.00714286],
...,
[ 0.99705882, 0.00294118],
[ 0.97885167, 0.02114833],
[ 0.98688312, 0.01311688]])]
I don't understand why the probability output is not a single array of shape n_samples x 6.
Upvotes: 0
Views: 681
Reputation: 2749
Since y.shape is (1422392L, 6L), you have 6 separate outputs. That is why predict_proba returns a list of 6 arrays. Each array has 2 columns, so each output has 2 classes. If each output really is binary, the result is what you should expect.
If the 6 classes are one-hot encoded, like [1,0,0,0,0,0], this is effectively 6 outputs with 2 classes each. The first array in the list then gives the "0" and "1" probabilities for the first output, the second array gives them for the second output, and so on.
In practice you are solving a multi-output problem, as described in the scikit-learn documentation; see "1.10.3. Multi-output problems".
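To see this list-of-arrays structure on a small scale, here is a minimal sketch. The data is random and made up (X_demo, y_demo, clf_demo and proba_demo are illustrative names, not from the question), and it assumes every output column contains both classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)                    # 200 samples, 4 features (made-up data)
y_demo = rng.randint(0, 2, size=(200, 3))    # 3 binary output columns

clf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
clf_demo.fit(X_demo, y_demo)
proba_demo = clf_demo.predict_proba(X_demo[:5])

# A 2D y with several columns gives one probability array per output column,
# and each array has one column per class of that output.
print(len(proba_demo))                  # number of outputs
print([p.shape for p in proba_demo])    # (n_rows_predicted, n_classes) per output
```

With 3 output columns and 5 rows passed to predict_proba, this should print a length of 3 and three shapes of (5, 2), which mirrors the 6 arrays with 2 columns each in the question.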
The simplest way to get probabilities for 6 classes is to encode your classes as 1,2,3,4,5,6 so that y has a single column. You will then get one array with 6 columns of probabilities.
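A rough sketch of that re-encoding, assuming y really is one-hot (exactly one 1 per row). The data is synthetic, and y_onehot, y_labels and the other names are hypothetical; argmax is just one way to collapse the columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)                        # made-up features
labels_true = np.arange(300) % 6                 # every class 0..5 appears
y_onehot = np.eye(6, dtype=int)[labels_true]     # shape (300, 6), one 1 per row

# Collapse each one-hot row into a single label in 1..6
y_labels = y_onehot.argmax(axis=1) + 1

clf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
clf_demo.fit(X_demo, y_labels)
proba_demo = clf_demo.predict_proba(X_demo[:5])

print(proba_demo.shape)     # one array: (rows predicted, number of classes)
print(clf_demo.classes_)    # column order of proba_demo
```

The columns of the result follow clf_demo.classes_, so column 0 holds the probability of class 1, column 1 that of class 2, and so on. Each row sums to 1, which is not guaranteed for the multi-output variant.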
If a row can have several classes set at once, like [1,0,1,0,0,1], then your problem is intrinsically multi-output (my comment says 'multi-class', which is a misprint). To get the probabilities of the 6 classes, collect the second column of each array in the list. The code is
prob_nx6 = np.array([arr[:,1] for arr in prob_pos]).T
While editing this answer I came up with simpler code:
prob_nx6 = np.hstack(prob_pos)[:,1::2]
This gives you a 2D array of shape (n,6), where n is the number of rows in X_test (233081 in your case). If you want a list of n arrays, each of length 6, the code is
prob_nx6_liofarr = list(np.hstack(prob_pos)[:,1::2])
If each element of this list must itself be a list rather than an array (that is, a list of lists), the code is
prob_nx6_liofli = np.hstack(prob_pos)[:,1::2].tolist()
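As a sanity check, the two ways of collecting the "1" probabilities can be compared on a small stand-in for prob_pos. This is only a sketch: prob_demo is fabricated with NumPy and is not real classifier output. Both one-liners assume that every array in the list has exactly 2 columns. As far as I know, an output that contains only one class in the training data gets a single-column array, in which case the [:,1::2] slice would pick the wrong columns, so the sketch checks for that first.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 4
p1 = rng.rand(6, n)    # made-up "1" probabilities for 6 outputs, n samples

# Stand-in for prob_pos: a list of 6 arrays of shape (n, 2) whose rows sum to 1
prob_demo = [np.column_stack([1 - p, p]) for p in p1]

# Guard: both one-liners below rely on exactly 2 columns per output
assert all(arr.shape[1] == 2 for arr in prob_demo)

via_comprehension = np.array([arr[:, 1] for arr in prob_demo]).T
via_hstack = np.hstack(prob_demo)[:, 1::2]

print(via_comprehension.shape)                         # (n, 6)
print(np.array_equal(via_comprehension, via_hstack))   # both give the same array
```

If the guard fails for real data, selecting columns through each estimator output's classes is safer than relying on fixed positions.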
Upvotes: 2