Minahil

Reputation: 21

How does voting work between two tree-based classifiers that are combined using hard voting in scikit-learn?

For a classification task, I am using sklearn's VotingClassifier to ensemble a Random Forest and an Extra-Trees classifier with the parameter voting='hard'. I don't understand how this can work correctly, since both tree-based models already produce their final prediction using a voting technique. How can they be combined using hard voting? And what happens if there is a tie between the two models? Can anyone explain this with an example?

Upvotes: 0

Views: 1123

Answers (1)

jottbe

Reputation: 4521

You can look this up in the source code of the voting classifier. In short, it doesn't make much sense to use only two classifiers with hard voting; rather use soft voting.

The reason is that in hard-voting mode, the sklearn VotingClassifier takes the majority vote, and with only two classifiers it only gets interesting when there is a tie. Each base model's internal voting is invisible to the VotingClassifier: it only sees each model's final predicted label, so each model counts as a single voter. In a binary classification, if there are as many votes for 0 as for 1, the hard-voting classifier will vote for 0.
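For illustration, here is a minimal sketch of the setup the question describes (this snippet is not from the original answer; the toy dataset and default estimator settings are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)

# toy binary classification data (placeholder for your real dataset)
X, y = make_classification(n_samples=200, random_state=0)

# voting='hard': each base model casts one vote, its final class label.
# voting='soft': the predicted class probabilities are averaged instead,
# which avoids the tie problem described above.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))

The internal voting of each forest happens before the ensemble sees anything; the VotingClassifier only combines the two final outputs.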

You can test this yourself by simulating the code that it executes:

First set up the data for the experiment:

import numpy as np
# create a random int array with values 0 and 1
# with 20 rows (20 predictions) of 10 voters (10 columns)
a = np.random.randint(0, 2, size=(20,10))

# then produce some tie-lines with different combinations
a[0,:] = [0]*5 + [1]*5  # the first 5 predict 0 the next 5 predict 1
a[1,:] = [1]*5 + [0]*5  # vice versa
a[2,:] = [0,1]*5 # alternating votes, starting with 0
a[3,:] = [1,0]*5 # alternating votes, starting with 1

# if you want to check how many ones each row contains, do:
a.sum(axis=1)

Now see what the voter code does with this. The voter code for hard voting is as follows (the snippet simulates the case where you have 10 equally weighted classifiers, weights=[1.0]*10):

# per row: count the weighted votes per class, then pick the class
# with the highest count (argmax)
np.apply_along_axis(
    lambda x: np.argmax(np.bincount(x, weights=[1.0]*10)),
    axis=1, arr=a)

The result is:

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

You see that for the first four entries (the ties we manually introduced), the result is 0. So in case of a tie, the voting classifier will always choose 0 (if you give each classifier the same weight). Note that the weights are not only used to resolve ties: you could give one classifier double the weight of another, so you might even get a tie with 3 classifiers this way. But whenever the summed prediction weight of all 0-predicting classifiers equals the summed prediction weight of all 1-predicting classifiers, the voting classifier will predict 0 rather than 1.
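To see why ties resolve to 0, recall that np.argmax returns the first index of the maximum. A quick check (the weighted 3-classifier case is my own added illustration of the point above):

import numpy as np

# unweighted tie between two voters: counts are [1, 1], argmax picks
# the first maximum, i.e. class 0
print(np.argmax(np.bincount(np.array([0, 1]))))  # -> 0

# weighted tie with 3 classifiers: the 0-voter has weight 2.0, the two
# 1-voters have weight 1.0 each, so the weighted counts are [2.0, 2.0]
# and the ensemble again predicts 0
votes = np.array([0, 1, 1])
print(np.argmax(np.bincount(votes, weights=[2.0, 1.0, 1.0])))  # -> 0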

Here are the relevant references: the Sklearn Voting code and the description of numpy.argmax.

Upvotes: 2
