Reputation: 2069
I'm experimenting with several sklearn classifiers in a Voting Classifier for ensembling.
To test, I have a dataframe with a set of columns that represent tool skills (a numerical value from 0 to 10 indicating how well the person knows the skill) and a "Fit to Job" column that is the class variable. Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=["Python", "Scikit-learn", "Pandas", "Fit to Job"])
total_mock_samples = 100
# Fill the dataframe with mock data (DataFrame.append requires pandas < 2.0)
for i in range(total_mock_samples):
    df = df.append(mockResults(df.columns, 'Fit to Job', good_values=i > total_mock_samples/2), ignore_index=True)
# Output like:
print(np.array(df))
#[[1. 3. 6. 1.]
# [3. 2. 3. 0.]
# [1. 4. 0. 0.]
# ...
# [7. 8. 8. 1.]
# [8. 7. 9. 1.]]
Then I set up my ensemble of classifiers:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np
X = np.array(df[df.columns[:-1]])
y = np.array(df[df.columns[-1]])
rfc = RandomForestClassifier(n_estimators=10)
svc = SVC(kernel='linear')
knn = KNeighborsClassifier(n_neighbors=5)
nb = GaussianNB()
lr = LinearRegression()
ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc), ("Linear Reg.",lr)])
Finally, I try to evaluate it with cross-validation, like so:
cval_score = cross_val_score(ensemble, X, y, cv=10)
But I'm getting the following error:
TypeError Traceback (most recent call last)
<ipython-input-13-f7c01fa872d2> in <module>
182 ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc), ("Linear Reg.",lr)])
183
--> 184 cval_score = cross_val_score(ensemble, X, y, cv=10)
[...]
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
I've checked other answers, but they all refer to numpy data conversions, while this error is raised inside the cross-validation step. I tried to apply their solutions with no luck.
I've also tried changing the data type before calculating the score, without success.
Maybe someone has a keener eye and can see where the problem is.
EDIT 01: Mock results generator function
import random

def mockResults(columns, result_column_name='Fit', min_value=0, max_value=10, good_values=False):
    mock_res = {}
    for column in columns:
        mock_res[column] = 0
        if column == result_column_name:
            # Class label: 1.0 for a good fit, 0.0 otherwise
            if good_values == True:
                mock_res[column] = float(1)
            else:
                mock_res[column] = float(0)
        elif good_values == True:
            # Skill value drawn from the upper part of the range
            mock_res[column] = float(random.randrange(int(max_value*0.7), max_value))
        else:
            # Skill value drawn from the lower half of the range
            mock_res[column] = float(random.randrange(min_value, int(max_value*0.5)))
    return mock_res
Upvotes: 2
Views: 304
Reputation: 1907
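For context, the TypeError in the question appears to come from the hard-voting step rather than from the data itself. LinearRegression is a regressor, so its predict returns floats. Once those are stacked with the other estimators' integer labels, the whole array becomes float64, and VotingClassifier counts hard votes with np.bincount, which only accepts integers. A minimal sketch of that failure using NumPy alone (the vote arrays are made-up values, not output from the models in the question):

```python
import numpy as np

# Votes from five estimators for one sample. The last value mimics a
# regressor's float output, which makes the whole array float64.
votes_float = np.array([1, 0, 1, 1, 0.73])

try:
    np.bincount(votes_float)
    error_message = None
except TypeError as exc:
    # Same kind of "Cannot cast array data ... 'safe'" message as in the question
    error_message = str(exc)

print(error_message)

# Integer votes are counted without trouble; argmax picks the majority label.
votes_int = np.array([1, 0, 1, 1, 0])
winner = int(np.argmax(np.bincount(votes_int)))
print(winner)
```

This is why casting the regressor's output to int64 makes the error go away, although truncating a regression output is only a rough stand-in for a class label.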
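One workaround is to subclass LinearRegression so that predict returns integers, which lets the voting step count them as class labels. Tested here on random data:
df = pd.DataFrame(columns=["Python", "Scikit-learn", "Pandas", "Fit to Job"], data=np.random.randint(1, 10,size=(400,4)))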
class LinearRegressionInt(LinearRegression):
    def predict(self, X):
        predictions = self._decision_function(X)
        return np.asarray(predictions, dtype=np.int64).ravel()
...
lr = LinearRegressionInt()
...
ensemble = VotingClassifier(estimators=[("lr",lr),("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc)] )
cval_score = cross_val_score(ensemble, X, y, cv=10)
cval_score
array([ 0.09090909, 0.11904762, 0.17073171, 0.14634146, 0.17073171,
0.15384615, 0.07692308, 0.15384615, 0.10810811, 0.08108108])
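An alternative to casting a regressor's output, offered here as a sketch rather than as part of the original answer: swap in LogisticRegression, which is a classifier and already returns class labels. The data below is a hypothetical stand-in for the question's mock data (low skills for class 0, high skills for class 1), and the short estimator names are arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 50 low-skill rows (class 0) and
# 50 high-skill rows (class 1), with three skill columns each.
X = np.vstack([rng.integers(0, 5, size=(50, 3)),
               rng.integers(7, 10, size=(50, 3))]).astype(float)
y = np.array([0] * 50 + [1] * 50)

ensemble = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", GaussianNB()),
    ("svc", SVC(kernel="linear")),
    ("logreg", LogisticRegression()),  # a classifier, so it predicts class labels
])

# Every estimator now votes with a class label, so hard voting can count them.
cval_score = cross_val_score(ensemble, X, y, cv=10)
print(cval_score)
```

Note that the scores printed above for the LinearRegressionInt run are close to chance because that run uses a random "Fit to Job" column with nine possible values, so they say little about the ensemble itself.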
Reference: An Typeerror with VotingClassifier
Upvotes: 1