Reputation: 3885
I want to predict multiple outcomes based on one input parameter with scikit-learn's MultiOutputClassifier.
For some reason I always get this error, and I do not know what the problem is:
ValueError: Unknown label type: 'continuous-multioutput'
I have tried to make my_data['Clicked'] categorical with my_data['Clicked'] = my_data['Clicked'].astype('category'), but it gives me the same error.
I have tried the same code on a simple dummy dataset and it works perfectly. This is the code that works:
import pandas as pd
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
'par_2': [1, 3, 1, 2, 3, 3, 2],
'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
variables = df.iloc[:,:-1]
results = df.iloc[:,-1]
multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
multi_output_clf.fit(results.values.reshape(-1, 1),variables)
x = multi_output_clf.predict([[100]])
print(x)
For the code above, everything works, but for the code below, I get the error.
I do not know what the problem is, because I have only used a bigger dataset, and the values from which I predict the parameters are only 0s and 1s. Zeros and ones should be classes (categories) like yes and no, but if I change them to 'yes' and 'no', I get an error that a string cannot be converted to float. Why is 'outcome': [101, 905, 182, 268, 646, 624, 465] not continuous, while a series of 0s and 1s is?
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
variables = my_data[['Clicked']] #values are integers, only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are integers and floats
multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
multi_output_clf.fit(variables.values.reshape(-1, 1),results)
x = multi_output_clf.predict([1])
print(x)
Below is part of the full dataset that I used (it gives me the same error):
dic = {'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
'Age': [23,33,48,30,20,49,37,48],
'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
'Clicked': [0, 0, 1, 0, 0, 1, 0, 1]}
my_data = pd.DataFrame(dic)
variables = my_data[['Clicked']] #values are only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are integers and floats
multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs', multi_class='ovr'))
multi_output_clf.fit(variables.values.reshape(-1, 1),results)
x = multi_output_clf.predict([1])
print(x)
Upvotes: 6
Views: 24710
Reputation: 16966
I think you need MultiOutputRegressor(), as your output variables seem to be continuous.
Try the following change:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression
variables = my_data[['Clicked']] #values are only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are integers and floats
multi_output_clf = MultiOutputRegressor(LinearRegression())
multi_output_clf.fit(variables.values.reshape(-1, 1),results)
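For completeness, here is a minimal self-contained sketch of the regressor approach, using the sample rows from the question. The names X, y, reg and pred, and the input value [[1]] passed to predict, are illustrative choices only. Note that predict expects a 2D array (one row per sample).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

# Sample rows taken from the question.
my_data = pd.DataFrame({
    'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
    'Age': [23, 33, 48, 30, 20, 49, 37, 48],
    'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
    'Clicked': [0, 0, 1, 0, 0, 1, 0, 1],
})

# One input column, three output columns.
X = my_data[['Clicked']].to_numpy()
y = my_data[['Daily Time on Site', 'Age', 'Gender']].to_numpy()

# Fits one LinearRegression per output column.
reg = MultiOutputRegressor(LinearRegression())
reg.fit(X, y)

# One sample in, one row of three predicted values out.
pred = reg.predict([[1]])
print(pred)
```

Keep in mind that a regressor also returns a continuous value for Gender, so that column may be better handled by a separate classifier.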
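Update: if you would rather keep the classifier, you can bucket a continuous column into categories with pd.cut: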
>>> pd.cut(my_data['Daily Time on Site'],
... 3, labels=["low", "medium", "high"])
0 low
1 high
2 medium
3 medium
4 medium
5 low
6 high
7 medium
Note: It is not advisable to use raw integers as your categories, because the number of categories can become very large when the variable has a wide range of values. Bucket them into a smaller number of groups, say 10 or 20, and then treat those groups as categorical values.
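The bucketing idea can be combined with MultiOutputClassifier roughly as sketched below. This is an assumption-laden example, not a tuned model: the choice of 3 bins, the column names time_bin and age_bin, and the use of labels=False (which makes pd.cut return integer bin indices instead of string labels) are all illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Sample rows taken from the question.
my_data = pd.DataFrame({
    'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
    'Age': [23, 33, 48, 30, 20, 49, 37, 48],
    'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
    'Clicked': [0, 0, 1, 0, 0, 1, 0, 1],
})

# Bucket the wide-ranging columns into 3 equal-width bins each.
# labels=False yields integer bin indices (0, 1, 2), so every target
# column holds a small set of discrete classes.
targets = pd.DataFrame({
    'time_bin': pd.cut(my_data['Daily Time on Site'], 3, labels=False),
    'age_bin': pd.cut(my_data['Age'], 3, labels=False),
    'Gender': my_data['Gender'],
})

# Fits one LogisticRegression per target column.
clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
clf.fit(my_data[['Clicked']].to_numpy(), targets.to_numpy())

# One predicted class per target column for a single input row.
pred = clf.predict([[1]])
print(pred)
```

With only eight rows the predictions are not meaningful; the point is that the fit no longer raises the "Unknown label type" error once every target column is discrete.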
Upvotes: 2
Reputation: 3885
I found a solution: to eliminate this error, all output variables, in this case ['Daily Time on Site', 'Age', 'Gender'], need to contain whole-number values, not floats with a fractional part.
dic = {'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
'Age': [23,33,48,30,20,49,37,48],
'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
'Clicked': [0, 0, 1, 0, 0, 1, 0, 1]}
my_data = pd.DataFrame(dic)
my_data['Daily Time on Site']= my_data['Daily Time on Site'].round(0) #round to whole numbers so the column is treated as classes
variables = my_data[['Clicked']] #values are only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are now all whole numbers
multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs', multi_class='ovr'))
multi_output_clf.fit(variables.values.reshape(-1, 1),results)
x = multi_output_clf.predict([[1]]) #predict expects a 2D array: one row per sample
print(x)
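The reason rounding helps can be checked with scikit-learn's type_of_target helper, which, as far as I can tell, is what the classifier's target check relies on. The small arrays below are the first three rows of the sample data; the names raw and rounded are just for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# First three rows of ['Daily Time on Site', 'Age', 'Gender'].
raw = np.array([[59.99, 23, 1],
                [88.91, 33, 0],
                [66.00, 48, 1]])

# Same rows after rounding: still float dtype, but whole-number values.
rounded = raw.round(0)

raw_type = type_of_target(raw)
rounded_type = type_of_target(rounded)
binary_type = type_of_target(np.array([0, 0, 1]))

print(raw_type)      # continuous-multioutput
print(rounded_type)  # multiclass-multioutput
print(binary_type)   # binary
```

So the 0/1 'Clicked' column was never the problem: the error comes from the target columns passed as the second argument to fit, where 'Daily Time on Site' has fractional values.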
Upvotes: 1