taga

Reputation: 3885

ValueError: Unknown label type: 'continuous-multioutput' when fitting data

I want to predict multiple outcomes from one input parameter with scikit-learn's MultiOutputClassifier. I always get this error, and I do not know what the problem is:

ValueError: Unknown label type: 'continuous-multioutput'

I have tried to make my_data['Clicked'] categorical with my_data['Clicked'] = my_data['Clicked'].astype('category'), but I get the same error.

I tried the same code on a simple dummy dataset and it works perfectly. This is the code that works:

import pandas as pd
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
multi_output_clf.fit(results.values.reshape(-1, 1),variables)

x = multi_output_clf.predict([[100]])
print(x)

The code above works, but the code below raises the error. I do not know what the problem is, because I have only used a bigger dataset, and the values I predict from are only 0s and 1s. Zeros and ones should be classes (categories), like yes and no, but if I change them to 'yes' and 'no' I get an error that a string cannot be converted to float. Why is 'outcome': [101, 905, 182, 268, 646, 624, 465] not treated as continuous, while a series of 0s and 1s is?

from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

variables = my_data[['Clicked']] #values are integers, only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are integers and floats

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
multi_output_clf.fit(variables.values.reshape(-1, 1),results)

x = multi_output_clf.predict([[1]])  # predict expects a 2D array: one row, one feature
print(x)

Below is part of the full dataset that I used (it gives me the same error):

dic = {'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
       'Age': [23,33,48,30,20,49,37,48],
       'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
       'Clicked': [0, 0, 1, 0, 0, 1, 0, 1]}

my_data = pd.DataFrame(dic)

variables = my_data[['Clicked']] #values are only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are integers and floats

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs', multi_class='ovr'))
multi_output_clf.fit(variables.values.reshape(-1, 1),results)

x = multi_output_clf.predict([[1]])  # predict expects a 2D array: one row, one feature
print(x)

Upvotes: 6

Views: 24710

Answers (2)

Venkatachalam

Reputation: 16966

I think you need to use MultiOutputRegressor(), as your output variables seem to be continuous.

Try the following change:


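from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression

variables = my_data[['Clicked']] #values are only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are integers and floats

# one LinearRegression is fitted per output column
multi_output_clf = MultiOutputRegressor(LinearRegression())
multi_output_clf.fit(variables.values.reshape(-1, 1),results)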

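Update: if you want to keep using a classifier, bucket the continuous columns into categories first, for example with pd.cut: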

>>> pd.cut(my_data['Daily Time on Site'],
...        3, labels=["low", "medium", "high"])

0       low
1      high
2    medium
3    medium
4    medium
5       low
6      high
7    medium

Note: It is not advisable to use raw integers as categories, because the number of categories can become very large when the variable has a wide range of values. Bucket them into a small number of groups, say 10 or 20, and treat those groups as the categorical values.
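
The bucketing idea can be sketched end to end as follows. This is a minimal sketch rather than a tuned model: it assumes three equal-width bins per continuous column, passes labels=False so that pd.cut returns integer bin codes, and the names targets, time_bucket and age_bucket are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

my_data = pd.DataFrame({
    'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
    'Age': [23, 33, 48, 30, 20, 49, 37, 48],
    'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
    'Clicked': [0, 0, 1, 0, 0, 1, 0, 1],
})

# Bucket each continuous column into 3 equal-width bins; labels=False
# returns the integer bin index (0, 1, 2) instead of an interval label.
targets = pd.DataFrame({
    'time_bucket': pd.cut(my_data['Daily Time on Site'], 3, labels=False),
    'age_bucket': pd.cut(my_data['Age'], 3, labels=False),
    'Gender': my_data['Gender'],
})

X = my_data[['Clicked']]

# Every target column now holds a small set of integer classes,
# so the classifier accepts it.
clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
clf.fit(X.values, targets.values)

pred = clf.predict([[1]])  # one row, one feature
print(pred)                # one predicted class per target column
```

The prediction is a bin index per column, so it has to be mapped back to the bin edges to be read as a range of values.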

Upvotes: 2

taga

Reputation: 3885

I found a solution. To eliminate this error, all output variables, in this case ['Daily Time on Site', 'Age', 'Gender'], need to be integers, not floats:

dic = {'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
       'Age': [23,33,48,30,20,49,37,48],
       'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
       'Clicked': [0, 0, 1, 0, 0, 1, 0, 1]}

my_data = pd.DataFrame(dic)
my_data['Daily Time on Site'] = my_data['Daily Time on Site'].round(0).astype(int)  # whole numbers, stored as int

variables = my_data[['Clicked']] #values are only 0 and 1 (0 = not clicked , 1 = clicked)
results = my_data[['Daily Time on Site', 'Age', 'Gender']] #values are now whole numbers

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs', multi_class='ovr'))
multi_output_clf.fit(variables.values.reshape(-1, 1),results)

x = multi_output_clf.predict([[1]])  # predict expects a 2D array: one row, one feature
print(x)
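
To see why rounding removes the error, it helps to check what scikit-learn infers about the target with type_of_target from sklearn.utils.multiclass. The check is applied to the second argument of fit (the outputs), not to the input column, so the 0/1 'Clicked' values are not what triggers it. A minimal sketch on the same data; the names before and after are only for illustration:

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

my_data = pd.DataFrame({
    'Daily Time on Site': [59.99, 88.91, 66.00, 74.53, 69.88, 47.64, 83.07, 69.57],
    'Age': [23, 33, 48, 30, 20, 49, 37, 48],
    'Gender': [1, 0, 1, 1, 1, 0, 1, 1],
    'Clicked': [0, 0, 1, 0, 0, 1, 0, 1],
})

cols = ['Daily Time on Site', 'Age', 'Gender']

# Target as first used: one column holds fractional floats.
before = type_of_target(my_data[cols].to_numpy())

# Same target after rounding the float column to whole numbers.
rounded = my_data[cols].copy()
rounded['Daily Time on Site'] = rounded['Daily Time on Site'].round(0)
after = type_of_target(rounded.to_numpy())

print(before)  # continuous-multioutput
print(after)   # multiclass-multioutput
```

A single fractional value anywhere in the outputs is enough for the whole target to be labelled continuous, which the classifier rejects.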

Upvotes: 1
