Reputation: 581
I am trying to create a model that predicts the Result column below:
Date Open High Close Result
1/22/2010 25.95 31.29 30.89 0.176104
2/19/2010 23.98 24.22 23.60 -0.343760
3/19/2010 21.46 23.16 22.50 0.124994
4/23/2010 21.32 21.77 21.06 -0.765601
5/21/2010 55.41 55.85 49.06 0.302556
The code I am using is:
import pandas
from sklearn.tree import DecisionTreeClassifier
dataset = pandas.read_csv('data.csv')
X = dataset.drop(columns=['Date','Result'])
y = dataset.drop(columns=['Date', 'Open', 'High', 'Close'])
model = DecisionTreeClassifier()
model.fit(X, y)
But I am getting an error:
ValueError: Unknown label type: 'continuous'
Suggestions for other algorithms are also welcome.
Upvotes: 5
Views: 21474
Reputation: 803
You are using DecisionTreeClassifier, which is a classifier and will only predict categorical values such as 0 or 1, but your Result column is continuous, so you should use DecisionTreeRegressor.
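A minimal sketch of that swap, assuming the five sample rows from the question stand in for data.csv (they are inlined here only so the snippet runs on its own). The random_state=0 argument and the use of dataset['Result'] as a one-dimensional target are my own choices, not something the question requires:

```python
import io

import pandas
from sklearn.tree import DecisionTreeRegressor

# The five sample rows from the question, inlined in place of data.csv.
CSV_TEXT = """Date,Open,High,Close,Result
1/22/2010,25.95,31.29,30.89,0.176104
2/19/2010,23.98,24.22,23.60,-0.343760
3/19/2010,21.46,23.16,22.50,0.124994
4/23/2010,21.32,21.77,21.06,-0.765601
5/21/2010,55.41,55.85,49.06,0.302556
"""

dataset = pandas.read_csv(io.StringIO(CSV_TEXT))

# Features: every column except the date and the target.
X = dataset.drop(columns=['Date', 'Result'])
# Target: the continuous Result column as a 1-D Series.
y = dataset['Result']

# A regressor accepts continuous targets, so fit() no longer raises ValueError.
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)

# Predictions on the training rows, just to show the model is usable.
predictions = model.predict(X)
print(predictions)
```

With only five rows this says nothing about how well the model generalises; on real data you would hold out a test set before judging it.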
Upvotes: 4
Reputation: 1896
A few suggestions:
Regression
but these require a model like ARIMA. As for the error, DecisionTreeClassifier is supposed to be used for identifying categories like 1, 2, 3, 4, and so on, but only for a limited set of classes.
For a series like your Result, which is continuous and fractional, you should use a regression model or an ARIMA-like time series model.
Upvotes: 1
Reputation: 175
In ML, it's important as a first step to consider the nature of your problem. Is it a regression or a classification problem? Do you have target data (supervised learning), or is this a problem where you don't have a target and want to learn more about your data's inherent structure (unsupervised learning)? Then, consider what steps you need to take in your pipeline to prepare your data (preprocessing).
In this case, you are passing floats (floating point numbers) to a Classifier (DecisionTreeClassifier). The problem with this is that a classifier generally separates distinct classes, and so this classifier expects a string or an integer type to distinguish different classes from each other (this is known as the "target"). You can read more about this in an introduction to classifiers.
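To see how scikit-learn categorises a target, you can call its type_of_target helper, which as far as I can tell is the check that produces the 'continuous' in your error message. In this small sketch the float values are the Result column from the question, while the integer and string label lists are made up purely for comparison:

```python
from sklearn.utils.multiclass import type_of_target

# The Result values from the question: fractional floats.
continuous_target = [0.176104, -0.343760, 0.124994, -0.765601, 0.302556]

# Hypothetical class labels, invented here only to show the contrast.
integer_labels = [0, 1, 1, 0, 1]
string_labels = ['up', 'down', 'up', 'down', 'up']

print(type_of_target(continuous_target))  # continuous
print(type_of_target(integer_labels))     # binary
print(type_of_target(string_labels))      # binary
```

A classifier is happy with the last two kinds of target and rejects the first, which is why your fit() call fails.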
The problem you seek to solve is to determine a continuous numerical output, Result. This is known as a regression problem, and so you need to use a regression algorithm (such as the DecisionTreeRegressor). You can try other regression algorithms once you have this simple one working. A decision tree is a good place to start: it is fairly straightforward to understand, fairly transparent, fast, and easily implemented, so decision trees were a great choice of starting point!
As a further note, it is important to consider preprocessing your data. You have done some of this simply by separating your target from your input data:
X = dataset.drop(columns=['Date','Result'])
y = dataset.drop(columns=['Date', 'Open', 'High', 'Close'])
However, you may wish to look into preprocessing further, particularly standardisation of your data. This is often a required step for an ML algorithm to interpret your data well, although tree-based models such as decision trees are largely insensitive to feature scaling. There's a saying that goes: "Garbage in, garbage out".
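As a rough sketch of what standardisation looks like in scikit-learn, the snippet below scales the feature columns with StandardScaler and chains the scaler with a regressor in a pipeline. The sample rows are the ones from the question, inlined in place of data.csv. The choice of KNeighborsRegressor with n_neighbors=2 is my own assumption, picked only because distance-based models are sensitive to feature scale:

```python
import io

import pandas
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The five sample rows from the question, inlined in place of data.csv.
CSV_TEXT = """Date,Open,High,Close,Result
1/22/2010,25.95,31.29,30.89,0.176104
2/19/2010,23.98,24.22,23.60,-0.343760
3/19/2010,21.46,23.16,22.50,0.124994
4/23/2010,21.32,21.77,21.06,-0.765601
5/21/2010,55.41,55.85,49.06,0.302556
"""

dataset = pandas.read_csv(io.StringIO(CSV_TEXT))
X = dataset.drop(columns=['Date', 'Result'])
y = dataset['Result']

# Standardise each feature column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))  # each entry is approximately 0
print(scaled.std(axis=0))   # each entry is approximately 1

# Chaining the scaler and the model means the same scaling is applied
# at fit time and at predict time. KNeighborsRegressor is an assumption
# made for illustration; it is not part of the original question.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=2))
pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

Putting the scaler inside the pipeline also helps once you split into training and test sets, since the scaling statistics are then learned from the training data only.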
Part of preprocessing sometimes requires you to change the data type of a given column. The error posted in your question, at face value, leads one to think that the issue at hand is that you need to change data types. But, as explained, in the case of your problem it wouldn't help to do that, given that you seek to use regression to determine a continuous output.
Upvotes: 12