Reputation: 341
I'm testing out a Decision Tree for the first time and am getting a perfect score for my algorithm's performance. This doesn't make sense, because the dataset I'm using is AAPL stock price data with several variables, which the algorithm obviously can't predict perfectly.
CSV:
Date,Open,High,Low,Close,Adj Close,Volume
2010-01-04,10430.6904296875,10604.9697265625,10430.6904296875,10583.9599609375,10583.9599609375,179780000
2010-01-05,10584.5595703125,10584.5595703125,10522.51953125,10572.01953125,10572.01953125,188540000
I think the reason it might not be working is because I am essentially just feeding in the answers when training the model and it is just regurgitating those when I try and score the model.
Code:
import pandas as pd
from sklearn import preprocessing, tree

# Data sorting
df = pd.read_csv('AAPL_test.csv')
df = df.drop('Date', axis=1)
df = df.dropna(axis='rows')
inputs = df.drop('Close', axis='columns')
target = df['Close']
print(inputs.dtypes)
print(target.dtypes)
# Changing dtypes: encode the continuous Close prices as class labels
lab_enc = preprocessing.LabelEncoder()
target_encoded = lab_enc.fit_transform(target)
# Model
model = tree.DecisionTreeClassifier()
model.fit(inputs, target_encoded)
# Scoring on the same rows the model was trained on
print(f'SCORE = {model.score(inputs, target_encoded)}')
I've also thought about randomizing the order of the CSV rows; that could help, but I'm not sure how I would do it. I could randomize the df at the top of the code, but I'm pretty sure that would equally skew the results for both dataframes, so there would be no difference from what I'm doing now. Otherwise, I could randomize the datasets individually, but I think that would mess with the model training or scoring because the test data wouldn't be associated with the right data? I'm not too sure.
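On the randomizing idea: shuffling whole rows of the dataframe keeps each row's features paired with its own target, so nothing gets misaligned. A minimal sketch using toy data (the column values here are made up for illustration):

```python
import pandas as pd

# Toy dataframe standing in for the CSV; each Close belongs to its Open
df = pd.DataFrame({'Open':  [1.0, 2.0, 3.0, 4.0],
                   'Close': [1.5, 2.5, 3.5, 4.5]})

# sample(frac=1) shuffles entire rows at once, keeping columns aligned
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```

The row order changes, but every `Close` still sits next to the `Open` it came with, which is exactly what you need before splitting.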
Upvotes: 1
Views: 188
Reputation: 1093
Most probably your model is overfitted. I think you did not split your dataset into two parts: one for training and the other for testing. Held-out test data will show you whether your model overfits or underfits. Also, since Close is a continuous price, a regressor is a better fit than label-encoding it for a classifier.
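A minimal sketch of the split, using a DecisionTreeRegressor (since the target is a continuous price) and synthetic OHLC-style data in place of your CSV:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the CSV: Close depends on the features plus noise
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({'Open': rng.uniform(100, 200, n),
                   'High': rng.uniform(100, 200, n),
                   'Low':  rng.uniform(100, 200, n)})
df['Close'] = df[['Open', 'High', 'Low']].mean(axis=1) + rng.normal(0, 1, n)

X = df.drop('Close', axis=1)
y = df['Close']

# 80/20 split; shuffling here keeps each row's features paired with its target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Training score is near-perfect because the tree memorizes the rows it saw;
# the test score is the honest estimate of generalization
print(f'Train R^2 = {model.score(X_train, y_train):.3f}')
print(f'Test  R^2 = {model.score(X_test, y_test):.3f}')
```

You will see the training score stay near 1.0 while the test score drops, which is the overfitting the perfect score was hiding.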
Upvotes: 2