Reputation: 93
I am trying to work with the RandomForestRegressor. Using the RandomForestClassifier I was able to get a varying outcome of +/-1. However, using the RandomForestRegressor I only get a constant value when I try to predict.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
data = pd.read_csv(r'C:\H\XPA.csv')  # raw string so the backslashes are not treated as escapes
data['pct move'] = data['XP MOVE']
# Features construction
data.dropna(inplace=True)
# X is the input variable
X = data[['XPSpread', 'stdev300min']]
# y is the target or output variable
y = data['pct move']
# Total dataset length
dataset_length = data.shape[0]
# Training dataset length
split = int(dataset_length * 0.75)
# Splitting the X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
clf = RandomForestRegressor(n_estimators=1000)
# Create the model on the train dataset
model = clf.fit(X_train, y_train)
data['strategy_returns'] = data['pct move'].shift(-1) * -model.predict(X)
print(model.predict(X_test))
Output:
[4.05371547e-07 4.05371547e-07 4.05371547e-07 ... 4.05371547e-07
4.05371547e-07 4.05371547e-07]
The output is a constant value, while the y data looks like this:
0 -0.0002
1 0.0000
2 -0.0002
3 0.0002
4 0.0003
...
29583 0.0014
29584 0.0010
29585 0.0046
29586 0.0018
29587 0.0002
X data:
XPSpread stdev300min
0 1.0 0.0002
1 1.0 0.0002
2 1.0 0.0002
3 1.0 0.0002
4 1.0 0.0002
... ... ...
29583 6.0 0.0021
29584 6.0 0.0021
29585 19.0 0.0022
29586 9.0 0.0022
29587 30.0 0.0022
Now, when I change this to a classification problem, I do get a relatively good prediction of the sign. However, when I treat it as a regression, I get a constant outcome. Any suggestions on how I can improve this?
Upvotes: 0
Views: 203
Reputation: 60317
It may very well be the case that, with only two features, there is simply not enough information for a numeric prediction (i.e. regression), while in a "milder" classification setting (predicting just the sign, as you say) you have some success.
The low number of features is not the only possible issue; judging from the few samples you have posted, one can easily see that, for example, your first 5 samples have identical features ([1.0, 0.0002]), while their corresponding y values can be anywhere in [-0.0002, 0.0003] - and the situation is similar for your samples #29583 & #29584. On the other hand, your samples #3 ([1.0, 0.0002]) and #29587 ([30.0, 0.0022]) look very dissimilar, but they end up having the same y value of 0.0002.
If the rest of your dataset has similar characteristics, it may just not be amenable to decent regression modeling.
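To see how widespread this is, you could group the rows by their feature values and check how much y varies within each group; a minimal sketch, assuming the dataframe data and the column names from your question:
# Group rows with identical feature values and measure the spread
# of the target within each group
grouped = data.groupby(['XPSpread', 'stdev300min'])['pct move']
spread = grouped.agg(['count', 'min', 'max'])
spread['range'] = spread['max'] - spread['min']
# Groups with many rows but a wide target range mean the two
# features alone cannot pin down the target
print(spread.sort_values('range', ascending=False).head(10))
If many large groups show a wide range, then no regressor restricted to these two features can do much better than predicting an average value within each group - which is essentially the near-constant output you are seeing.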
Last but not least, if your data are in any way "ordered" along some feature (and they look like they are, though of course I cannot be sure from such a small sample), the situation gets worse: your manual 75/25 split then trains on one part of that ordering and tests on another. What I suggest is to split your data using train_test_split, instead of doing it manually:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True)
which hopefully, due to shuffling, will result in a more favorable split. You may want to remove duplicate rows from the dataframe before shuffling and splitting (they are never a good idea) - see pandas.DataFrame.drop_duplicates.
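For example, a minimal sketch (assuming data, X and y as defined in your question):
# Drop rows that are exact duplicates across the features and the target,
# then rebuild X and y and split with shuffling
data = data.drop_duplicates(subset=['XPSpread', 'stdev300min', 'pct move'])
X = data[['XPSpread', 'stdev300min']]
y = data['pct move']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True)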
Upvotes: 1