Mahsolid

Reputation: 433

How to train and predict a model using Random Forest?

How can I make predictions with a random forest? I want to train a random forest model in Python and use it to predict a true value from a three-column dataset. The full CSV dataset (available for download) is formatted as follows:

t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10

I want to predict the current value of Y (the true value) from the last N data points of X (for example N = 5, 10, 100, 300, 1000, etc.) using sklearn's random forest model in Python. That is, taking [0,0,1,2,3] from the X column as the input for the first window, I want to predict the Y value of the 5th row, with the model trained on the previous values of Y. With a simple rolling OLS regression model this can be done as follows, but I want to do it with a random forest model.

import pandas as pd

df = pd.read_csv('data_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']], 
                               window_type='rolling', window=5, intercept=True)
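
For comparison, the windowed input described above could be built for a random forest by shifting X. The following is only a minimal sketch on toy data copied from the sample rows; the helper name make_windows and the column names X_t0..X_t4 are assumptions for illustration, and the window here includes the current X value, matching the [0,0,1,2,3] example.

```python
import pandas as pd

# Toy data copied from the sample rows above (only X and Y are needed).
df = pd.DataFrame({
    "X": [0, 0, 1, 2, 3, 4, 5, 6, 7],
    "Y": [10] * 9,
})

def make_windows(frame, window):
    """Hypothetical helper: one row per complete window of X values.

    X_t0 is the current X, X_t1 the previous one, and so on.
    """
    lags = pd.DataFrame({f"X_t{i}": frame["X"].shift(i) for i in range(window)})
    lags["Y"] = frame["Y"]
    # Rows whose window is incomplete contain NaN; dropping them here
    # removes the matching Y values as well, so both stay aligned.
    lags = lags.dropna()
    return lags.drop(columns="Y"), lags["Y"]

features, target = make_windows(df, window=5)
print(features.iloc[0].tolist())  # [3.0, 2.0, 1.0, 0.0, 0.0]
```

The first complete window is row 5, whose features are the values [0,0,1,2,3] in most-recent-first order.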

I have solved this with a random forest, which yields the following df, predictions, and MSE:

t_stamp     X    Y     X_t1    X_t2     X_t3    X_t4    X_t5
0.000543    0   10      NaN     NaN     NaN     NaN     NaN
0.000575    0   10      0.0     NaN     NaN     NaN     NaN
0.041324    1   10      0.0     0.0     NaN     NaN     NaN
0.041331    2   10      1.0     0.0     0.0     NaN     NaN
0.041336    3   10      2.0     1.0     0.0     0.0     NaN
0.041340    4   10      3.0     2.0     1.0     0.0     0.0
0.041345    5   10      4.0     3.0     2.0     1.0     0.0
0.041350    6   10      5.0     4.0     3.0     2.0     1.0
0.041354    7   10      6.0     5.0     4.0     3.0     2.0
 .........................................................   
[ 10.  10.  10.  10. .................................]
MSE: 1.3273548431

This seems to work for window sizes of 5, 10, 15, 20, and 22. However, for window sizes greater than 23 it prints MSE: 0.0. This is because, as the dataset shows, Y is fixed at 10 for rows 1 to 23 and only changes to another value (20, and so on) from row 24 onward. How can we train a model and predict in such cases based on the last data points?

Upvotes: 0

Views: 2904

Answers (1)

cs95

Reputation: 402942

It seems that in the existing code, calling dropna truncates X but not y, so the features and targets no longer line up. You also train and test on the same data.

Fixing this will give non-zero MSE.
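
The misalignment can be seen in isolation on a small example. This is a sketch on made-up data (the frame and the three-lag window are assumptions, not the original CSV): dropping NaN rows from the features alone leaves the target longer than the features, while attaching Y first keeps both in step.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"X": np.arange(10), "Y": np.arange(10) * 2})
lags = pd.DataFrame({f"X_{i}": df["X"].shift(i) for i in range(3)})

# Misaligned: dropna shortens the features, but the target keeps every row.
X_bad = lags.dropna().values
y_bad = df["Y"].values
print(X_bad.shape[0], y_bad.shape[0])  # 8 10

# Aligned: attach Y first so dropna removes the same rows from both.
lags["Y"] = df["Y"]
clean = lags.dropna()
X_ok = clean.drop(columns="Y").values
y_ok = clean["Y"].values
print(X_ok.shape[0], y_ok.shape[0])  # 8 8
```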

Code:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv('/Users/shivadeviah/Desktop/estimated_pred.csv')

# Build lagged copies of X: X_0 is the current value, X_24 is 24 rows back
df1 = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(25)})
# Attach Y before dropping NaNs so features and target stay aligned
df1['Y'] = df['Y']
# Shuffle the rows, then drop those with an incomplete window
df1 = df1.sample(frac=1).reset_index(drop=True)
df1.dropna(inplace=True)

X = df1.iloc[:, :-1].values
y = df1.iloc[:, -1].values

# Train on the first 66% of the shuffled rows, test on the rest
split = int(len(X) * 0.66)

X_train = X[:split]
X_test = X[split:]
y_train = y[:split]
y_test = y[split:]

# The default criterion is squared error; the name 'mse' was
# deprecated in scikit-learn 1.0 and later removed
reg = RandomForestRegressor()
reg.fit(X_train, y_train)

modelPred = reg.predict(X_test)

print(modelPred)
print("Number of predictions:", len(modelPred))

meanSquaredError = mean_squared_error(y_test, modelPred)

print("MSE:", meanSquaredError)
print(df1.size)

# Keep the test rows alongside their predictions for inspection
df2 = df1.iloc[split:, :].copy()
df2['pred'] = modelPred

df2.head()

Output:

[ 267.7     258.26608241  265.07037249 ...,  267.27370169  256.7     272.2 ]
Number of predictions: 87891
MSE: 1954.9271256
6721026

        X_0       pred
170625  48  267.700000
170626  66  258.266082
170627  184 265.070372
170628  259 294.700000
170629  271 281.966667
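
As a possible variation, the shuffled 66/34 hold-out could also be expressed with scikit-learn's train_test_split. The sketch below uses synthetic data in place of the lagged feature matrix (the data, n_estimators=50, and the random seeds are assumptions for illustration, not values from the original CSV or code).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lagged features and target (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)

# train_size=0.66 mirrors the manual split; shuffle=True randomizes row order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, shuffle=True, random_state=0
)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)

# Evaluate only on rows the model has not seen during training.
mse = mean_squared_error(y_test, reg.predict(X_test))
print("MSE:", mse)
```

Note that shuffling mixes earlier and later rows of a time-ordered series; passing shuffle=False instead would keep the test set strictly after the training set.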

Upvotes: 2
