RahMod

Reputation: 31

Python: How can we match the predicted values to the truth values of a regression model?

We are trying to plot the predicted values and the truth values on the same graph after fitting a RandomForestRegressor in Python. The model is trained on a three-column CSV dataset (the full dataset is linked for download), formatted as follows:

t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10

Here is how we do the prediction.

import glob
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "data*.csv"))))

for i in range(1,10):
    df['X_t'+str(i)] = df['X'].shift(i)

print(df)

df.dropna(inplace=True)

X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(10)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)


# The default criterion is squared error (called 'mse' in older scikit-learn releases).
reg = RandomForestRegressor()
reg.fit(X_train,y_train)


modelPred_test = reg.predict(X_test)

print(modelPred_test)

For comparison, we wish to plot the data before and after prediction. For the truth values, we tried:

fig, ax = plt.subplots()
ax.plot(df['time'].values, df['Y'].values)

We wish to plot the predictions in the same graph as the ground truth (time on the x-axis and the value of Y on the y-axis). When we do

ax.plot(df['time'].values, modelPred_test)

we get the following error:

    raise ValueError("x and y must have same first dimension")

ValueError: x and y must have same first dimension

This means that we have fewer predicted values than time stamps in our dataset. To verify this, we printed df['time'].values.shape and modelPred_test.shape, which gave (258523,) and (103410,) respectively. How can we determine which time values correspond to the predicted values, so that we can use that subset of the time values in the plot command?

Upvotes: 1

Views: 5078

Answers (2)

Desta Haileselassie Hagos

Reputation: 26176

You have to set up your data as follows, keeping the time column through the split and dropping it only afterwards:

X = df.drop('Y', axis=1)
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
X_train = X_train.drop('time', axis=1)
X_test = X_test.drop('time', axis=1)

Then sort the test sets by index, so that the rows are back in their original order:

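index_values = range(len(y_test))
# Sort by the original row index so X_test and y_test stay aligned.
y_test = y_test.sort_index()
X_test = X_test.sort_index()
# The feature columns changed, so the model has to be fitted on the new X_train.
reg.fit(X_train, y_train)
modelPred_test = reg.predict(X_test)
ax.plot(index_values, y_test.values)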

Finally, plot the predicted values of y in the same way, for example with ax.plot(index_values, modelPred_test). Hope this helps.
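For reference, here is a minimal end-to-end sketch of this idea. It is only an illustration under stated assumptions: the data is synthetic, the column names time, X and Y are assumed, and the helper names (features, target, time_test, pred_test) are made up for the example. Unlike the snippet above, it plots against the time values of the test rows rather than a running index.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV data (assumption: columns time, X, Y).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "time": np.linspace(0.0, 1.0, n),
    "X": np.arange(n) % 20,
})
df["Y"] = 10 + 0.5 * df["X"] + rng.normal(scale=0.1, size=n)

# Keep 'time' in the feature frame so it is split together with X and Y.
features = df.drop("Y", axis=1)
target = df["Y"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.40, random_state=0
)

# Put the test rows back into their original order, then pull out the time column.
X_test = X_test.sort_index()
y_test = y_test.sort_index()
time_test = X_test["time"]

# Fit and predict without the time column.
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train.drop("time", axis=1), y_train)
pred_test = reg.predict(X_test.drop("time", axis=1))

# time_test, y_test and pred_test now all have one entry per test row.
fig, ax = plt.subplots()
ax.plot(time_test.values, y_test.values, label="truth")
ax.plot(time_test.values, pred_test, label="prediction")
ax.set_xlabel("time")
ax.set_ylabel("Y")
ax.legend()
```

Because the time values travel with the test rows, both curves share the same x-values and the "same first dimension" error cannot occur.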

Upvotes: 1

hilberts_drinking_problem

Reputation: 11602

You need to keep track of the indices for training and testing datasets. For example, you could define

train_index, test_index = train_test_split(df.index, test_size=0.40)

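and then X_train = X[train_index], y_train = y[train_index], and so on. This assumes that df has a unique 0..n-1 index (for example after df.reset_index(drop=True)), so that the index labels can also be used as positions in the NumPy arrays.

Then you could plot the results via ax.plot(df['time'][test_index].values, modelPred_test). The predictions come back in the same order as test_index, so no further masking is needed. Sort test_index before predicting if you want the line drawn in chronological order.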
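As a rough sketch of the index-tracking approach, assuming synthetic data and the column names time, X and Y: the names positions, train_pos and test_pos below are hypothetical, and the split is done on row positions (np.arange) rather than on df.index, which sidesteps any problem with non-unique index labels.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV data (assumption: columns time, X, Y).
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({"time": np.linspace(0.0, 1.0, n), "X": np.arange(n) % 10})
df["Y"] = 10 + df["X"] + rng.normal(scale=0.1, size=n)

X = df[["X"]].to_numpy()
y = df["Y"].to_numpy()
time = df["time"].to_numpy()

# Split the row positions instead of the data, and remember which rows are test rows.
positions = np.arange(len(df))
train_pos, test_pos = train_test_split(positions, test_size=0.40, random_state=0)
test_pos = np.sort(test_pos)  # chronological order for a clean line plot

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X[train_pos], y[train_pos])
pred_test = reg.predict(X[test_pos])

# pred_test[k] is the prediction for the row at time_test[k].
time_test = time[test_pos]
y_test = y[test_pos]
```

With these arrays, ax.plot(time_test, y_test) and ax.plot(time_test, pred_test) draw the truth and the prediction over the same time stamps.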

Upvotes: 0
