bodruk
bodruk

Reputation: 3262

Can't predict values using Linear Regression

there!

I'm studying the IBM Data Science course by Coursera and I'm trying to create some snippets to practice. I've created the following code:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Import and format the dataframes
ibov = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ibov.csv')
ifix = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ifix.csv')
ibov['DATA'] = pd.to_datetime(ibov['DATA'], format='%d/%m/%Y')
ifix['DATA'] = pd.to_datetime(ifix['DATA'], format='%d/%m/%Y')
ifix = ifix.sort_values(by='DATA', ascending=False)
ibov = ibov.sort_values(by='DATA', ascending=False)
ibov = ibov[['DATA','FECHAMENTO']]
ibov.rename(columns={'FECHAMENTO':'IBOV'}, inplace=True)
ifix = ifix[['DATA','FECHAMENTO']]
ifix.rename(columns={'FECHAMENTO':'IFIX'}, inplace=True)

# Merge datasets 
df_idx = ibov.merge( ifix, how='left', on='DATA')
df_idx.set_index('DATA', inplace=True)
df_idx.head()

# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)

# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = np.array([x_train])
y_train = np.array([y_train])
x_test = np.array([x_test])
y_test = np.array([y_test])

# Plot the result
regr.fit(x_train, y_train)
y_pred = regr.predict(y_train)
plt.scatter(x_train, y_train)
plt.plot(x_test, y_pred, color='blue', linewidth=3) # This line produces no result

I experienced some issues with the output values returned by the train_test_split() method. So I converted them to Numpy arrays, then my code worked. I can plot my scatter plot normally, but I can't plot my prediction line.

Running this code on my IBM Data Cloud Notebook produces the following warning:

/opt/conda/envs/Python36/lib/python3.6/site-packages/matplotlib/axes/_base.py:380: MatplotlibDeprecationWarning: cycling among columns of inputs with non-matching shapes is deprecated. cbook.warn_deprecated("2.2", "cycling among columns of inputs "

I searched on Google and here on StackOverflow, but I can't figure what is wrong.

I'll appreciate some assistance. Thanks in advance!

Upvotes: 1

Views: 165

Answers (1)

Sergey Bushmanov
Sergey Bushmanov

Reputation: 25189

There are several issues in your code, like y_pred = regr.predict(y_train) and the way you draw a line.

The following code snippet should set you in the right direction:

# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)

# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values

# Plot the result
plt.scatter(x_train, y_train)

regr.fit(x_train.reshape(-1,1), y_train)
idx = np.argsort(x_train)
y_pred = regr.predict(x_train[idx].reshape(-1,1))
plt.plot(x_train[idx], y_pred, color='blue', linewidth=3);

enter image description here

To do the same for the test subset with already fitted model:

# Plot the result
plt.scatter(x_test, y_test)
idx = np.argsort(x_test)
y_pred = regr.predict(x_test[idx].reshape(-1,1))
plt.plot(x_test[idx], y_pred, color='blue', linewidth=3);

enter image description here

Feel free to ask questions if you have any.

Upvotes: 1

Related Questions