cphill

Reputation: 5914

Python - SKLearn Fit Array Error

I'm relatively new to using sklearn and python for data analysis and am trying to run some linear regression on a dataset that I loaded from a .csv file.

I passed my data through train_test_split without any issues, but when I try to fit the model on my training data I get this error: ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The error is raised at model = lm.fit(X_train, y_train)

Since I'm new to these packages, I'm trying to determine whether this happens because I didn't convert my imported CSV to a pandas DataFrame before running the regression, or whether something else is causing it.

My CSV is in the format of:

Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29

Here is how I set up the regression:

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt


organic = pd.read_csv("linear-regression.csv")

organic.columns
# Output:
# Index(['Month', 'Date', 'Day of Week', 'Growth', 'Sunlight', 'Plants'], dtype='object')

# Set the dependent (Growth) and independent (Sunlight) variables
y = organic['Growth']
X = organic['Sunlight']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
# Output:
# (192,) (49,)
# (192,) (49,)

lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)

# Error pointing to an array with values from Sunlight [611, 507, 994, ...]

Upvotes: 0

Views: 5756

Answers (3)

Ehsan

Reputation: 67

train_test_split(X, y, test_size=0.2) returns X_train and X_test as pandas Series with shapes (192,) and (49,). As the other answers mention, sklearn expects X_train and X_test to be matrices of shape [n_samples, n_features]. You can simply convert the Series X_train and X_test to pandas DataFrames, which changes their shapes to (192, 1) and (49, 1):

lm = linear_model.LinearRegression()
model = lm.fit(X_train.to_frame(), y_train)
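To see the shape change, here is a minimal sketch that inlines the seven sample rows from the question instead of reading linear-regression.csv (the io.StringIO wrapper and the variable names X_series, X_frame and X_frame_alt are my own, for illustration). It also shows that selecting the column with double brackets should give the same 2D result as to_frame().

```python
import io

import pandas as pd
from sklearn import linear_model

# Sample rows copied from the question, inlined so the sketch runs on its own.
csv_text = """Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29
"""
organic = pd.read_csv(io.StringIO(csv_text))

X_series = organic['Sunlight']        # pandas Series, 1D
X_frame = X_series.to_frame()         # pandas DataFrame, 2D with one column
X_frame_alt = organic[['Sunlight']]   # double brackets also return a DataFrame
y = organic['Growth']                 # the target can stay 1D

print(X_series.shape, X_frame.shape, X_frame_alt.shape)
# (7,) (7, 1) (7, 1)

lm = linear_model.LinearRegression()
model = lm.fit(X_frame, y)            # fits without the ValueError
```

If you select X with double brackets before calling train_test_split, X_train and X_test come back as DataFrames and no conversion is needed afterwards.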

Upvotes: 0

pythonic833

Reputation: 3224

You just need to change your last lines to

lm = linear_model.LinearRegression()
model = lm.fit(X_train.values.reshape(-1,1), y_train)

and the model will fit. The reason is that the linear model from sklearn expects

X : numpy array or sparse matrix of shape [n_samples,n_features]

So the training data must have shape [n_samples, 1]: that is [7, 1] for the seven sample rows shown in the question, or [192, 1] for your actual X_train.
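As a quick illustration of what the two reshape calls from the error message do, here is a small NumPy-only sketch using the Sunlight values from the sample rows (the names sunlight, column and row are mine):

```python
import numpy as np

# Sunlight values from the seven sample rows in the question.
sunlight = np.array([611, 507, 994, 1014, 850, 551, 506])

column = sunlight.reshape(-1, 1)  # 7 samples, 1 feature
row = sunlight.reshape(1, -1)     # 1 sample, 7 features

print(sunlight.shape, column.shape, row.shape)
# (7,) (7, 1) (1, 7)
```

The -1 tells NumPy to infer that dimension from the array length, so reshape(-1, 1) works for any number of samples. For a single feature like Sunlight, the column form is the one you want.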

Upvotes: 5

Stev

Reputation: 1140

You are only using one feature, so the error message itself tells you what to do:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature.

The feature data X has to be 2D for scikit-learn estimators, even when there is only one feature; the target y can stay 1D.

(Don't forget the typo in X = organic['Sunglight'])
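A rough end-to-end sketch of the point above, again using only the seven sample rows from the question rather than the full file. The value 700 passed to predict is a made-up Sunlight reading for illustration. Note that predict needs a 2D input as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Values from the seven sample rows in the question.
sunlight = np.array([611, 507, 994, 1014, 850, 551, 506])
growth = np.array([44, 30, 55, 50, 78, 81, 59])

lm = LinearRegression()

# Passing the 1D feature array reproduces the error from the question.
try:
    lm.fit(sunlight, growth)
    raised = False
except ValueError:
    raised = True

# Reshape X to (n_samples, 1); y stays 1D.
X = sunlight.reshape(-1, 1)
lm.fit(X, growth)

# predict also expects 2D input: one row per sample, one column per feature.
pred = lm.predict(np.array([[700]]))  # 700 is a hypothetical Sunlight value
print(raised, pred.shape)
```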

Upvotes: 1
