Karthik Bhojaraj

Reputation: 145

How can I fix "ValueError: Expected 2D array, got 1D array instead" in scikit-learn/Python?

I have just started with machine learning and am working through a simple example to learn. I want to classify the files on my disk by file type using a classifier. The code I have written is:

import sklearn
import numpy as np


# Importing a local data set from the desktop
import pandas as pd
mydata = pd.read_csv('file_format.csv',skipinitialspace=True)
print mydata


x_train = mydata.script
y_train = mydata.label

#print x_train
#print y_train
x_test = mydata.script

from sklearn import tree
classi = tree.DecisionTreeClassifier()

classi.fit(x_train, y_train)

predictions = classi.predict(x_test)
print predictions

And I am getting the following error:

  script  class  div   label
0       5      6    7    html
1       0      0    0  python
2       1      1    1     csv
Traceback (most recent call last):
  File "newtest.py", line 21, in <module>
    classi.fit(x_train, y_train)
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 790, in fit
    X_idx_sorted=X_idx_sorted)
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 116, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 410, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 5.  0.  1.].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.

How can I fix this problem?

Upvotes: 7

Views: 96712

Answers (6)

Gaurav Singh Rathore

Reputation: 11

You have to create a two-dimensional array.

You might be giving input like this:

model.predict([1, 2, 0, 4])

But this is wrong.

You have to give input like this:

model.predict([[1,2,0,4]])

There are two square brackets, not one.
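A minimal sketch of the difference, assuming a toy DecisionTreeClassifier trained on made-up data (the feature values and the labels "a", "b", "c" are illustrative assumptions, not taken from the question):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: 3 samples with 4 features each.
X_train = np.array([[1, 2, 0, 4],
                    [0, 0, 0, 0],
                    [5, 6, 7, 8]])
y_train = np.array(["a", "b", "c"])

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A single sample still needs the outer list, giving shape (1, 4).
prediction = model.predict([[1, 2, 0, 4]])

# A flat list has shape (4,), which scikit-learn rejects with a ValueError.
try:
    model.predict([1, 2, 0, 4])
    raised = False
except ValueError:
    raised = True
```

The outer brackets mean "a list of samples", the inner ones "the features of one sample".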

Upvotes: 1

codexaxor

Reputation: 97

Suppose initially you have,

X = dataset.iloc[:, 1].values

which selects the column at index 1 (the second column, since indexing is zero-based) for all rows, but returns it as a 1D array. Change it to the following:

X = dataset.iloc[:, 1:2].values

Here 1:2 is the half-open range [1, 2), so it selects the same single column, but because it is a slice the result stays 2D.
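As a quick check of the shapes, here is a small sketch. The DataFrame is rebuilt by hand from the table printed in the question, and the names one_d and two_d are only for illustration:

```python
import pandas as pd

# Data copied from the DataFrame printed in the question.
dataset = pd.DataFrame({
    "script": [5, 0, 1],
    "class": [6, 0, 1],
    "div": [7, 0, 1],
    "label": ["html", "python", "csv"],
})

one_d = dataset.iloc[:, 1].values    # integer position: 1D array
two_d = dataset.iloc[:, 1:2].values  # slice: 2D array with one column

print(one_d.shape)  # (3,)
print(two_d.shape)  # (3, 1)
```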

Upvotes: 0

Nabreezy

Reputation: 41

A simple way to get a 2D array without an explicit reshape is, instead of using:

X = dataset.iloc[:, 0].values

to use:

X = dataset.iloc[:, :-1].values

The slice :-1 selects every column except the last one and always returns a 2D array. So if you have only two columns (one feature and one label), this gives you the first column with shape (n_samples, 1).

Upvotes: 0

Ameya Marathe

Reputation: 221

Use:

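from sklearn.linear_model import LinearRegression

X = dataset.iloc[:, 0].values  # 1D, shape (n_samples,)
y = dataset.iloc[:, 1].values

regressor = LinearRegression()
X = X.reshape(-1, 1)  # 2D, shape (n_samples, 1)
regressor.fit(X, y)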

Note that reshape is not an in-place operation: it returns a new array and leaves X unchanged. So you have to assign the reshaped result back to X, as in the code above.
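A short sketch of that pitfall, using the array from the question's traceback (the names shape_before and shape_after are just for illustration):

```python
import numpy as np

X = np.array([5.0, 0.0, 1.0])

X.reshape(-1, 1)        # result is discarded, X itself is unchanged
shape_before = X.shape  # still (3,)

X = X.reshape(-1, 1)    # assign the reshaped array back to X
shape_after = X.shape   # now (3, 1)
```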

Upvotes: 4

sameer_nubia

Reputation: 811

The easy fix is to make the selection 2D when you pick the column, by using double brackets:

x_train = mydata[['script']]
y_train = mydata[['label']]
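To see why the double brackets help, here is a small sketch. It assumes a DataFrame rebuilt by hand from two columns of the question's data, and it keeps the labels as a 1D Series, which scikit-learn classifiers also accept:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Two columns copied from the DataFrame printed in the question.
mydata = pd.DataFrame({
    "script": [5, 0, 1],
    "label": ["html", "python", "csv"],
})

series_shape = mydata["script"].shape   # single brackets: 1D Series, (3,)
frame_shape = mydata[["script"]].shape  # double brackets: 2D DataFrame, (3, 1)

classi = DecisionTreeClassifier(random_state=0)
classi.fit(mydata[["script"]], mydata["label"])
predictions = classi.predict(mydata[["script"]])
```

Predicting on the training rows is only done here to show that fit and predict no longer raise; it says nothing about how well the model generalizes.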

Upvotes: 1

cs95

Reputation: 402263

When you pass input features to a classifier, pass a 2D array of shape (N, M), with N samples and M >= 1 features, not a 1D array of shape (N,). The error message is pretty clear:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
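To make the two suggested reshapes concrete, here is a small sketch using the array from the traceback (the variable names are illustrative):

```python
import numpy as np

array = np.array([5.0, 0.0, 1.0])

# Three samples with one feature each: one column.
single_feature = array.reshape(-1, 1)

# One sample with three features: one row.
single_sample = array.reshape(1, -1)

print(single_feature.shape)  # (3, 1)
print(single_sample.shape)   # (1, 3)
```

In the question each row of the CSV is one sample and script is one feature, so the first form is the one that matches the data.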

from sklearn.model_selection import train_test_split

# X.shape should be (N, M) where M >= 1
X = mydata[['script']]  
# y.shape should be (N,)
y = mydata['label']
# optionally perform label encoding if "label" contains strings
# y = pd.factorize(mydata['label'])[0]
X_train, X_test, y_train, y_test = train_test_split(
                      X, y, test_size=0.33, random_state=42)
...

clf.fit(X_train, y_train) 
print(clf.score(X_test, y_test))

Some other helpful tips:

  1. Split your data into separate train and test portions. Do not test on your training data; that leads to overly optimistic estimates of your classifier's performance.
  2. I'd recommend factorizing your labels so that you're dealing with integers. It's just easier.
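Both tips can be sketched together as follows. This is only an illustration: the DataFrame is hypothetical (the question's three rows repeated ten times so that a split is possible), and DecisionTreeClassifier is used because that is the classifier in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the question's rows repeated so there is enough to split.
mydata = pd.DataFrame({
    "script": [5, 0, 1] * 10,
    "label": ["html", "python", "csv"] * 10,
})

X = mydata[["script"]]  # 2D, shape (30, 1)

# factorize returns integer codes plus the original labels they stand for.
codes, classes = pd.factorize(mydata["label"])

X_train, X_test, y_train, y_test = train_test_split(
    X, codes, test_size=0.33, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Map the integer predictions back to the original string labels.
predicted_labels = classes[clf.predict(X_test)]
```

Keeping classes around is what lets you translate integer predictions back into readable labels.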

Upvotes: 21
