Tripti Agrawal

Reputation: 15

ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]

I am a beginner in ML. The problem is that my training and test data are in separate files and have different lengths, which causes the following error:

   Traceback (most recent call last):
     File "C:/Users/Ellen/Desktop/Python/ML_4.py", line 35, in <module>
       X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
     File "C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py", line 2184, in train_test_split
       arrays = indexable(*arrays)
     File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 260, in indexable
       check_consistent_length(*result)
     File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 235, in check_consistent_length
       " samples: %r" % [int(l) for l in lengths])
   ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]

I don't know how to resolve this error. Below is my code:

  tweets_train = pd.read_csv('Final.csv')
  features_train = tweets_train.iloc[:, 1].values
  labels = tweets_train.iloc[:, 0].values
  vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
  processed_features_train = vectorizer.fit_transform(features_train).toarray()
  tweets_test = pd.read_csv('DataF1.csv')
  features_test = tweets_test.iloc[:, 1].values.astype('U')
  vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
  processed_features_test = vectorizer.fit_transform(features_test).toarray()

  X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
  text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
  #regr.fit(X_train, y_train)
  text_classifier.fit(X_train, y_train)
  predictions = text_classifier.predict(X_test)
  print(confusion_matrix(y_test, predictions))
  print(classification_report(y_test, predictions))

The line producing the error is: X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)

processed_features_train.shape outputs (29675, 28148), whereas processed_features_test.shape outputs (9574, 11526).

The sample data is as follows (the first column is 'labels' and the second column is 'text'):

  neutral   tap to explore the biggest change to world wars since world war
  neutral   tap to explore the biggest change to sliced bread.
  negative  apple blocked
  neutral   apple applesupport can i have a yawning emoji ? i think i am asking for the 3rd or 5th time
  neutral   apple made with 20  more child labor
  negative  apple is not she the one who said she hates americans ?

There are only 3 labels (Positive, Negative, Neutral) in both the train and test data files.

Upvotes: 0

Views: 5141

Answers (3)

Gabriel Arghire

Reputation: 2360

Make sure the number of samples is equal to the number of labels

I had the same error and found it was because the number of samples was not equal to the number of labels.

More specifically, I had this code:

clf = MultinomialNB().fit(X_train, Y_train)

The size of X_train was not equal to that of Y_train.
I then reviewed my code and fixed the mistake.
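
As a quick check, you can assert up front that the number of feature rows matches the number of labels. This is only a minimal sketch, assuming X_train and Y_train are the same arrays as in the snippet above:

# X_train and Y_train are assumed to be the feature matrix and label array from above.
from sklearn.naive_bayes import MultinomialNB

assert X_train.shape[0] == len(Y_train), (
    "Mismatch: %d samples vs %d labels" % (X_train.shape[0], len(Y_train))
)
clf = MultinomialNB().fit(X_train, Y_train)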

Upvotes: 1

Ben Reiniger

Reputation: 12592

  1. Since your test set is in a separate file, there's no need to split the data (unless you want a validation set, or your test set is unlabelled, as in competitions).

  2. You shouldn't fit a new Vectorizer on the test data; doing so means there is no connection between the columns in the training and testing sets. Instead, use vectorizer.transform(features_test) (with the same object vectorizer that you fit_transformed the training data).

So, try:

tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels_train = tweets_train.iloc[:, 0].values

# Fit the vectorizer on the training text only.
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()

tweets_test = pd.read_csv('DataF1.csv')
features_test = tweets_test.iloc[:, 1].values.astype('U')
labels_test = tweets_test.iloc[:, 0].values

# Reuse the same fitted vectorizer so the test set gets the same columns.
processed_features_test = vectorizer.transform(features_test).toarray()

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(processed_features_train, labels_train)
predictions = text_classifier.predict(processed_features_test)
print(confusion_matrix(labels_test, predictions))
print(classification_report(labels_test, predictions))

Upvotes: 1

Benj Cabalona Jr.

Reputation: 439

It's because you're passing three datasets into train_test_split, instead of just X and y as its arguments.
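
For illustration, a minimal sketch of the intended call, assuming processed_features_train and labels from the question's code (test_size=0.2 is just an example fraction): pass one feature matrix together with its matching labels, not the train and test matrices at once.

# Split a single feature matrix and its matching labels into a train/validation pair.
# processed_features_train and labels come from the question's code; 0.2 is illustrative.
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    processed_features_train, labels, test_size=0.2, random_state=0
)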

Upvotes: 0
