Neeraj Sharma

Reputation: 19

Found input variables with inconsistent numbers of samples: [2, 144]

I have a training data set of 144 feedback comments, 72 positive and 72 negative. There are two target labels, positive and negative. Consider the following code segment:

import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data) 
                     data    target
0      facilitates good student teacher communication.  positive
1                           lectures are very lengthy.  negative
2             the teacher is very good at interaction.  positive
3                       good at clearing the concepts.  positive
4                       good at clearing the concepts.  positive
5                                    good at teaching.  positive
6                          does not shows test copies.  negative
7                           good subjective knowledge.  positive

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data)
X = cv.transform(feedback_data)
X_test = cv.transform(feedback_data_test)

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i<72 else 0 for i in range(144)]
# the below line gives error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)

I do not understand what the problem is. Please help.

Upvotes: 1

Views: 96

Answers (1)

Frayal

Reputation: 2161

You are not using CountVectorizer correctly. This is what you have now:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(df)
X = cv.transform(df)
X
<2x2 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
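
The two rows in that output are not two comments. Iterating over a DataFrame yields its column labels, so the vectorizer receives the strings 'data' and 'target' as its only documents. Here is a minimal sketch of that behaviour, assuming a small hand-made frame (the three rows below are illustrative and are not the asker's full output.csv):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-row frame with the same two columns as the question's CSV.
df = pd.DataFrame({
    "data": [
        "good at teaching.",
        "lectures are very lengthy.",
        "good subjective knowledge.",
    ],
    "target": ["positive", "negative", "positive"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
docs_seen = list(df)

# Vectorizing those labels is what happens when the whole frame is passed in:
# two "documents", and a vocabulary made of the column names.
cv_wrong = CountVectorizer(binary=True)
X_wrong = cv_wrong.fit_transform(docs_seen)
vocab_wrong = sorted(cv_wrong.vocabulary_)

# Passing only the text column gives one row per comment.
X_right = CountVectorizer(binary=True).fit_transform(df["data"])

print(docs_seen, vocab_wrong, X_wrong.shape, X_right.shape)
```

The row count of X_wrong stays at 2 no matter how many comments the file holds, which matches the 2 in the error message [2, 144].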

As you can see, this does not do what you want: the comments are not transformed row by row. The count vectorizer is not even fitted correctly, because it receives the entire DataFrame rather than just the corpus of comments. To solve the issue, we need to make sure the counting is done properly. First, fit on the right corpus:

cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = cv.transform(df)
X
<2x23 sparse matrix of type '<class 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>

You can see that we are getting closer to what we want. We just have to transform it correctly (transform each line):

cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = df['data'].apply(lambda x: cv.transform([x])).values
X
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
   ...
       <1x23 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>], dtype=object)

We have a more suitable X! Now we just need to check if we can split:

from sklearn.model_selection import train_test_split

target = [1 if i<72 else 0 for i in range(8)] # The example dataset here has only 8 rows
# the line below no longer raises an error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)

And it works!
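
As an alternative sketch, not part of the approach above: passing the text column straight to the vectorizer returns a single sparse matrix with one row per comment, which is the form scikit-learn estimators expect. The frame below is rebuilt by hand from the eight rows printed in the question, and reading the labels from the 'target' column (instead of by row position) is an assumption about what the asker intends.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# The eight rows shown in the question, typed in by hand for illustration.
df = pd.DataFrame({
    "data": [
        "facilitates good student teacher communication.",
        "lectures are very lengthy.",
        "the teacher is very good at interaction.",
        "good at clearing the concepts.",
        "good at clearing the concepts.",
        "good at teaching.",
        "does not shows test copies.",
        "good subjective knowledge.",
    ],
    "target": ["positive", "negative", "positive", "positive",
               "positive", "positive", "negative", "positive"],
})

# Fit and transform on the text column only: one sparse row per comment.
cv = CountVectorizer(binary=True)
X = cv.fit_transform(df["data"])

# Assumption: derive 0/1 labels from the 'target' column, not from row order.
y = (df["target"] == "positive").astype(int)

# X and y now have the same number of samples, so the split succeeds.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.50, random_state=0
)
print(X.shape, X_train.shape, X_val.shape)
```

With the real file, the same pattern would use the frame loaded by pd.read_csv('output.csv'), and X would have 144 rows to match the 144 labels.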

Make sure you understand what CountVectorizer does so that you can use it the right way.

Upvotes: 1
