The number of columns of test data and training data is not the same

Question

I'm trying to determine whether news headlines are real or fake.

For this, I'm using 'CountVectorizer' to count how many times each word is used in each sentence.

The problem is that the number of words in the sentences is not the same, so the training set and the test set end up with different numbers of columns.

As a result, the program fails during the testing phase.

# Vectorized All Data
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentenceInput)

result = pd.DataFrame(data = X.toarray())

So the 'result' variable holds the feature (independent) data.

# Naive Bayes Prediction
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(result,targetInput)

I trained the model on this data using the target variable.

The problem begins after that point.

# Test Data
X = vectorizer.fit_transform(testList)
print(vectorizer.get_feature_names())
print(X.toarray())

testResult = pd.DataFrame(data = X.toarray())

prediction = nb.predict(testResult)
print(prediction)

I get the following error when I try to print 'prediction'.

ValueError: operands could not be broadcast together with shapes (489,1828) (5273,)

I'm not sure whether the problem is exactly what I've described.

Sven Harris · Accepted Answer

CountVectorizer doesn't actually care how many words are in each sentence. Its output is a sparse matrix in which the columns are the words and the rows are the sentences, and each value is the number of times the word appears in the given sentence, e.g. cabbage appears 3 times, bag appears 0 times, etc. The number of columns is therefore the size of the vocabulary learned when the vectorizer was fitted, not the length of any sentence.
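To illustrate the shape of that output, here is a minimal sketch. The two example sentences are made up for illustration and are not from the question's data; the point is only that each row is a sentence and each column is a word from the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences, chosen only to show the counting behaviour.
sentences = [
    "cabbage cabbage cabbage soup",
    "a bag of soup",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)  # sparse matrix: rows = sentences

# vocabulary_ maps each learned word to its column index.
vocabulary = vectorizer.vocabulary_
dense = counts.toarray()

cabbage_col = vocabulary["cabbage"]
bag_col = vocabulary["bag"]

print(dense.shape)            # (number of sentences, number of distinct words)
print(dense[0, cabbage_col])  # count of "cabbage" in the first sentence
print(dense[0, bag_col])      # count of "bag" in the first sentence
```

Note that with the default settings single-character tokens such as "a" are dropped, so they do not get a column.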

To make your data match up, you need to reuse the same fitted CountVectorizer instead of refitting it, which is what currently happens in your testing phase (by calling .fit_transform()).

Change your testing code to call .transform() only, and this part of the problem should go away. This reuses the CountVectorizer you fitted earlier and outputs values in the same form you used to build your model.

# Test Data
X = vectorizer.transform(testList)
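For completeness, here is a small end-to-end sketch of the fit-on-train, transform-on-test pattern. The headlines and labels are invented placeholders standing in for sentenceInput, targetInput and testList, so treat this as an illustration under those assumptions rather than the asker's actual pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-ins for sentenceInput, targetInput and testList.
train_sentences = [
    "scientists confirm water on mars",
    "aliens elected mayor of small town",
    "local council approves new budget",
    "man turns into lizard after eating pizza",
]
train_labels = ["real", "fake", "real", "fake"]
test_sentences = [
    "council confirm new budget",
    "lizard elected president of mars",  # "president" was never seen in training
]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training data only.
X_train = vectorizer.fit_transform(train_sentences)
# transform reuses that vocabulary; unseen words are simply ignored.
X_test = vectorizer.transform(test_sentences)

# Both matrices now have one column per training-vocabulary word.
print(X_train.shape, X_test.shape)

nb = GaussianNB()
nb.fit(X_train.toarray(), train_labels)  # GaussianNB needs dense input
prediction = nb.predict(X_test.toarray())
print(prediction)
```

Because the test matrix is built from the training vocabulary, its column count always matches what the model was fitted on, regardless of how many words the test sentences contain.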
