Reputation: 15
I am a beginner in ML. My training and test data are in separate files and have different lengths, and I am getting the following error:
Traceback (most recent call last):
  File "C:/Users/Ellen/Desktop/Python/ML_4.py", line 35, in <module>
    X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
  File "C:\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py", line 2184, in train_test_split
    arrays = indexable(*arrays)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 260, in indexable
    check_consistent_length(*result)
  File "C:\Python\Python37\lib\site-packages\sklearn\utils\validation.py", line 235, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [29675, 9574, 29675]
I don't know how to resolve this error. Below is my code:
tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels = tweets_train.iloc[:, 0].values
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()

tweets_test = pd.read_csv('DataF1.csv')
features_test = tweets_test.iloc[:, 1].values.astype('U')
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_test = vectorizer.fit_transform(features_test).toarray()

X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
#regr.fit(X_train, y_train)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
The line producing the error is: X_train, X_test, y_train, y_test = train_test_split(processed_features_train, processed_features_test, labels, test_size=1, random_state=0)
processed_features_train.shape is (29675, 28148), whereas processed_features_test.shape is (9574, 11526).
The sample data is as follows (the first column is 'labels' and the second column is 'text'):
neutral tap to explore the biggest change to world wars since world war
neutral tap to explore the biggest change to sliced bread.
negative apple blocked
neutral apple applesupport can i have a yawning emoji ? i think i am asking for the 3rd or 5th time
neutral apple made with 20 more child labor
negative apple is not she the one who said she hates americans ?
There are only 3 labels (positive, negative, neutral) in both the train and test data files.
Upvotes: 0
Views: 5141
Reputation: 2360
I had the same error and found it was because the number of samples was not equal to the number of labels.
More specifically, I had this code:
clf = MultinomialNB().fit(X_train, Y_train)
and the size of X_train was not equal to that of Y_train.
Then, I reviewed my code and fixed the mistake.
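In general, you can spot the mismatch before fitting by comparing the first dimension of X with the length of y. A minimal sketch with dummy arrays (the names are placeholders, not the original data):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Dummy data: the feature matrix and the label vector must have the same
# number of rows, otherwise fit() raises the "inconsistent numbers of
# samples" ValueError.
X_train = np.random.randint(0, 5, size=(100, 10))   # 100 samples, 10 features
Y_train = np.random.randint(0, 3, size=100)         # 100 labels

print(X_train.shape[0] == Y_train.shape[0])          # True -> safe to fit
clf = MultinomialNB().fit(X_train, Y_train)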
Upvotes: 1
Reputation: 12592
Since your test set is in a separate file, there's no need to split the data (unless you want a validation set, or the test set is unlabelled, as in competitions).
You shouldn't fit a new vectorizer on the test data; doing so means there is no connection between the columns in the training and test sets. Instead, use vectorizer.transform(features_test) with the same vectorizer object that you called fit_transform on for the training data.
So, try:
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

tweets_train = pd.read_csv('Final.csv')
features_train = tweets_train.iloc[:, 1].values
labels_train = tweets_train.iloc[:, 0].values

# Fit the vectorizer on the training text only; this fixes the vocabulary.
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
processed_features_train = vectorizer.fit_transform(features_train).toarray()

tweets_test = pd.read_csv('DataF1.csv')
features_test = tweets_test.iloc[:, 1].values.astype('U')
labels_test = tweets_test.iloc[:, 0].values

# Reuse the already-fitted vectorizer so the test columns match the training ones.
processed_features_test = vectorizer.transform(features_test).toarray()

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(processed_features_train, labels_train)
predictions = text_classifier.predict(processed_features_test)

print(confusion_matrix(labels_test, predictions))
print(classification_report(labels_test, predictions))
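To see why this matters, here's a minimal sketch with toy sentences (not your data): fitting a single CountVectorizer on the training text and only transforming the test text gives both matrices the same columns, which is exactly what the classifier needs.

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["apple blocked the update", "tap to explore the change"]
test_docs = ["apple made a change"]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns the vocabulary from the training text
X_te = vec.transform(test_docs)        # reuses that vocabulary for the test text

print(X_tr.shape, X_te.shape)          # same number of columns, e.g. (2, 8) and (1, 8)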
Upvotes: 1
Reputation: 439
It's because you're passing three datasets into train_test_split instead of just X and y as its arguments.
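For reference, the usual pattern is one feature matrix and one label vector of the same length; a minimal sketch with dummy data:

import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples with 5 features each, and 100 matching labels.
X = np.random.rand(100, 5)
y = np.random.randint(0, 3, size=100)

# test_size is a fraction here: 0.2 holds out 20% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)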
Upvotes: 0