Classification of int and string txt throws ValueError: Number of features of the model must match the input. Model n_features

Question

I am fairly new to machine learning, so apologies in advance. I am trying to read a txt file that contains training samples like this:

123 this is a long text string

325 another text

My labels.txt file looks like this:

123 1

325 2

So after many tries I've managed to read them with pandas:

import pandas as pd

train_labels = pd.read_csv('train_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)

train_samples = pd.read_csv('train_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)

I then convert the text column of my training samples with a TF-IDF vectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stop_words)

X = tfidfconverter.fit_transform(train_samples.iloc[:, 1]).toarray()

Then I try to fit my classifier, a random forest:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, random_state=0)

clf.fit(X, train_labels) -> error
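
A side note on the fit step: with header=None, a labels file laid out as an id followed by a label is read as a two-column frame, so passing the whole frame as the target also hands the id column to the classifier. The sketch below assumes the file is tab-separated, as in the read_csv calls above, and uses a made-up two-line stand-in (labels_file) instead of the real train_labels.txt; it only shows how the label column could be selected on its own.

```python
import io

import pandas as pd

# Made-up stand-in for train_labels.txt: tab-separated id and label (assumption).
labels_file = io.StringIO("123\t1\n325\t2\n")
train_labels = pd.read_csv(labels_file, dtype=str, delimiter="\t", header=None)

# Column 0 holds the sample id, column 1 the class label.
y = train_labels.iloc[:, 1]

print(train_labels.shape)  # full frame: ids and labels
print(y.tolist())          # labels only, suitable as the target for clf.fit
```

With the real data this would mean calling clf.fit(X, train_labels.iloc[:, 1]), provided the rows of the samples file and the labels file are in the same order.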

Then I read the validation samples so I can compute an accuracy score:

validation_source_samples = pd.read_csv('validation_source_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)

validation_source_labels = pd.read_csv('validation_source_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)

T = tfidfconverter.fit_transform(validation_source_samples.iloc[:, 1]).toarray()


pred = clf.predict(T)

Calling clf.predict raises this error:

`ValueError: Number of features of the model must match the input. Model n_features is 780 and input n_features is 879`

I have searched for answers on this type of error but nothing seemed to match my actual input files and problem. Sorry in advance if it has been answered before.
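
The two feature counts in the error likely differ because fit_transform is called a second time on the validation text. That call learns a fresh vocabulary, so the validation matrix ends up with a different number of columns than the matrix the forest was trained on. A minimal sketch of the usual pattern follows, using made-up toy sentences and labels (train_texts, train_y, valid_texts are illustrative, not the real data): fit the vectorizer once on the training text, then call transform on everything else.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the text columns and labels (made up for illustration).
train_texts = ["red apple pie", "green apple tart", "fast red car", "slow green car"]
train_y = ["food", "food", "vehicle", "vehicle"]
valid_texts = ["blue apple cake", "blue fast truck"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary from training text
X_valid = vectorizer.transform(valid_texts)      # reuses it; unseen words are dropped

# Refitting on the validation text builds a different vocabulary,
# which is what produces a mismatched number of columns.
X_refit = TfidfVectorizer().fit_transform(valid_texts)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, train_y)
pred = clf.predict(X_valid)  # works: same number of columns as X_train

print(X_train.shape[1], X_valid.shape[1], X_refit.shape[1])
```

Applied to the code in the question, this would mean replacing tfidfconverter.fit_transform(validation_source_samples.iloc[:, 1]) with tfidfconverter.transform(validation_source_samples.iloc[:, 1]).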
