cafedona
Reputation: 11

Classification of int and string txt throws ValueError: Number of features of the model must match the input. Model n_features

I am fairly new to machine learning, so sorry in advance. I am trying to read from a txt file that contains training samples in this form (an ID followed by a text string):

123 this is a long text string

325 another text

and my labels.txt file looks like this (an ID followed by a label):

123 1

325 2

After many tries I managed to read them with pandas:

import pandas as pd

train_labels = pd.read_csv('train_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)

train_samples = pd.read_csv('train_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)

Then I convert the text column of my training samples with a TF-IDF vectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stop_words)

X = tfidfconverter.fit_transform(train_samples.iloc[:, 1]).toarray()

Then I try to fit my classifier, a random forest:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, random_state=0)

clf.fit(X, train_labels) -> error

Then I read the validation samples to calculate my accuracy score:

validation_source_samples = pd.read_csv('validation_source_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)

validation_source_labels = pd.read_csv('validation_source_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)

T = tfidfconverter.fit_transform(validation_source_samples.iloc[:, 1]).toarray()


pred = clf.predict(T)

On clf.predict I get the error:

`ValueError: Number of features of the model must match the input. Model n_features is 780 and input n_features is 879`

I have searched for answers on this type of error, but nothing seemed to match my input files and problem. Sorry in advance if this has been answered before.

Upvotes: 0

Views: 27

Answers (1)

Damzaky
Reputation: 10824

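It's because you fit the vectorizer a second time on the validation data. Calling fit_transform there builds a new vocabulary from the validation text (879 features), while the model was trained on the features produced by the vectorizer fitted on the training data (780 features). You can fix it by changing fit_transform to transform on the validation line, so the vocabulary learned from the training data is reused: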

T = tfidfconverter.transform(validation_source_samples.iloc[:, 1]).toarray()
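To see the difference in isolation, here is a minimal sketch on made-up toy sentences (the texts, labels, and variable names below are hypothetical and are not taken from your files). It fits the vectorizer once on the training text, reuses it for the validation text, and also refits a fresh vectorizer on the validation text to show why the column counts stop matching:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the train/validation files.
train_texts = [
    "the cat sat on the mat",
    "dogs chase cats",
    "birds fly high",
    "fish swim deep",
]
train_labels = [1, 1, 2, 2]
val_texts = ["a cat chased the birds", "deep sea fish and whales"]

# Fit the vectorizer ONCE, on the training text only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary: same columns as X_train.
X_val = vectorizer.transform(val_texts)

# For comparison only: refitting on the validation text builds a
# different vocabulary, so the number of columns changes.
X_val_refit = TfidfVectorizer().fit_transform(val_texts)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, train_labels)

# Works, because X_val has the same number of features as X_train.
pred = clf.predict(X_val)

print("train features:", X_train.shape[1])
print("validation features (transform):", X_val.shape[1])
print("validation features (refit):", X_val_refit.shape[1])
```

Words that appear only in the validation text are simply ignored by transform, which is what you want: the model has never seen those columns and could not use them anyway.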

Upvotes: 0
