Reputation: 11
I am fairly new to machine learning, so apologies in advance. I am trying to read from a txt file that contains training samples like this:
123 this is a long text string
325 another text
and my labels.txt file looks like this:
123 1
325 2
So after many tries I've managed to read them with pandas:
train_labels = pd.read_csv('train_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)
train_samples = pd.read_csv('train_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)
Then I convert the text column of my training samples with a vectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stop_words)
X = tfidfconverter.fit_transform(train_samples.iloc[:, 1]).toarray()
Then I try to fit my classifier, a random forest:
clf = RandomForestClassifier(n_estimators=1000, random_state=0)
clf.fit(X, train_labels) -> error
Then I read the validation samples to calculate my accuracy score:
validation_source_samples = pd.read_csv('validation_source_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)
validation_source_labels = pd.read_csv('validation_source_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)
T = tfidfconverter.fit_transform(validation_source_samples.iloc[:, 1]).toarray()
pred = clf.predict(T)
On clf.predict I get the error:
`ValueError: Number of features of the model must match the input. Model n_features is 780 and input n_features is 879`
I have searched for answers on this type of error, but nothing seemed to match my input files and problem. Sorry in advance if this has been answered before.
Upvotes: 0
Views: 27
Reputation: 10824
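It's because you fit the vectorizer again on the validation data. Calling fit_transform a second time makes the vectorizer learn a new vocabulary from the validation text (879 features), while the model was trained on the vocabulary learned from the training text (780 features). You can fix it by changing fit_transform on the validation line to transform, so the vocabulary learned from the training data is reused: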
T = tfidfconverter.transform(validation_source_samples.iloc[:, 1]).toarray()
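To see the effect in isolation, here is a minimal sketch of the fit-on-train, transform-on-validation pattern. The tiny text lists and the names train_texts, val_texts, vectorizer and refit_width are made up for illustration and stand in for the files and variables in the question; the vectorizer and forest use mostly default settings rather than the question's parameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the train and validation files (assumption, not the real data).
train_texts = [
    "this is a long text string",
    "another text",
    "a long string again",
    "yet another short text",
]
train_labels = ["1", "2", "1", "2"]
val_texts = ["some unseen long words here", "another unseen text"]

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary: transform, not fit_transform.
X_val = vectorizer.transform(val_texts)

# For comparison: refitting on the validation text yields a different number of columns.
refit_width = TfidfVectorizer().fit_transform(val_texts).shape[1]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, train_labels)
pred = clf.predict(X_val)

print(X_train.shape[1], X_val.shape[1], refit_width)
```

The first two printed numbers match because both matrices share the training vocabulary. The third differs, which is the kind of mismatch that triggers the ValueError in the question.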
Upvotes: 0