I have trained a TF-IDF model on a specific corpus. This corpus is a set of strings that has been cleaned: I have removed stopwords and numbers, done some stemming, etc.
The TF-IDF is trained on the CLEANED CORPUS.
However, when I look at the vocabulary of the fitted TF-IDF model, it contains words which are not in the CLEANED CORPUS.
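import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# clean_corpus() is my own cleaning function (stopwords, numbers, stemming, etc.)
corpus = clean_corpus(corpus)
vectorizer = TfidfVectorizer(max_df=0.5)
vec_trained = vectorizer.fit_transform(corpus)
# Each entry is a one-element list, hence keyword[0] below
keywords_tf_idf = pd.DataFrame(vectorizer.get_feature_names_out()).values.tolist()

counter = 0
list_words = []
# Collect every whitespace-separated word from the sentences
for sentence in all_sentences:
    words = sentence.split()
    for word in words:
        list_words.append(word)

# Count the vocabulary entries that never appear as a whole word
for keyword in keywords_tf_idf:
    if keyword[0] not in list_words:
        counter += 1
        print(keyword)
print(counter)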
The counter ends up being something in the hundreds (the TF-IDF vocabulary has 17,000+ words).
Does vectorizer.fit_transform() edit words by itself? Is there some weird UTF-8 encoding thing going on which vectorizer.fit_transform() overwrites?
I do not understand how words that are not in the corpus the model was trained on can show up in the model.
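One way to narrow this down might be to compare whitespace splitting with the kind of tokenization the vectorizer applies. The sketch below is only an approximation under assumptions: it mimics what I understand to be the documented defaults of TfidfVectorizer (lowercase=True and token_pattern r"(?u)\b\w\w+\b") using plain `re`, and it is not the library's own code. The helper names and the toy corpus are hypothetical, made up for illustration.

```python
import re

# Assumption: this mirrors TfidfVectorizer's documented defaults
# (lowercase=True, token_pattern=r"(?u)\b\w\w+\b"); it is not sklearn's code.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_like_tokens(text):
    """Lowercase the text and extract runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def unseen_by_split(sentences):
    """Return regex tokens that never occur as a whitespace-split word."""
    split_words = {word for s in sentences for word in s.split()}
    regex_tokens = {tok for s in sentences for tok in default_like_tokens(s)}
    return sorted(regex_tokens - split_words)


# Hypothetical toy corpus, not taken from my real data
toy_corpus = ["state-of-the-art model", "user's e-mail Address"]
print(unseen_by_split(toy_corpus))
# → ['address', 'art', 'mail', 'of', 'state', 'the', 'user']
```

If the real corpus still contains hyphens, apostrophes, other punctuation, or uppercase letters after cleaning, a mismatch of this kind could account for at least some of the "missing" words, since `sentence.split()` keeps such strings whole while the regex breaks them apart and lowercases them.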
Thanks for your help.