Dino

Reputation: 1

TFIDF model created by TfidfVectorizer contains words which are not in the corpus it was trained on

I have trained a TF-IDF model on a specific corpus. This corpus is a set of strings that has been cleaned: I have removed stopwords and numbers, done some stemming, etc.

The TF-IDF is trained on the CLEANED CORPUS.

However, when I look at the words that are in the TF-IDF, there are words in there which are not in the CLEANED CORPUS.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = clean_corpus(corpus)
vectorizer = TfidfVectorizer(max_df=0.5)
vec_trained = vectorizer.fit_transform(corpus)  # fit on the cleaned corpus

# Vocabulary learned by the vectorizer, as a list of single-element lists
keywords_tf_idf = pd.DataFrame(vectorizer.get_feature_names_out()).values.tolist()

# Collect every whitespace-separated word of the cleaned corpus
corpus_words = set()
for sentence in corpus:
    for word in sentence.split():
        corpus_words.add(word)

# Count vocabulary entries that never appear as a whitespace-separated word
counter = 0
for keyword in keywords_tf_idf:
    if keyword[0] not in corpus_words:
        counter += 1
        print(keyword)
print(counter)

The counter ends up in the hundreds (the TF-IDF vocabulary has 17,000+ words).

Does vectorizer.fit_transform() edit words by itself? Is there some weird UTF-8 encoding issue going on that vectorizer.fit_transform() overrides?
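For reference, here is a minimal check I sketched (it assumes the corpus and vectorizer variables from the code above, and uses scikit-learn's build_analyzer(), which returns the preprocessing-and-tokenization callable the vectorizer applies when fitting):

analyzer = vectorizer.build_analyzer()  # the preprocessing/tokenization fit_transform uses

whitespace_tokens = set()
analyzer_tokens = set()
for sentence in corpus:
    whitespace_tokens.update(sentence.split())
    analyzer_tokens.update(analyzer(sentence))

# Tokens the vectorizer produces that a plain whitespace split never yields
print(analyzer_tokens - whitespace_tokens)

If anything shows up there, I suppose the mismatch comes from the vectorizer's own lowercasing/tokenization rather than from my cleaning, but I am not sure.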

I do not understand how words that are not in the corpus the model was trained on show up in the model.

Thanks for your help.

Upvotes: 0

Views: 202

Answers (0)
