Sklearn TfidfVectorizer: remove docs containing only stopwords

Question

I am using sklearn's TfidfVectorizer to vectorize my corpus. In my analysis, there are some documents in which all terms are filtered out because they contain only stopwords. To reduce sparsity, and because it is meaningless to include them in the analysis, I would like to remove them.

Looking at the TfidfVectorizer documentation, there is no parameter that can be set to do this. Therefore, I am thinking of removing these documents manually before passing the corpus to the vectorizer. However, this has a potential problem: the stopword list I have is not the same as the set of terms the vectorizer actually drops, since I also use the min_df and max_df options to filter out terms.

Is there any better way to achieve what I am looking for (i.e. removing/ignoring documents containing only stopwords)?

Any help would be greatly appreciated.

Sergey Bushmanov · Accepted Answer

You can:

1. specify your stopwords, and then
2. after TfidfVectorizer, filter out empty rows.

The following code snippet shows a simplified example that should set you in the right direction:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["aa ab", "aa ab ac"]
stop_words = ["aa", "ab"]

tfidf = TfidfVectorizer(stop_words=stop_words)
corpus_tfidf = tfidf.fit_transform(corpus)

# Boolean mask: True for rows whose terms were all filtered out
idx = np.array(corpus_tfidf.sum(axis=1) == 0).ravel()
# Keep only the non-empty rows
corpus_filtered = corpus_tfidf[~idx]
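
Since the question also mentions min_df and max_df, here is a rough sketch of the same idea with min_df set. The filtering happens after fit_transform, so it should catch documents emptied by any of the vectorizer's term filters, not just the explicit stopword list. The toy corpus and the names nonempty, matrix_filtered and docs_filtered are made up for illustration, and keeping the original texts aligned with the matrix is my own addition:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: the third document has only stop words,
# the fourth has one non-stop word ("ad") that appears in a single document.
corpus = ["aa ab ac", "aa ac", "aa ab", "ab ad"]
stop_words = ["aa", "ab"]

# min_df=2 additionally drops terms found in fewer than 2 documents.
tfidf = TfidfVectorizer(stop_words=stop_words, min_df=2)
corpus_tfidf = tfidf.fit_transform(corpus)

# TF-IDF weights are non-negative, so a row sum of 0 means the row is empty.
nonempty = np.asarray(corpus_tfidf.sum(axis=1)).ravel() > 0

# Apply the same mask to the matrix and to the original texts.
matrix_filtered = corpus_tfidf[nonempty]
docs_filtered = [doc for doc, keep in zip(corpus, nonempty) if keep]
```

Here the third document is dropped because of the stopword list and the fourth because of min_df, so only the first two documents remain in both matrix_filtered and docs_filtered.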

Feel free to ask questions if you still have any!
