Darren Christopher

Reputation: 4779

Sklearn TfIdfVectorizer remove docs containing all stopwords

I am using sklearn's TfIdfVectorizer to vectorize my corpus. In my analysis, there are some documents whose terms are all filtered out because they consist entirely of stopwords. To reduce the sparsity issue, and because it is meaningless to include them in the analysis, I would like to remove these documents.

Looking into the TfIdfVectorizer doc, there is no parameter that can be set to do this. Therefore, I am thinking of removing these documents manually before passing the corpus into the vectorizer. However, this has a potential issue: the stopword list I have may not be the same as the one effectively used by the vectorizer, since I also use the min_df and max_df options to filter out terms.

Is there any better way to achieve what I am looking for (i.e. removing/ignoring documents that contain only stopwords)?

Any help would be greatly appreciated.

Upvotes: 2

Views: 2648

Answers (2)

Kartikey Singh

Reputation: 892

So, you can use this:

import re
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out punctuation and any tokens not containing letters or digits
    punctuations = "?:!.,;'"
    for token in tokens:
        if token in punctuations:
            continue  # skip punctuation (don't remove from the list while iterating over it)
        if re.search('[a-zA-Z0-9]', token):
            filtered_tokens.append(token)

    # TfidfVectorizer expects the tokenizer to return a list of tokens
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.01, stop_words='english',
                                   use_idf=True, tokenizer=tokenize)

tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
ids = np.array(tfidf_matrix.sum(axis=1) == 0).ravel()
tfidf_filtered = tfidf_matrix[~ids]

This way you can remove stopwords and empty rows while still using min_df and max_df.
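To make the empty-row-removal step concrete, here is a self-contained sketch using a tiny hypothetical corpus in place of `df['text']` (the document texts below are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the and of",              # only stopwords -> empty row
        "machine learning rocks",  # has real terms -> kept
        "of the and"]              # only stopwords -> empty row

tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(docs)

# rows that sum to zero had every term removed as a stopword
empty = np.asarray(matrix.sum(axis=1) == 0).ravel()

# filter both the matrix and the original document list, so the
# surviving rows can still be mapped back to their source texts
matrix_filtered = matrix[~empty]
kept_docs = [d for d, e in zip(docs, empty) if not e]
```

Filtering `docs` alongside the matrix is the useful part: after dropping rows you would otherwise lose track of which original document each remaining row belongs to.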

Upvotes: 0

Sergey Bushmanov

Reputation: 25189

You can:

  1. specify your stopwords and then, after TfidfVectorizer,
  2. filter out empty rows

The following code snippet shows a simplified example that should set you in the right direction:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["aa ab","aa ab ac"]
stop_words = ["aa","ab"]

tfidf = TfidfVectorizer(stop_words=stop_words)
corpus_tfidf = tfidf.fit_transform(corpus)
idx = np.array(corpus_tfidf.sum(axis=1)==0).ravel()
corpus_filtered = corpus_tfidf[~idx]

Feel free to ask questions if you still have any!

Upvotes: 2
