still_learning

Reputation: 806

TfidfVectorizer using my own stopwords dictionary

I would like to ask if I can use my own stop-word dictionary instead of the pre-existing one in TfidfVectorizer. I built a larger dictionary of stop words and would prefer to use it, but I am having difficulty including it in the code below (which currently shows the standard one).

import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words_='english')
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Words']) # multiple dataframes

kmeans = KMeans(n_clusters=2).fit(tfidf)

but I got the following error:

    TypeError: __init__() got an unexpected keyword argument 'stop_words_'

Let's say that my dictionary is:

stopwords = ["a", "an", ... "been", "had", ...]

How could I include it?

Any help would be greatly appreciated.

Upvotes: 0

Views: 1792

Answers (2)

pitter-patter

Reputation: 36

TfidfVectorizer does not have a parameter named 'stop_words_'; the constructor argument is stop_words, without the trailing underscore. In scikit-learn, names ending in an underscore (such as stop_words_) are attributes set on the estimator after fitting, not constructor arguments, which is why passing one to __init__ raises a TypeError.
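A minimal sketch of the distinction (the one-line corpus is just a toy input to fit the vectorizer):

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words='english')   # parameter: configured at construction
vec.fit(["a toy sentence for illustration"])
print(vec.stop_words_)                        # attribute: the effective stop word set, available only after fitting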

If you have a custom stop_words list as below:

smart_stoplist = ['a', 'an', 'the']

Use it like this:

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words=smart_stoplist)
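Putting it together with the preprocessor from the question, a self-contained sketch looks like this (the two-document corpus is made up for illustration; substitute your dataframe column):

import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

smart_stoplist = ['a', 'an', 'the']
docs = ["A quick brown fox.", "The lazy dog sleeps all day."]  # toy corpus for illustration

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing, stop_words=smart_stoplist)
tfidf = tfidf_vectorizer.fit_transform(docs)

kmeans = KMeans(n_clusters=2).fit(tfidf)

Note that stop word filtering is applied to the tokens after your preprocessor runs, so the entries in the list should be lowercase to match the lowercased text.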

Upvotes: 1

Ehsan

Reputation: 711

This is a better way to do what you are after: note that TfidfVectorizer accepts a tokenizer parameter, a callable that returns a cleaned list of tokens, so you can do the stop word filtering (and any other cleanup) yourself inside it. I thought this might be useful for you!

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
# 'punkt' is needed by word_tokenize and 'wordnet' by WordNetLemmatizer
nltk.download(['stopwords', 'punkt', 'wordnet'])
# here you can extend stop_words with any other words you want, or replace it with your own array-like of stop words
stop_words = stopwords.words('english')

lemmatizer = WordNetLemmatizer()  # create once instead of once per word

def preprocessing(line):
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())  # keep letters only, lowercase the rest
    words = word_tokenize(line)
    words_lemmed = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)

tfidf = tfidf_vectorizer.fit_transform(df['Texts'])

kmeans = KMeans(n_clusters=2).fit(tfidf)
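If you want to fold your own dictionary into the NLTK list, as the comment above suggests, plain list concatenation is enough (the extra words here are hypothetical examples):

custom_words = ["however", "moreover"]  # hypothetical additions; use your own dictionary here
stop_words = stopwords.words('english') + custom_words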

Upvotes: 2
