Reputation: 127
I am trying to remove both French and English stopwords in TfidfVectorizer. So far, I have only managed to remove the English stopwords; when I pass French to stop_words, I get an error saying it is not built-in.
In fact, I get the following error message:
ValueError: not a built-in stop list: french
I have a text document containing 700 lines of mixed French and English text.
I am clustering these 700 lines with Python. However, a problem arises: my clusters end up full of French stopwords, which is hurting their quality.
My question is the following:
Is there any way to add French stopwords or manually update the built-in English stopword list so that I can get rid of these unnecessary words?
Here is my TfidfVectorizer code with the stop_words setting:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
Removing these French stopwords would let me obtain clusters that are representative of the words that actually recur in my document.
In case anyone doubts the relevance of this question: I asked a related question last week. However, the two are not duplicates, as that one did not involve TfidfVectorizer.
Any help would be greatly appreciated. Thank you.
Upvotes: 6
Views: 26594
Reputation: 6659
You can use the good stopword lists from NLTK or spaCy, two super popular NLP libraries for Python. Since achultz has already added the snippet for using the stop-words library, I will show how to go about it with NLTK or spaCy.
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once (after import nltk) if the corpus is missing

final_stopwords_list = stopwords.words('english') + stopwords.words('french')
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                   max_features=200000,
                                   min_df=0.2,
                                   stop_words=final_stopwords_list,
                                   use_idf=True,
                                   tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
NLTK will give you 334 stopwords in total.
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

final_stopwords_list = list(fr_stop) + list(en_stop)
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                   max_features=200000,
                                   min_df=0.2,
                                   stop_words=final_stopwords_list,
                                   use_idf=True,
                                   tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
spaCy gives you 890 stopwords in total.
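Either list plugs straight into the vectorizer. A minimal usage sketch, assuming docs is a Python list holding your 700 mixed-language lines (the exact stopword counts may vary across library versions):

tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
print(len(final_stopwords_list))  # size of the combined stopword list
print(tfidf_matrix.shape)         # (number of lines, number of retained terms)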
Upvotes: 14
Reputation: 406
In my experience, the easiest way to work around this problem is to delete the stopwords manually in the preprocessing stage (taking a list of the most common French words from elsewhere).
It should also be handy to check which stopwords occur most often in English and French in your text/model (either by their raw occurrences or by IDF, as sketched below) and add them to the stopwords you exclude in the preprocessing stage.
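For the IDF check, scikit-learn exposes the learned weights after fitting. A minimal sketch, assuming docs is your list of lines; the terms with the lowest IDF are the most frequent across documents and are good stopword candidates:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(use_idf=True)
vec.fit(docs)  # docs = your 700 lines
terms = vec.get_feature_names_out()  # use get_feature_names() on older scikit-learn
for idf, term in sorted(zip(vec.idf_, terms))[:20]:
    print(f"{term}: {idf:.2f}")  # lowest IDF = most common = stopword candidate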
If you prefer to remove the words with TfidfVectorizer's built-in mechanism, make a list of stopwords that includes both the French and English ones and pass it in:

stopwords = ['a', 'he', 'she', 'le', ...]
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=stopwords,
                                   analyzer='word', use_idf=True,
                                   tokenizer=tokenize_and_stem)
The important thing, quoting the documentation, is:
‘english’ is currently the only supported string value
So for now you will have to supply your own stopword list, which you can find anywhere on the web and then adjust to your topic, for example: stopwords
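Once you have saved such a list locally, one word per line, loading it is straightforward (the filename here is just a placeholder for whatever list you curated):

with open('my_stopwords.txt', encoding='utf-8') as f:  # placeholder filename
    stopwords = [line.strip() for line in f if line.strip()]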
Upvotes: 0
Reputation: 1678
Igor Sharm noted ways to do things manually, but perhaps you could also install the stop-words package. Then, since TfidfVectorizer accepts a list as its stop_words parameter:
from sklearn.feature_extraction.text import TfidfVectorizer
from stop_words import get_stop_words

my_stop_word_list = get_stop_words('english') + get_stop_words('french')
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=my_stop_word_list,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
You could also read and parse the french.txt file in the GitHub project and keep only the words you want, as sketched below.
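A sketch of that filtering step, assuming you have downloaded french.txt from the stop-words GitHub project so it is available locally (the filter shown, dropping one-letter tokens, is just an example; substitute your own criteria):

from stop_words import get_stop_words

with open('french.txt', encoding='utf-8') as f:  # file from the stop-words project
    french = [w.strip() for w in f if w.strip()]

# keep only the entries you actually want, e.g. drop one-letter tokens
my_stop_word_list = get_stop_words('english') + [w for w in french if len(w) > 1]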
Upvotes: 1