Reputation: 33
I am using stopwords together with the sentence tokenizer, but when I print the filtered sentences the result still includes stop words. The stop words are not being removed from the output. How can I remove stop words when using the sentence tokenizer?
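import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

userinput1 = input ("Enter file name:")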
myfile1 = open(userinput1).read()
stop_words = set(stopwords.words("english"))
word1 = nltk.sent_tokenize(myfile1)
filtration_sentence = []
for w in word1:
    word = sent_tokenize(myfile1)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)
userinput2 = input ("Enter file name:")
myfile2 = open(userinput2).read()
stop_words = set(stopwords.words("english"))
word2 = nltk.sent_tokenize(myfile2)
filtration_sentence = []
for w in word2:
    word = sent_tokenize(myfile2)
    filtered_sentence = [w for w in word if not w in stop_words]
    print(filtered_sentence)
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]
'''remove punctuation, lowercase, stem'''
def normalize(text):
    return stem_tokens(nltk.sent_tokenize(text.lower().translate(remove_punctuation_map)))
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
def cosine_sim(myfile1, myfile2):
    tfidf = vectorizer.fit_transform([myfile1, myfile2])
    return ((tfidf * tfidf.T).A)[0,1]
print(cosine_sim(myfile1,myfile2))
Upvotes: 0
Views: 2226
Reputation: 5389
I don't think you can remove stop words directly from a sentence. You first have to split each sentence into words, for example with nltk.word_tokenize. Then, for each word, you check whether it is in the stop word list. Here is an example:
import nltk
from nltk.corpus import stopwords
stopwords_en = set(stopwords.words('english'))
sents = nltk.sent_tokenize("This is an example sentence. We will remove stop words from this")
sents_rm_stopwords = []
for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w.lower() not in stopwords_en))
Output
['example sentence .', 'remove stop words']
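To see why filtering at the sentence level leaves everything in place, here is a minimal sketch in plain Python. The three-word stop_words set is a hypothetical sample for illustration, not the real NLTK list, and str.split is only a rough stand-in for nltk.word_tokenize.

```python
# Hypothetical sample; the real list comes from stopwords.words('english').
stop_words = {"this", "is", "an"}

sentences = ["This is an example sentence.", "We will remove stop words from this"]

# Sentence-level check: each element is a whole sentence string,
# so it never equals a single stop word and nothing is filtered out.
kept_sentences = [s for s in sentences if s not in stop_words]

# Word-level check: split the sentence first, then compare each lowercased word.
kept_words = [w for w in sentences[0].split() if w.lower() not in stop_words]

print(kept_sentences == sentences)
print(kept_words)
```

The first check returns the sentences unchanged, which matches the behaviour described in the question; only the word-level check actually drops anything.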
Note that you can also remove punctuation using string.punctuation.
import string
stopwords_punctuation = stopwords_en.union(string.punctuation) # merge stop words and punctuation characters into one set
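As a rough sketch of how the merged set could then be used, the snippet below filters each sentence against stop words and punctuation in one pass. To keep it runnable without downloading NLTK data, it assumes a tiny hypothetical stop word set (STOPWORDS_DEMO) and a regex-based helper (simple_word_tokenize) in place of stopwords.words('english') and nltk.word_tokenize; both names are made up for this illustration.

```python
import re
import string

# Hypothetical, tiny stop word list for illustration only; in practice
# you would use set(stopwords.words('english')) from NLTK.
STOPWORDS_DEMO = {"this", "is", "an", "we", "will", "from"}

# Merge stop words and single punctuation characters into one set.
STOPWORDS_PUNCTUATION = STOPWORDS_DEMO.union(string.punctuation)


def simple_word_tokenize(sentence):
    """Rough stand-in for nltk.word_tokenize: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def remove_stopwords(sentences, blocked=STOPWORDS_PUNCTUATION):
    """Return each sentence with stop words and punctuation tokens dropped."""
    cleaned = []
    for sent in sentences:
        kept = [w for w in simple_word_tokenize(sent) if w.lower() not in blocked]
        cleaned.append(" ".join(kept))
    return cleaned


sents = ["This is an example sentence.", "We will remove stop words from this"]
result = remove_stopwords(sents)
print(result)  # ['example sentence', 'remove stop words']
```

Compared with the earlier output, the trailing '.' token is gone because punctuation characters are now part of the blocked set.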
Upvotes: 1