Python - NLTK separating punctuation

Question

I'm pretty new to Python and I'm trying to use NLTK to remove stopwords from my file. The code works, but it separates punctuation from words: if my text is a tweet with a mention (@user), I get "@ user". Later I'll need to compute word frequencies, so I need mentions and hashtags to stay intact. My code:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import codecs
arquivo = open('newfile.txt', encoding="utf8")
linha = arquivo.readline()
while linha:
    stop_word = set(stopwords.words("portuguese"))
    word_tokens = word_tokenize(linha)
    filtered_sentence = [w for w in word_tokens if not w in stop_word]
    filtered_sentence = []
    for w in word_tokens:
       if w not in stop_word:
           filtered_sentence.append(w)
    fp = codecs.open("stopwords.txt", "a", "utf-8")
    for words in (filtered_sentence):
        fp.write(words + " ")
    fp.write("\n")
    linha = arquivo.readline()

EDIT: Not sure if this is the best way to do it, but I fixed it this way:

    # requires "import string" at the top of the file
    for words in filtered_sentence:
        fp.write(words)
        if words not in string.punctuation:
            fp.write(" ")
    fp.write("\n")

ewcz · Accepted Answer

Instead of word_tokenize, you could use the Twitter-aware tokenizer provided by NLTK:

from nltk.tokenize import TweetTokenizer

...
tknzr = TweetTokenizer()
...
word_tokens = tknzr.tokenize(linha)
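
To illustrate the difference, here is a minimal sketch of how this could fit into the stopword filtering from the question. The sample tweet, the `filter_tweet` helper, and the small inline stopword set are assumptions made for illustration; in the real script the set would be `set(stopwords.words("portuguese"))` as in the question.

```python
from collections import Counter

from nltk.tokenize import TweetTokenizer

# Hypothetical stand-in for set(stopwords.words("portuguese")), used here so
# the sketch runs without downloading the NLTK stopwords corpus.
stop_words = {"de", "a", "o", "que", "e", "para"}


def filter_tweet(line, tokenizer, stop_words):
    """Tokenize one line and drop stopwords; mentions and hashtags stay whole."""
    return [w for w in tokenizer.tokenize(line) if w not in stop_words]


tknzr = TweetTokenizer()

# Made-up sample line standing in for one line of newfile.txt.
tokens = filter_tweet("Obrigado @user pelo link de #python", tknzr, stop_words)
print(" ".join(tokens))

# A word frequency count now treats "@user" and "#python" as single items.
counts = Counter(tokens)
print(counts.most_common(3))
```

Because the tokenizer keeps "@user" and "#python" as single tokens, the `string.punctuation` workaround from the question's EDIT should no longer be needed for mentions and hashtags.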
