urukh

Reputation: 380

Python - NLTK separating punctuation

I'm pretty new to Python and I'm trying to use NLTK to remove stopwords from my file. The code works, but it separates punctuation from words: if my text is a tweet with a mention (@user), I get "@ user". Later I'll need to compute word frequencies, so I need mentions and hashtags to stay intact. My code:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stopword set once instead of rebuilding it for every line
stop_word = set(stopwords.words("portuguese"))

# Context managers close both files even if an error occurs
with open('newfile.txt', encoding="utf8") as arquivo, \
        open("stopwords.txt", "a", encoding="utf-8") as fp:
    for linha in arquivo:
        word_tokens = word_tokenize(linha)  # this is what splits "@user" into "@" and "user"
        filtered_sentence = [w for w in word_tokens if w not in stop_word]
        for words in filtered_sentence:
            fp.write(words + " ")
        fp.write("\n")

EDIT: Not sure if this is the best way to do it, but I fixed it this way:

for words in filtered_sentence:
    fp.write(words)
    # Needs `import string`; skipping the space after a punctuation token glues "@" to "user"
    if words not in string.punctuation:
        fp.write(" ")
fp.write("\n")

Upvotes: 3

Views: 2126

Answers (1)

ewcz

Reputation: 13087

Instead of word_tokenize, you could use the Twitter-aware tokenizer provided by NLTK:

from nltk.tokenize import TweetTokenizer

...
tknzr = TweetTokenizer()
...
word_tokens = tknzr.tokenize(linha)
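
To make this concrete, here is a minimal sketch of how TweetTokenizer could fit together with the stopword filtering and the word-frequency step from the question. The small STOP_WORDS set, the filter_tweet helper, and the sample tweets are all hypothetical, chosen only so the example runs without downloading the NLTK stopwords corpus; in the real script you would presumably keep set(stopwords.words("portuguese")) and read lines from the file.

```python
from collections import Counter

from nltk.tokenize import TweetTokenizer

# Hypothetical stand-in for set(stopwords.words("portuguese")),
# used here so the sketch runs without the NLTK stopwords corpus.
STOP_WORDS = {"o", "a", "de", "que", "e"}

tokenizer = TweetTokenizer()


def filter_tweet(line, stop_words=STOP_WORDS):
    """Tokenize one tweet and drop stopwords, keeping @mentions and #hashtags whole."""
    return [tok for tok in tokenizer.tokenize(line) if tok.lower() not in stop_words]


# Made-up sample tweets, standing in for the lines of newfile.txt.
sample_lines = [
    "@user adorei o #python hoje!",
    "o #python e a @user",
]

# Count tokens across all tweets after stopword removal.
frequencies = Counter()
for line in sample_lines:
    frequencies.update(filter_tweet(line))

print(frequencies.most_common(3))
```

Because the mention and the hashtag come back as single tokens, they are counted as "@user" and "#python" rather than as separate "@", "#", "user" and "python" entries, so the punctuation workaround from the question's edit should no longer be needed.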

Upvotes: 3
