Reputation: 3
Quick question:
I'm using string and nltk.stopwords to strip a block of text of all its punctuation and stopwords as part of data pre-processing, before feeding it into some natural language processing algorithms. I've tested each component separately on a couple of blocks of raw text, because I'm still getting used to this process, and it seemed fine.
import string
from nltk.corpus import stopwords

def text_process(text):
    """
    Takes in a string of text, and does the following operations:
    1. Removes punctuation.
    2. Removes stopwords.
    3. Returns a list of cleaned, "tokenized" words.
    """
    nopunc = [char for char in text.lower() if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split()
            if word not in stopwords.words('english')]
However, when I apply this function to the text column of my dataframe (it's text from a bunch of Pitchfork reviews), I can see that the punctuation isn't actually being removed, although the stopwords are.
Unprocessed:
pitchfork['content'].head(5)
0 “Trip-hop” eventually became a ’90s punchline,...
1 Eight years, five albums, and two EPs in, the ...
2 Minneapolis’ Uranium Club seem to revel in bei...
3 Minneapolis’ Uranium Club seem to revel in bei...
4 Kleenex began with a crash. It transpired one ...
Name: content, dtype: object
Processed:
pitchfork['content'].head(5).apply(text_process)
0 [“triphop”, eventually, became, ’90s, punchlin...
1 [eight, years, five, albums, two, eps, new, yo...
2 [minneapolis’, uranium, club, seem, revel, agg...
3 [minneapolis’, uranium, club, seem, revel, agg...
4 [kleenex, began, crash, it, transpired, one, n...
Name: content, dtype: object
Any thoughts on what's going wrong here? I've looked through the documentation, and I haven't seen anyone who's struggling with this problem in the exact same manner, so I'd love some insight on how to tackle this. Thanks so much!
Upvotes: 0
Views: 1202
Reputation: 10090
The problem here is that the text contains Unicode left and right quotation marks (single and double), which are distinct characters from the plain ASCII quotes included in string.punctuation.
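A quick check in a REPL confirms that the curly quotes are not in string.punctuation:

```python
import string

# The curly quotes from the review text are distinct Unicode code points,
# not the ASCII quote characters contained in string.punctuation.
curly = ['\u201c', '\u201d', '\u2018', '\u2019']  # “ ” ‘ ’
print([c in string.punctuation for c in curly])  # [False, False, False, False]
print('"' in string.punctuation)                 # True
```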
I would do something like

punctuation = [c for c in string.punctuation] + [u'\u201c', u'\u201d', u'\u2018', u'\u2019']
nopunc = [char for char in text.decode('utf-8').lower() if char not in punctuation]

This adds the code points for the non-ASCII quotation marks to a list called punctuation, then decodes the text from UTF-8 to Unicode and filters those characters out along with the rest.
Note: this is Python 2. If you're using Python 3, the formatting of the Unicode literals will be slightly different (no u prefix needed, and no decode step).
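For Python 3, a minimal sketch of the same idea (a Python 3 str is already Unicode, so the decode step goes away):

```python
import string

# Extend the punctuation set with the curly-quote code points.
extra_quotes = ['\u201c', '\u201d', '\u2018', '\u2019']  # “ ” ‘ ’
punctuation = set(string.punctuation) | set(extra_quotes)

def strip_punct(text):
    # Python 3 strings are Unicode already; no .decode() needed.
    return ''.join(char for char in text.lower() if char not in punctuation)

print(strip_punct('\u201cTrip-hop\u201d eventually became a \u201990s punchline'))
# triphop eventually became a 90s punchline
```

Using a set here also makes the membership test O(1) per character instead of scanning a list.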
Upvotes: 2