Isaac Nikolai Fox

Reputation: 3

Error when using string.punctuation to remove punctuation from a string

Quick question:

I'm using string and NLTK's stopwords corpus (nltk.corpus.stopwords) to strip a block of text of all its punctuation and stopwords as part of data pre-processing before feeding it into some natural language processing algorithms.

I've tested each component separately on a couple blocks of raw text because I'm still getting used to this process, and it seemed fine.

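    import string

    from nltk.corpus import stopwords

    def text_process(text):
        """
        Takes in a string of text and does the following operations:
        1. Removes punctuation.
        2. Removes stopwords.
        3. Returns a list of cleaned "tokenized" text.
        """
        # Keep only the characters that are not in string.punctuation.
        nopunc = [char for char in text.lower() if char not in string.punctuation]
        nopunc = ''.join(nopunc)

        # Build the stopword set once instead of reloading the list per word.
        stop_words = set(stopwords.words('english'))
        return [word for word in nopunc.split() if word not in stop_words]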

However, when I apply this function to the text column of my dataframe – it's text from a bunch of Pitchfork reviews – I can see that the punctuation isn't actually being removed, although the stopwords are.

Unprocessed:

    pitchfork['content'].head(5)

    0    “Trip-hop” eventually became a ’90s punchline,...
    1    Eight years, five albums, and two EPs in, the ...
    2    Minneapolis’ Uranium Club seem to revel in bei...
    3    Minneapolis’ Uranium Club seem to revel in bei...
    4    Kleenex began with a crash. It transpired one ...
    Name: content, dtype: object

Processed:

    pitchfork['content'].head(5).apply(text_process)


    0    [“triphop”, eventually, became, ’90s, punchlin...
    1    [eight, years, five, albums, two, eps, new, yo...
    2    [minneapolis’, uranium, club, seem, revel, agg...
    3    [minneapolis’, uranium, club, seem, revel, agg...
    4    [kleenex, began, crash, it, transpired, one, n...
    Name: content, dtype: object

Any thoughts on what's going wrong here? I've looked through the documentation, and I haven't seen anyone who's struggling with this problem in the exact same manner, so I'd love some insight on how to tackle this. Thanks so much!

Upvotes: 0

Views: 1202

Answers (1)

wpercy

Reputation: 10090

The problem here is that your text contains Unicode left and right quotation marks (single and double). These are different characters from the plain ASCII quotation marks in string.punctuation, which only covers ASCII, so they are never filtered out.

I would do something like

    punctuation = [ c for c in string.punctuation ] + [u'\u201c',u'\u201d',u'\u2018',u'\u2019']

    nopunc = [ char for char in text.decode('utf-8').lower() if char not in punctuation ]

This adds the four non-ASCII quotation marks (U+201C, U+201D, U+2018, U+2019) to a list called punctuation, decodes the text from UTF-8 into a unicode string, and then filters those characters out along with the ASCII punctuation.

Note: this is Python 2. In Python 3, str is already Unicode, so the .decode('utf-8') call is unnecessary (str has no decode method there) and the u prefix on the literals is optional.
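
For Python 3, a minimal sketch of the same idea might look like the following. It is only an illustration under a few assumptions: STOPWORDS is a tiny stand-in set I made up so the example runs without downloading the NLTK data, and the stop_words parameter is a hypothetical addition to the function from the question. In practice you would pass set(stopwords.words('english')) instead.

```python
import string

# ASCII punctuation plus the four curly quotes:
# U+201C, U+201D (double) and U+2018, U+2019 (single).
PUNCTUATION = set(string.punctuation) | {"\u201c", "\u201d", "\u2018", "\u2019"}

# Stand-in stopword set (an assumption for this sketch, not the NLTK list).
STOPWORDS = {"a", "the", "it"}


def text_process(text, stop_words=STOPWORDS):
    """Lowercase text, drop punctuation, and return the non-stopword tokens."""
    # Python 3 strings are already Unicode, so no decode step is needed.
    nopunc = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    return [word for word in nopunc.split() if word not in stop_words]


print(text_process("“Trip-hop” eventually became a ’90s punchline"))
# ['triphop', 'eventually', 'became', '90s', 'punchline']
```

Using a set for the punctuation makes each membership check constant time, which helps when the function is applied to a whole DataFrame column. The set here only covers the four quotation marks from this question, so other non-ASCII punctuation (dashes, ellipses) would still need to be added if it shows up in the reviews.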

Upvotes: 2
