mrantry

Reputation: 13

Stripping Punctuation from Python String

I seem to be having a bit of an issue stripping punctuation from a string in Python. Here, I'm given a text file (specifically a book from Project Gutenberg) and a list of stopwords. I want to return a dictionary of the 10 most commonly used words. Unfortunately, I keep getting one hiccup in my returned dictionary.

import sys
import collections
from string import punctuation
import operator

#should return a string without punctuation
def strip_punc(s):
    return ''.join(c for c in s if c not in punctuation)

def word_cloud(infile, stopwordsfile):

    wordcount = {}

    #Reads the stopwords into a list
    stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]


    #reads data from the text file into a list
    lines = []
    with open(infile) as f:
        lines = f.readlines()
        lines = [line.split() for line in lines]

    #does the wordcount
    for line in lines:
        for word in line:
            word = strip_punc(word).lower()
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1

    #sorts the dictionary, grabs 10 most common words
    output = dict(sorted(wordcount.items(),
                  key=operator.itemgetter(1), reverse=True)[:10])

    print(output)


if __name__=='__main__':

    try:

        word_cloud(sys.argv[1], sys.argv[2])

    except Exception as e:

        print('An exception has occurred:')
        print(e)
        print('Try running as python3 word_cloud.py <input-text> <stopwords>')

This will print out

{'said': 659, 'mr': 606, 'one': 418, '“i': 416, 'lorry': 322, 'upon': 288, 'will': 276, 'defarge': 268, 'man': 264, 'little': 263}

The '“i' entry shouldn't be there. I don't understand why the quotation mark isn't removed by my helper function.

Thanks in advance.

Upvotes: 1

Views: 1229

Answers (4)

Daniel Corin

Reputation: 2067

The character is “ (U+201C, left double quotation mark), not the ASCII " (U+0022).

string.punctuation only includes the following ASCII characters:

In [1]: import string

In [2]: string.punctuation
Out[2]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

so you will need to augment the list of characters you are stripping.

Something like the following should accomplish what you need:

extended_punc = punctuation + '“' #  and any other characters you need to strip

def strip_punc(s):
    return ''.join(c for c in s if c not in extended_punc)
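To decide which characters to add to extended_punc, it can help to identify the offending character first. A minimal sketch using only the standard library (the variable name ch is just for illustration):

```python
import unicodedata
from string import punctuation

ch = '\u201c'  # the curly opening quote stuck to 'i' in the question's output

print(ch in punctuation)      # False: string.punctuation is ASCII-only
print(hex(ord(ch)))           # 0x201c
print(unicodedata.name(ch))   # LEFT DOUBLE QUOTATION MARK
```

The same check can be run on any other stray character that shows up in the word counts.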

Alternatively, you could use the package unidecode to ASCII-fy your text and not worry about creating a list of unicode characters you may need to handle:

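from string import punctuation
from unidecode import unidecode

def strip_punc(s):
    # unidecode maps curly quotes and accented letters to ASCII; in Python 3 it
    # takes and returns str, so no decode/encode round trip is needed
    return ''.join(c for c in unidecode(s) if c not in punctuation)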
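If installing a third-party package is not an option, another possibility is to test each character's Unicode general category with the standard-library unicodedata module. This is only a sketch, and the function name strip_punc_unicode is hypothetical:

```python
import unicodedata

def strip_punc_unicode(s):
    """Drop every character whose Unicode general category is punctuation (P*)."""
    return ''.join(c for c in s if not unicodedata.category(c).startswith('P'))

print(strip_punc_unicode('\u201ci'))                            # i
print(strip_punc_unicode('\u201cHello,\u201d said Mr. Lorry!'))  # Hello said Mr Lorry
```

Note that some members of string.punctuation, such as $, + and ~, belong to the symbol categories (S*) rather than the punctuation ones, so this version leaves them in place unless the check is widened to cover 'S' as well.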

Upvotes: 5

lenz

Reputation: 5828

As stated in other answers, the problem is that string.punctuation only contains ASCII characters, so typographical ("fancy") quotes like “ and ” are missing, among many others.

You could replace your strip_punc function with the following:

import re

def strip_punc(s):
    '''
    Remove all punctuation characters.
    '''
    return re.sub(r'[^\w\s]', '', s)

This approach uses the re module. The regular expression matches any character that is neither alphanumeric (\w) nor whitespace (\s) and replaces it with the empty string (i.e. deletes it).

This solution takes advantage of the fact that the "special sequences" \w and \s are Unicode-aware for str patterns, i.e. they work equally well for characters of any script, not only ASCII:

>>> strip_punc("I said “naïve”, didn't I!")
'I said naïve didnt I'

Please note that \w includes the underscore (_), because it is considered "alphanumeric". If you want to strip it as well, change the pattern to:

r'[^\w\s]|_'
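As a quick illustration of the underscore variant (the function name strip_punc_and_underscore is made up for this sketch):

```python
import re

def strip_punc_and_underscore(s):
    # Delete anything that is not a word character or whitespace, plus "_" itself
    return re.sub(r'[^\w\s]|_', '', s)

print(strip_punc_and_underscore('snake_case, \u201cquoted\u201d text!'))
# snakecase quoted text
```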

Upvotes: 1

Adam

Reputation: 4172

I'd change the logic of the strip_punc function:

from string import ascii_letters

def strip_punc(word):
    return ''.join(c for c in word if c in ascii_letters)

This logic is an explicit allow list rather than an explicit deny list: you keep only the characters you know you want, instead of blocking only the ones you know you don't want, so edge cases you didn't think of are dropped as well.

Also see the related question "Best way to strip punctuation from a string in Python".
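One trade-off of an ASCII allow list is that it also discards digits and non-ASCII letters. The sketch below compares it with a looser allow list built on str.isalpha(); both function names are hypothetical:

```python
from string import ascii_letters

def strip_punc_ascii(word):
    # Keep only a-z and A-Z
    return ''.join(c for c in word if c in ascii_letters)

def strip_punc_alpha(word):
    # Keep any character Unicode classifies as a letter
    return ''.join(c for c in word if c.isalpha())

word = '\u201cna\u00efve\u201d'  # “naïve” with curly quotes
print(strip_punc_ascii(word))   # nave
print(strip_punc_alpha(word))   # naïve
```

Which one is preferable depends on whether the input text is expected to contain accented words.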

Upvotes: 0

Luis Miguel

Reputation: 5137

Without knowing what is in the stopwords list, the fastest fix is to append the offending token to it:

#Reads the stopwords into a list
stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
stopwords.append('“i')

Then continue with the rest of your code.

Upvotes: 0
