Reputation: 13
I seem to be having a bit of an issue stripping punctuation from a string in Python. Here, I'm given a text file (specifically a book from Project Gutenberg) and a list of stopwords. I want to return a dictionary of the 10 most commonly used words. Unfortunately, I keep getting one hiccup in my returned dictionary.
import sys
import collections
from string import punctuation
import operator

# should return a string without punctuation
def strip_punc(s):
    return ''.join(c for c in s if c not in punctuation)

def word_cloud(infile, stopwordsfile):
    wordcount = {}

    # reads the stopwords into a list
    stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]

    # reads data from the text file into a list
    lines = []
    with open(infile) as f:
        lines = f.readlines()
        lines = [line.split() for line in lines]

    # does the wordcount
    for line in lines:
        for word in line:
            word = strip_punc(word).lower()
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1

    # sorts the dictionary, grabs 10 most common words
    output = dict(sorted(wordcount.items(),
                         key=operator.itemgetter(1), reverse=True)[:10])
    print(output)

if __name__ == '__main__':
    try:
        word_cloud(sys.argv[1], sys.argv[2])
    except Exception as e:
        print('An exception has occurred:')
        print(e)
        print('Try running as python3 word_cloud.py <input-text> <stopwords>')
This will print out
{'said': 659, 'mr': 606, 'one': 418, '“i': 416, 'lorry': 322, 'upon': 288, 'will': 276, 'defarge': 268, 'man': 264, 'little': 263}
The '“i' entry shouldn't be there. I don't understand why the opening quotation mark isn't removed by my helper function.
Thanks in advance.
Upvotes: 1
Views: 1229
Reputation: 2067
The character “ is not ". string.punctuation only includes the following ASCII characters:
In [1]: import string
In [2]: string.punctuation
Out[2]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
so you will need to augment the list of characters you are stripping.
Something like the following should accomplish what you need:
extended_punc = punctuation + '“'  # and any other characters you need to strip

def strip_punc(s):
    return ''.join(c for c in s if c not in extended_punc)
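If listing every extra character by hand gets tedious, one possible variation is to ask the standard-library unicodedata module for each character's general category and drop anything whose category starts with "P" (punctuation). This is only a sketch of that idea, not part of the original answer; the ASCII set from string.punctuation is still checked because symbols such as $ and + are classified as symbols rather than punctuation.

```python
import unicodedata
from string import punctuation

def strip_punc(s):
    # Drop ASCII punctuation plus any character whose Unicode general
    # category starts with "P" (Pi/Pf cover curly quotes, Pd dashes, ...).
    return ''.join(
        c for c in s
        if c not in punctuation
        and not unicodedata.category(c).startswith('P')
    )

print('“' in punctuation)         # False: the curly quote is not ASCII
print(unicodedata.category('“'))  # Pi (initial quote punctuation)
print(strip_punc('“I'))           # I
```

With this version the token '“i' from the question would be reduced to 'i' before the stopword check.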
Alternatively, you could use the package unidecode to ASCII-fy your text and not worry about creating a list of Unicode characters you may need to handle:
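from unidecode import unidecode

def strip_punc(s):
    # In Python 3, str is already Unicode, so no decode/encode step is needed.
    # unidecode transliterates non-ASCII characters to their closest ASCII form.
    s = unidecode(s)
    return ''.join(c for c in s if c not in punctuation)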
Upvotes: 5
Reputation: 5828
As stated in other answers, the problem is that string.punctuation only contains ASCII characters, so the typographical ("fancy") quotes like “ are missing, among many others.
You could replace your strip_punc function with the following:
import re

def strip_punc(s):
    '''
    Remove all punctuation characters.
    '''
    return re.sub(r'[^\w\s]', '', s)
This approach uses the re module.
The regular expression works as follows: it matches any character that is neither alphanumeric (\w) nor whitespace (\s) and replaces it with the empty string (i.e. deletes it).
This solution takes advantage of the fact that the "special sequences" \w and \s are Unicode-aware, i.e. they work equally well for characters of any script, not only ASCII:
>>> strip_punc("I said “naïve”, didn't I!")
'I said naïve didnt I'
Please note that \w includes the underscore (_), because it is considered "alphanumeric".
If you want to strip it as well, change the pattern to:
r'[^\w\s]|_'
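As a quick illustration of the underscore variant, here is a minimal sketch; the names PUNC_OR_UNDERSCORE and strip_punc_and_underscore are made up for this example and are not part of the original function.

```python
import re

# Anything that is not a word character or whitespace, plus the underscore.
PUNC_OR_UNDERSCORE = re.compile(r'[^\w\s]|_')

def strip_punc_and_underscore(s):
    # Delete every match of the pattern from the string.
    return PUNC_OR_UNDERSCORE.sub('', s)

print(strip_punc_and_underscore('“snake_case”, he said!'))  # snakecase he said
```

Compiling the pattern once is optional; it just avoids re-parsing the expression for every word in the book.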
Upvotes: 1
Reputation: 4172
I'd change the logic of the strip_punc function:
from string import ascii_letters

def strip_punc(word):
    return ''.join(c for c in word if c in ascii_letters)
This is an explicit allow-list rather than an explicit deny-list: you keep only the characters you want instead of blocking the ones you know you don't want, so edge cases you didn't think about are excluded automatically. Note that it also drops digits and non-ASCII letters.
See also the related question: Best way to strip punctuation from a string in Python.
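To make the trade-off of the allow-list concrete, here is a small sketch comparing the ascii_letters version with a looser allow-list based on str.isalpha(). The function names strip_punc_ascii and strip_punc_alpha are hypothetical and only used for this comparison; which behaviour is preferable depends on whether the text contains accented or non-Latin words.

```python
from string import ascii_letters

def strip_punc_ascii(word):
    # Allow-list of ASCII letters only: accented letters are dropped too.
    return ''.join(c for c in word if c in ascii_letters)

def strip_punc_alpha(word):
    # Looser allow-list: str.isalpha() is True for letters of any script.
    return ''.join(c for c in word if c.isalpha())

print(strip_punc_ascii('“naïve”'))  # nave
print(strip_punc_alpha('“naïve”'))  # naïve
```

Both versions remove the curly quotes, but only the second keeps the ï.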
Upvotes: 0
Reputation: 5137
Without knowing what is in the stopwords list, the quickest fix is to add the offending token to it:
#Reads the stopwords into a list
stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
stopwords.append('“i')
Then continue with the rest of your code.
Upvotes: 0