Reputation: 1839
I am reading a thousand-line Italian text and building a dictionary of unique words. I have tried two ways of removing the punctuation. Using string.punctuation:
    for p in string.punctuation:
        word = word.replace(p, str())

or:

    for line in f:
        for word in line.split():
            stripped_text = ""
            for char in word:
                if char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~>><<<<?>>?123456789':
                    char = ''
                stripped_text += char
My problem is that the resulting dictionary still contains punctuation:
{'<<Dicerolti': 1,'piage>>.': 1,'succia?>>.': 1,…}
Any ideas, please?
Upvotes: 1
Views: 1657
Reputation: 28036
You could use the re module for this, with a little printf-style trick to build a regex character class that flags any punctuation for replacement.
    import re
    import string

    a = '>>some_crazy_string..!'
    # re.escape keeps ']', '\' and '^' literal inside the [...] character class
    print(re.sub('[%s]' % re.escape(string.punctuation), '', a))
prints out
somecrazystring
I've used this trick a couple of times for 'anonymizing' log files.
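Applied to the word-count dictionary from the question, the same idea might look like the sketch below. The function name count_words and the sample lines are made up for illustration. Note that string.punctuation only covers ASCII, so if the file contains typographic marks such as « », those would have to be added to the character class as well (that part is an assumption about the input, not something shown in the question).

```python
import re
import string
from collections import Counter

# Character class matching any single ASCII punctuation mark.
# re.escape keeps ']', '\' and '^' literal inside the class.
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))


def count_words(lines):
    """Count words after stripping ASCII punctuation (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = PUNCT_RE.sub('', word)
            if cleaned:  # skip tokens that were nothing but punctuation
                counts[cleaned] += 1
    return counts


# Made-up sample lines; in practice pass the open file object instead.
sample = ['<<Dicerolti molto breve>>.', 'che succia?>>. molto']
print(dict(count_words(sample)))
```

The important detail is that the cleaned string, not the original word, is what gets used as the dictionary key.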
Upvotes: 1