user1478335

Reputation: 1839

Python remove punctuation from text

I am reading a thousand-line Italian text and creating a dictionary of unique words. I have tried two methods of removing the punctuation: using string.punctuation

for p in string.punctuation:
    word = word.replace(p, '')

or:

for line in f:
    for word in line.split():
        stripped_text =""
        for char in word:
            if char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~>><<<<?>>?123456789':
                char = ''
                stripped_text += char

My problem is that the resulting dictionary still contains punctuation:

{'<<Dicerolti': 1,'piage>>.': 1,'succia?>>.': 1,…}

Any ideas, please?

Upvotes: 1

Views: 1657

Answers (1)

synthesizerpatel

Reputation: 28036

You could use the re module for this, with a little printf-style trick to build a regex character class that matches any punctuation character for replacement.

import string
import re

a = '>>some_crazy_string..!'
# re.escape keeps characters such as ], \ and ^ literal inside the character class
print(re.sub('[%s]' % re.escape(string.punctuation), '', a))

prints out

somecrazystring

I've used this trick a couple of times for 'anonymizing' log files.
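Applied to the word-counting problem in the question, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the names count_words, STRIP_CHARS and STRIP_RE are made up here, digits are stripped because the question's second attempt lists them, and the guillemets « » are included on the guess that the Italian text may use them instead of doubled < and >.

```python
import re
import string
from collections import Counter

# Characters to drop: ASCII punctuation plus digits. The guillemets are an
# assumption, in case the source text uses them for quotations.
STRIP_CHARS = string.punctuation + string.digits + "«»"

# re.escape keeps every character literal inside the character class.
STRIP_RE = re.compile("[%s]" % re.escape(STRIP_CHARS))


def count_words(lines):
    """Count words after removing punctuation and digits from each one."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = STRIP_RE.sub("", word)
            if cleaned:  # skip tokens that consisted only of punctuation
                counts[cleaned] += 1
    return counts


# Hypothetical sample lines built from the words shown in the question.
sample = ["<<Dicerolti molto breve>>.", "succia?>>. breve"]
word_counts = count_words(sample)
print(dict(word_counts))
```

The cleaned word, not the original one, is what gets used as the dictionary key; storing the uncleaned word is one way punctuation can survive in the result. To count a file instead of the sample list, an open file object can be passed as lines.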

Upvotes: 1
