Reputation: 1564
I have a string and I'm trying to remove every character that is neither alphanumeric nor in this set:
'''!$%*()_-=+\/.,><:;'"?|'''.
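I know the following removes all non-alphanumeric characters (except the underscore, which \W treats as a word character), but how can I also keep the characters in the set above?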
re.sub(r'\W+','',line)
Upvotes: 1
Views: 3683
Reputation: 2680
With credit to this thread: Remove specific characters from a string in python
First, there's no need to retype all the punctuation manually. The string module defines the constant string.punctuation for your convenience. (Use help(string) to see the other similar constants available.)
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Defining the undesired characters takes some fiddling, and a big downside is that this approach removes only the characters you explicitly list. If you're sure your file contains only ASCII characters, you could build the list of characters to delete as everything outside the allowed set (Python 2; string.ascii_letters is used here because string.letters is locale-dependent):
>>> delchars = ''.join(c for c in map(chr, range(256)) if c not in (string.punctuation + string.digits + string.ascii_letters))
You can then delete those characters with the Python 2 form of str.translate:
>>> text.translate(None, delchars)
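For Python 3, where str.translate no longer takes a deletechars argument and input may contain arbitrary Unicode, a rough sketch of the same idea is a translation table that deletes anything it does not know about. The names ALLOWED, KeepOnly, KEEP_TABLE and keep_allowed_translate are made up for this illustration, and the allowed set is assumed to be ASCII letters, digits and the punctuation from the question:

```python
import string

# Assumption: keep ASCII letters, digits, and the punctuation set from the question.
ALLOWED = string.ascii_letters + string.digits + r'''!$%*()_-=+\/.,><:;'"?|'''


class KeepOnly(dict):
    """Hypothetical translation table: unknown code points are deleted."""

    def __missing__(self, codepoint):
        # str.translate drops any character whose table entry is None.
        return None


# Map every allowed code point to itself; everything else falls into __missing__.
KEEP_TABLE = KeepOnly((ord(c), ord(c)) for c in ALLOWED)


def keep_allowed_translate(text):
    """Return text with every character outside ALLOWED removed."""
    return text.translate(KEEP_TABLE)
```

Because the table is keyed on code points rather than on a fixed range of 256 characters, this should also drop non-ASCII characters such as accented letters without having to enumerate them.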
EDIT: Here's some interesting timing information for the various methods: Stripping everything but alphanumeric chars from a string in Python
Upvotes: 4
Reputation: 22031
In Python 3.x, you can use the translate method of str to remove the characters you do not want:
>>> import string
>>> def remove(text, characters):
...     return text.translate(str.maketrans('', '', characters))
...
>>> remove(string.printable, string.ascii_letters + string.digits +
...        r'''!$%*()_-=+\/.,><:;'"?|''')
'#&@[]^`{}~ \t\n\r\x0b\x0c'
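Since the question started from re.sub, here is a hedged regex sketch of the opposite direction: instead of listing what to remove, negate a character class that lists what to keep. ALLOWED_PUNCT, DISALLOWED and strip_disallowed are illustrative names, and the class is assumed to cover ASCII letters and digits only (unlike \w, which also matches non-ASCII word characters on Python 3 str):

```python
import re

# Assumption: the punctuation set from the question, written as a raw string
# so the backslash is taken literally.
ALLOWED_PUNCT = r'''!$%*()_-=+\/.,><:;'"?|'''

# re.escape protects characters such as '-' and '\' that are special inside [...].
DISALLOWED = re.compile('[^a-zA-Z0-9' + re.escape(ALLOWED_PUNCT) + ']+')


def strip_disallowed(line):
    """Remove every run of characters that is not in the allowed class."""
    return DISALLOWED.sub('', line)
```

For example, strip_disallowed("a#b c@d-e_f[1]") should give 'abcd-e_f1': the '#', space, '@' and square brackets are dropped, while '-' and '_' survive because they are in the allowed set.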
Upvotes: 1
Reputation: 602115
A Python 2.x non-regex solution:
import string

punctuation = '''!$%*()_-=+\/.,><:;'"?|'''
allowed = string.digits + string.letters + punctuation
filter(allowed.__contains__, s)  # in Python 2, filter() on a str returns a str
The string to filter is s. (This probably isn't the fastest solution for long strings.)
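On Python 3 the snippet above needs two adjustments: string.letters no longer exists (string.ascii_letters is the closest equivalent), and filter() returns a lazy iterator rather than a string. A minimal sketch under those assumptions, with PUNCTUATION, ALLOWED and keep_allowed_filter as illustrative names:

```python
import string

PUNCTUATION = r'''!$%*()_-=+\/.,><:;'"?|'''

# A frozenset keeps each membership test constant-time, which matters for long strings.
ALLOWED = frozenset(string.ascii_letters + string.digits + PUNCTUATION)


def keep_allowed_filter(s):
    """Keep only the characters of s that appear in ALLOWED."""
    # filter() is lazy in Python 3, so join the surviving characters back into a str.
    return ''.join(filter(ALLOWED.__contains__, s))
```

A generator expression such as ''.join(c for c in s if c in ALLOWED) does the same job and may read more naturally to some.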
Upvotes: 7