user1063287

Reputation: 10879

How to remove all special characters except spaces and dashes from a Python string?

I want to strip all special characters from a Python string, except dashes and spaces.

Is this correct?

import re
my_string = "Web's GReat thing-ok"
pattern = re.compile(r'[^A-Za-z0-9 -]')
new_string = pattern.sub('', my_string)
new_string
>> 'Webs GReat thing-ok'
# then make it lowercase and replace spaces with underscores
# new_string = new_string.lower().replace(" ", "_")
# new_string
# >> 'webs_great_thing-ok'

As shown, I ultimately want to replace the spaces with underscores after removing the other special characters, but figured I would do it in stages. Is there a Pythonic way to do it all in one fell swoop?

For context, I am using this input for MongoDB collection names, so I want the final string to contain only alphanumeric characters, dashes, and underscores.

Upvotes: 1

Views: 16607

Answers (2)

Burhan Khalid

Reputation: 174662

A one-liner, as requested:

>>> import re, unicodedata
>>> value = "Web's GReat thing-ok"
>>> re.sub(r'\s+', '_', re.sub(r'[^\w\s-]', '', unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')).strip().lower())
'webs_great_thing-ok'
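
The same steps can be spelled out one at a time, which is easier to read and test than the nested call. This is a sketch for Python 3; the function name `to_collection_name` is made up for illustration and is not part of any library.

```python
import re
import unicodedata


def to_collection_name(value):
    """Hypothetical helper: ASCII-fold, drop specials, lowercase, spaces -> underscores."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Remove everything except word characters, whitespace and dashes.
    cleaned = re.sub(r"[^\w\s-]", "", ascii_text).strip().lower()
    # Collapse each run of whitespace into a single underscore.
    return re.sub(r"\s+", "_", cleaned)


print(to_collection_name("Web's GReat thing-ok"))  # webs_great_thing-ok
```

Because `\w` includes the underscore, underscores already present in the input are kept, which matches the constraint stated in the question.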

Upvotes: 1

DeepSpace

Reputation: 81654

You are actually trying to "slugify" your string.

If you don't mind using a 3rd party (and a Python 2 specific) library you can use slugify (pip install slugify):

import slugify

string = "Web's GReat thing-ok"
print slugify.slugify(string)
>> webs-great-thing-ok

You can also implement it yourself. All of slugify's code is:

import re
import unicodedata

def slugify(string):
    # Python 2 only: `string` must be a unicode object.
    return re.sub(r'[-\s]+', '-',
                  unicode(
                      re.sub(r'[^\w\s-]', '',
                             unicodedata.normalize('NFKD', string)
                             .encode('ascii', 'ignore'))
                      .strip()
                      .lower()))

Note that this is Python 2 specific.
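
For Python 3, a rough adaptation might look like the sketch below. It is not the library's own code: the name `slugify_py3` is hypothetical, and the only real change is that the bytes returned by `.encode()` are decoded back to `str` in place of the `unicode()` call. Like the snippet above, it joins words with hyphens, so swap the separator if you need underscores.

```python
import re
import unicodedata


def slugify_py3(text):
    """Hypothetical Python 3 adaptation of the slugify snippet."""
    # Fold to ASCII: decompose, drop non-ASCII bytes, decode back to str.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Strip characters that are not word characters, whitespace or dashes.
    stripped = re.sub(r"[^\w\s-]", "", ascii_text).strip().lower()
    # Collapse runs of dashes and whitespace into a single hyphen.
    return re.sub(r"[-\s]+", "-", stripped)


print(slugify_py3("Web's GReat thing-ok"))  # webs-great-thing-ok
```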


Going back to your example, you can make it a one-liner. Whether it is Pythonic enough is up to you to decide. Keep the full range A-Za-z: the shorter A-z also matches [, \, ], ^, _ and `, which sit between Z and a in ASCII, so those characters would not be removed.

import re

my_string = "Web's GReat thing-ok"
new_string = re.sub('[^A-Za-z0-9 -]', '', my_string).lower().replace(" ", "_")
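
A small check shows how the ranges `A-z` and `A-Za-z` differ in practice. In ASCII, the characters `[`, `\`, `]`, `^`, `_` and `` ` `` sit between `Z` and `a`, so a class built on `A-z` lets them through. The sample string is made up for illustration.

```python
import re

sample = "snake_case [draft] ^v2`"

# A-z spans every code point from "A" to "z", punctuation included.
wide = re.sub(r"[^A-z0-9 -]", "", sample)
# A-Za-z covers only the 52 ASCII letters.
narrow = re.sub(r"[^A-Za-z0-9 -]", "", sample)

print(wide)    # snake_case [draft] ^v2`
print(narrow)  # snakecase draft v2
```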


UPDATE: There seems to be a more robust, Python 3 compatible "slugify" library here.

Upvotes: 4
