jamiet

Reputation: 12254

regex to replace everything except lowercase letters, numeric characters, underscores, and dashes

I have this function, which is intended to take a string as input and replace anything that isn't a letter, numeric digit, underscore or dash:

import re

def clean_label_value(label_value):
    """
    GCP Label values have to follow strict guidelines
        Keys and values can only contain lowercase letters, numeric characters, underscores,
        and dashes. International characters are allowed.
    https://cloud.google.com/compute/docs/labeling-resources#restrictions
    :param label_value: label value that needs to be cleaned up
    :return: cleaned label value
    """
    full_pattern = re.compile('[^a-zA-Z0-9]')
    return re.sub(full_pattern, '_', label_value).lower()

I have this unit test, which succeeds:

def test_clean_label_value(self):
    self.assertEqual(clean_label_value('XYZ_@:.;\\/,'), 'xyz________')

However, it's replacing dashes, which I don't want it to. To demonstrate:

def clean_label_value(label_value):
    full_pattern = re.compile('[^a-zA-Z0-9]|-')
    return re.sub(full_pattern, '_', label_value).lower()

But this test:

def test_clean_label_value(self):
    self.assertEqual(clean_label_value('XYZ-'), 'xyz-')

then failed with:

xyz- != xyz_

Expected :xyz_
Actual :xyz-

In other words, the - is getting replaced with a _. I don't want that to happen. I've fiddled around with the regex, trying all sorts of different combinations, but I can't figure the darned thing out. Anyone?

Upvotes: 0

Views: 339

Answers (1)

Håken Lid

Reputation: 23064

Put a single - at the very beginning or end of the set (character class). Then it doesn't create a character range, but represents the literal - character itself.

re.compile('[^-a-zA-Z0-9]')
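
Dropped into the function from the question, the leading-dash form might look like the sketch below. The module-level name _INVALID_CHARS is made up for illustration, and the underscore is listed in the set only to make the intent explicit (replacing "_" with "_" would be a no-op anyway).

```python
import re

# Dash comes first inside the negated set, so it is a literal "-" and not a range operator.
# _INVALID_CHARS is a hypothetical name, not something from the original post.
_INVALID_CHARS = re.compile(r'[^-a-zA-Z0-9_]')


def clean_label_value(label_value):
    """Replace anything that is not a letter, digit, underscore or dash with "_", then lowercase."""
    return _INVALID_CHARS.sub('_', label_value).lower()


print(clean_label_value('XYZ-'))          # xyz-
print(clean_label_value('XYZ_@:.;\\/,'))  # xyz followed by 8 underscores
```

With this pattern both unit tests from the question should pass, since the dash is no longer part of what gets replaced.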

It's also possible to escape the - with a \, to indicate that it's a literal dash character and not a range operator inside a set.

re.compile(r'[^\-\w]')

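The special sequence \w ("w" for "word character") is equivalent to the set [a-zA-Z0-9_] for ASCII matching. For Python 3 str patterns it also matches Unicode letters and digits, unless the re.ASCII flag is set.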
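
Whether \w keeps non-ASCII letters depends on the re.ASCII flag, which may matter here because the quoted GCP guideline allows international characters. A small sketch, using a made-up label value, shows both behaviours:

```python
import re

# For str patterns in Python 3, \w is Unicode-aware by default.
unicode_pattern = re.compile(r'[^\-\w]')
# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_pattern = re.compile(r'[^\-\w]', re.ASCII)

label = 'Caf\u00e9-1'  # "Café-1", a hypothetical label value

print(unicode_pattern.sub('_', label).lower())  # café-1
print(ascii_pattern.sub('_', label).lower())    # caf_-1
```

Which of the two is appropriate depends on whether international characters should survive the cleanup.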

Upvotes: 5
