Filipe Aleixo
Filipe Aleixo

Reputation: 4242

Regex Python - Keep only ASCII and copyright symbol

I have the following function to keep only ASCII characters:

import re

def remove_special_chars(txt):
    txt = re.sub(r'[^\x00-\x7F]+', '', txt)
    return txt

But now I also want to keep the copyright symbol (©). What should I add to the pattern?

Upvotes: 3

Views: 772

Answers (2)

J...S
J...S

Reputation: 5207

From Python3.8 onwards Unicode character name may be used in regular expressions via \N{name} escapes.

So copyright symbol may be used like r'\N{COPYRIGHT SIGN}'.

In your case,

import re
s = "Python® Copyright ©2001-2019"
re.sub(r'[^\x00-\x7F\N{COPYRIGHT SIGN}]+', '', s)

This gives

'Python Copyright ©2001-2019'

As you can see, the '®' was removed.

Upvotes: 0

esqew
esqew

Reputation: 44714

Add the copyright symbol's hex \xA9 (source) to your match group:

txt = re.sub(r'[^\x00-\x7F\xA9]+', '', txt)

Regex101

Upvotes: 7

Related Questions