fred russell

Reputation: 347

Python regex using strange unicode characters

The following code does what I want to do:

one_sentence = lambda x: re.search(r'b|c|d', x)

As does the following:

if re.search(r'P' + chr(8868), 'aP' + chr(8868)):
    print(True)

But I cannot get the following to work:

if re.search(chr(8835)|chr(8868)|chr(8869), 'P' + chr(8868)):
    print(True)

I want the code to print True if any of chr(8835), chr(8868), or chr(8869) appears in the string.

Upvotes: 1

Views: 71

Answers (1)

Peter Gibson

Reputation: 19544

For the pipe character | to act as alternation in a regular expression, it needs to be part of the pattern string (as in your first example, re.search(r'b|c|d', x)). Here, however, you are using it as a Python operator between strings:

>>> re.search(chr(8835)|chr(8868)|chr(8869), 'P' + chr(8868))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for |: 'str' and 'str'

That is why you get an error: between Python values, | is the "bitwise or" operator, and it can't be applied to two strings. Instead, you need to build a pattern string that contains the pipes:

>>> re.search(chr(8835) + '|' + chr(8868) + '|' + chr(8869), 'P' + chr(8868))
<_sre.SRE_Match object; span=(1, 2), match='⊤'>
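If the list of characters grows, the same pattern can be built with '|'.join instead of repeated concatenation. This is a minimal sketch, not part of the original answer: the names SYMBOLS, PATTERN and contains_any are made up for illustration, and re.escape is applied so that any character with a special meaning in a regex would still be matched literally.

```python
import re

# Code points from the question: chr(8835), chr(8868), chr(8869)
SYMBOLS = [chr(8835), chr(8868), chr(8869)]

# Join the escaped characters with the regex alternation operator "|".
# (SYMBOLS, PATTERN and contains_any are hypothetical names.)
PATTERN = re.compile('|'.join(re.escape(s) for s in SYMBOLS))

def contains_any(text):
    """Return True if text contains at least one of SYMBOLS."""
    return PATTERN.search(text) is not None

if contains_any('P' + chr(8868)):
    print(True)
```

Compiling the pattern once is only a convenience here; calling re.search with the joined string each time would behave the same way.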

Or, if you prefer, you can write the hex code points of the Unicode characters straight into the string using the \uXXXX escape syntax and include the pipes directly:

>>> hex(8835)
'0x2283'
>>> hex(8868)
'0x22a4'
>>> hex(8869)
'0x22a5'
>>> 
>>> '\u2283|\u22a4|\u22a5'
'⊃|⊤|⊥'
>>> re.search('\u2283|\u22a4|\u22a5', 'P\u22a4')
<_sre.SRE_Match object; span=(1, 2), match='⊤'>
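Since every alternative here is a single character, a character class is another way to write the same test. The following is only a sketch under that assumption; SYMBOL_CLASS and has_symbol are hypothetical names, not something from the question.

```python
import re

# A character class matches any one of the listed characters:
# \u2283, \u22a4 or \u22a5 (the same three code points as above).
SYMBOL_CLASS = re.compile('[\u2283\u22a4\u22a5]')

def has_symbol(text):
    """Return True if text contains any of the three symbols."""
    return SYMBOL_CLASS.search(text) is not None

print(has_symbol('P\u22a4'))  # True
print(has_symbol('P'))        # False
```

For a plain "is any of these characters present" check, any(c in text for c in '\u2283\u22a4\u22a5') would also work without a regex.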

Upvotes: 1
