user1992989

Reputation: 85

Correction in Regex for unicode

I need help with a regex that is not producing the desired results. Below is my code:

import re
text='<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'
regex=re.compile(r'[<u+\w+]+>')
txt=regex.findall(text)
print(txt)

Output

['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', 'loved<u+2764>', '<u+fe0f>', 'spot<u+26a1>']

I know the regex is not correct. I want the output to be:

 '<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>'

Upvotes: 0

Views: 49

Answers (2)

dopstar

Reputation: 1488

import re

regex = re.compile(r'<u\+[0-9a-f]+>')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'

print(regex.findall(text))

# output:
['<u+0001f48e>', '<u+0001f6e0>', '<u+fe0f>', '<u+2764>', '<u+fe0f>', '<u+26a1>']

That is not exactly what you want, but it's almost there.

Now, to achieve what you are looking for, we wrap the tag pattern in a non-capturing group and repeat it with +, so that consecutive tags are matched as a single result:

import re

regex = re.compile(r'((?:<u\+[0-9a-f]+>)+)')
text = '<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved<u+2764><u+fe0f>one on the spot<u+26a1>'

print(regex.findall(text))

# output:
['<u+0001f48e>', '<u+0001f6e0><u+fe0f>', '<u+2764><u+fe0f>', '<u+26a1>']
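
For reference, here is a small sketch of why the character class in the question over-matches. Inside [...], the characters <, u, + and \w are alternatives rather than a sequence, so any run of word characters that ends in > qualifies. The names loose and strict are only illustrative, and the strict pattern drops the redundant outer capturing group.

```python
import re

text = ('<u+0001f48e> repairs <u+0001f6e0><u+fe0f>your loved'
        '<u+2764><u+fe0f>one on the spot<u+26a1>')

# Character class: '<', 'u', '+' and word characters are interchangeable,
# so ordinary words directly in front of a tag are swallowed too.
loose = re.compile(r'[<u+\w+]+>')

# Literal '<u+' prefix, hex digits, then '>'. The non-capturing group is
# repeated so that adjacent tags come back as a single match.
strict = re.compile(r'(?:<u\+[0-9a-f]+>)+')

loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(loose_matches)
print(strict_matches)
```

The first list reproduces the output from the question (including 'loved<u+2764>'), while the second contains only the tags, with neighbouring tags joined.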

Upvotes: 1

sophros

Reputation: 16660

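Why not add an optional match for the second tag? Note that the + has to be escaped to match a literal plus sign, and the optional part should be a non-capturing group so that findall returns the whole match:

regex=re.compile(r'<u\+\w+>(?:<u\+fe0f>)?')

This works for your example.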
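
One thing to watch with an optional-suffix pattern: re.findall returns the contents of the capturing groups when the pattern has any, rather than the full match. A minimal sketch of the difference, using a shortened sample string that is an assumption and not the text from the question:

```python
import re

# Hypothetical shortened input, for illustration only.
sample = '<u+0001f6e0><u+fe0f> and <u+26a1>'

# Capturing group: findall returns only what the group captured,
# and an empty string where the optional group did not take part.
capturing = re.findall(r'<u\+\w+>(<u\+fe0f>)?', sample)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'<u\+\w+>(?:<u\+fe0f>)?', sample)

print(capturing)
print(non_capturing)
```

With the non-capturing form, each result is the full tag, plus the trailing <u+fe0f> when one follows.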

Upvotes: 0
