Regex- To handle null (when no characters are present between expressions)

Question

I have a regex situation.

My text looks like:

text='abcdWorld Bank'

I want to capture all the hyperlinks. The regex I have written is given below:

re.findall("(.+?)", text, re.DOTALL)

When I run this, it gives me this output:

['abcd

When I insert any character between the above expressions, I get the correct output.

From the above text, I need this output:

['World Bank']

How can I modify the regex to get the above output?

Avinash Raj · Accepted Answer

As the other answer mentions, don't use regex for parsing HTML files.
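A minimal sketch of the parser-based route, using the standard library's html.parser. The sample string and the LinkTextParser class are hypothetical, made up for illustration; they are not the asker's input or any library's ready-made helper.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the visible text of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished link texts, one entry per <a>
        self._parts = None   # text pieces of the <a> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.links.append("".join(self._parts))
            self._parts = None


# Hypothetical sample: an empty link followed by a non-empty one.
sample = '<a href="#"></a><a href="http://example.com">World Bank</a>'

parser = LinkTextParser()
parser.feed(sample)
parser.close()

link_texts = [t for t in parser.links if t]  # drop empty link bodies
print(link_texts)  # ['World Bank']
```

If a regex is still preferred for a quick one-off, a negated character class sidesteps the problem: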

>>> import re
>>> text='abcdWorld Bank'
>>> re.findall(r"(?s)([^<>]+)", text)
['World Bank']

[^<>]+ is a negated character class that matches any character except < or >, one or more times. So this captures World Bank only.

Let me explain why findall produces the undesired output.

(.+?)

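The first part of the pattern matches the entire opening anchor tag. (.+?) captures one or more characters non-greedily until the closing a tag is reached. Because + requires at least one character, an empty link body cannot satisfy it, and the capture runs past that link's own closing tag to the next one. So this would match all the characters abcd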
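The difference can be reproduced with a small sketch. The sample string and both patterns below are assumptions chosen for illustration, not the asker's exact input.

```python
import re

# Hypothetical sample: the first link has no text between its tags.
sample = '<a href="#"></a><a href="http://example.com">World Bank</a>'

# Assumed lazy pattern: (.+?) must consume at least one character, so for
# the empty link it runs past the first </a> and stops at the second one.
lazy = re.findall(r'<a href=".*?">(.+?)</a>', sample, re.DOTALL)
print(lazy)    # ['</a><a href="http://example.com">World Bank']

# Assumed negated-class pattern: [^<>]+ cannot cross a tag boundary, so the
# empty link fails to match and only the second link is captured.
strict = re.findall(r'<a href="[^<>]*">([^<>]+)</a>', sample)
print(strict)  # ['World Bank']
```

Inserting any character into the empty link gives (.+?) something to consume before the first closing tag, which is why the lazy pattern then appears to work.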
