Sam

Reputation: 2605

Regex: handling an empty match (when no characters are present between expressions)

I have a regex situation.

My text looks like :

text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

I want to capture all the hyperlinks. The regex I have written is:

re.findall("<a href=.+?>(.+?)</a>", text, re.DOTALL)

When I run this, it gives me this output:

['</a></div>abcd<i><a href=">World Bank']

This happens because there are no characters between the opening and closing tags in

<a href="></a>

When I insert any character between those tags, I get the correct output.

From the above text, the output I need is:

['World Bank']

How can I modify the regex to get the above output?

Upvotes: 1

Views: 241

Answers (2)

Avinash Raj

Reputation: 174776

As mentioned by the other answerer, don't use regex for parsing HTML files. If you still want a regex, this one works for your input:

>>> import re
>>> text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
>>> re.findall(r"(?s)<a href=.+?>([^<>]+)</a>", text)
['World Bank']

[^<>]+ is a negated character class that matches any character except < or >, one or more times, so the group captures only World Bank.

Let me explain why findall produces the undesired output.

<a href=.+?>(.+?)</a> 

<a href=.+?> matches the opening anchor tag. (.+?)</a> then captures one or more characters non-greedily until a closing </a> is reached. Because the group must contain at least one character, it cannot match the empty content of the first anchor, so it consumes </a></div>abcd<i><a href=">World Bank and stops only at the second </a>. If you use (.*?) instead, you get two results: an empty string and World Bank.
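To make the difference concrete, here is a small sketch that runs both lazy quantifiers on the text from the question. The variable names (lazy_plus, lazy_star, non_empty) are only illustrative, and filtering out the empty strings afterwards is one possible workaround, not the only one.

```python
import re

text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

# (.+?) needs at least one character, so it cannot match the empty
# first anchor and runs on until the second </a>.
lazy_plus = re.findall(r"<a href=.+?>(.+?)</a>", text, re.DOTALL)

# (.*?) may match zero characters, so each anchor is matched separately.
lazy_star = re.findall(r"<a href=.+?>(.*?)</a>", text, re.DOTALL)

# Drop the empty captures to keep only links that have text.
non_empty = [m for m in lazy_star if m]

print(lazy_plus)
print(lazy_star)
print(non_empty)
```

With (.*?) the empty anchor shows up as an empty string in the result, which is why the last step filters it out.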

Upvotes: 0

alecxe

Reputation: 474001

Why not use an HTML parser instead?

Example using BeautifulSoup:

In [1]: from bs4 import BeautifulSoup

In [2]: text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
In [3]: soup = BeautifulSoup(text, "html.parser")

In [4]: [a.get_text() for a in soup.find_all("a")]
Out[4]: [u'World Bank']

Upvotes: 3
