Reputation: 2605
I have a regex situation.
My text looks like this:
text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
I want to capture all the hyperlinks. The regex I have written is given below:
re.findall("<a href=.+?>(.+?)</a>", text, re.DOTALL)
When I run this, it gives me this output:
['</a></div>abcd<i><a href=">World Bank']
The above output occurs because there is no character between
<a href="></a>
When I insert any character between the two tags, I get the correct output.
From the above text I need an output that is
['World Bank']
How can I modify the regex to get the above output?
Upvotes: 1
Views: 241
Reputation: 174776
As mentioned by the other answerer, don't use regex for parsing HTML files. If you still need a regex, this one works:
>>> import re
>>> text='abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
>>> re.findall(r"(?s)<a href=.+?>([^<>]+)</a>", text)
['World Bank']
[^<>]+ is a negated character class that matches any character except < or >, one or more times. So this captures World Bank only.
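To see how the negated class behaves when there is more than one link, here is a small sketch. The string sample and its href values are made up for illustration; the pattern is the one shown above.

```python
import re

# Hypothetical input: one empty anchor followed by two anchors with text.
sample = '<a href="x"></a><a href="y">World Bank</a><a href="z">IMF</a>'

# [^<>]+ cannot cross a tag boundary, so each capture is plain link text.
links = re.findall(r"(?s)<a href=.+?>([^<>]+)</a>", sample)
print(links)  # ['World Bank', 'IMF']
```

Note that the empty anchor is skipped rather than reported, because [^<>]+ requires at least one character.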
Let me explain why findall produces the undesired output.
<a href=.+?>(.+?)</a>
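<a href=.+?> matches the whole opening anchor tag.
(.+?)</a> captures one or more characters non-greedily until a closing a tag is reached. Because .+? needs at least one character, it cannot match the empty content of the first anchor, so it consumes the < of the first </a> and keeps going. As a result it matches all the characters </a></div>abcd<i><a href=">World Bank until the next </a>. If you use (.*?) instead, you get two outputs: an empty string and World Bank.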
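The difference between the two quantifiers can be checked side by side on the text from the question. This is only a sketch: the variable names are arbitrary, and the final step that drops empty strings is an extra assumption about what you want, not part of the patterns discussed above.

```python
import re

text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'

# .+? needs at least one character, so it runs past the empty anchor.
plus = re.findall(r"<a href=.+?>(.+?)</a>", text, re.DOTALL)

# .*? may match nothing, so the empty anchor yields an empty string.
star = re.findall(r"<a href=.+?>(.*?)</a>", text, re.DOTALL)

# Assumption: empty link texts are unwanted, so filter them out.
non_empty = [s for s in star if s]

print(plus)       # ['</a></div>abcd<i><a href=">World Bank']
print(star)       # ['', 'World Bank']
print(non_empty)  # ['World Bank']
```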
Upvotes: 0
Reputation: 474001
Why not use an HTML parser instead?
Example using BeautifulSoup:
In [1]: from bs4 import BeautifulSoup
In [2]: text = 'abcd<a href="></a></div>abcd<i><a href=">World Bank</a>'
In [3]: soup = BeautifulSoup(text, "html.parser")
In [4]: [a.get_text() for a in soup.find_all("a")]
Out[4]: [u'World Bank']
Upvotes: 3