Reputation: 4710
In my QTextBrowser I detect links like "www.test.com" with
re.compile( r"(\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])" )
When further actions occur on the QTextBrowser, the text is retrieved again with text.toHtml()
and then parsed again. This leads to nested (cascaded) hyperlinks.
So, before parsing again, I want the hyperlink HTML removed. For example, the text looks like
<a href="www.test.com">www.test.com</a>
after first parsing and should look like
www.test.com
before the second parsing, to prevent cascading.
How do I remove
<a href="SOMETHING"> and </a>
with a regex?
Other HTML tags, such as bold or italic, should not be removed.
EDIT
I've heard that HTML should not be parsed with regex, but I think it should be possible here, and I don't want additional dependencies in my program.
Upvotes: 0
Views: 1803
Reputation: 70732
I would consider using BeautifulSoup for this task.
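>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')  # html holds the text.toHtml() string
>>> for m in soup.find_all('a'):
...     m.unwrap()  # drop the <a> tag, keep its contents (replaces replaceWithChildren)
...
>>> print(soup)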
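If an extra dependency really is out of the question, a regex can handle this narrow case. The following is only a minimal sketch: the names ANCHOR_TAG and strip_anchor_tags are made up for illustration, and it assumes no attribute value contains a literal ">" character. It removes only the opening and closing anchor tags and leaves everything else, including the link text, in place.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag.
# The whitespace requirement after "a" keeps tags such as <abbr> or <b> intact.
# Assumption: no attribute value contains a literal ">".
ANCHOR_TAG = re.compile(r"</?a(?:\s[^>]*)?>", re.IGNORECASE)


def strip_anchor_tags(html):
    """Return html with <a ...> and </a> removed, keeping the link text."""
    return ANCHOR_TAG.sub("", html)


html = '<b>See</b> <a href="www.test.com">www.test.com</a> <i>now</i>'
print(strip_anchor_tags(html))  # <b>See</b> www.test.com <i>now</i>
```

Run this on the string returned by text.toHtml() before applying the link-detection regex again. Note that Qt may place extra markup inside the anchor (for example a styled span); this sketch leaves that untouched.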
Upvotes: 2