user2366975

Reputation: 4710

Removing HTML hyperlink anchors from text with regex (in Python, PyQt4)

In my QTextBrowser I detect links like "www.test.com" with

re.compile(   r"(\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])"   )

When further actions on the QTextBrowser occur, the text is received again with text.toHtml() and then parsed again. This leads to cascaded hyperlinks.

So, before parsing again, I want the hyperlink HTML to be removed. For example, the text looks like

<a href="www.test.com">www.test.com</a> 

after first parsing and should look like

www.test.com

before the second parsing, to prevent cascading.

How do I remove

<a href="SOMETHING"> and </a>

with a regex?

Other HTML tags, such as bold or italic, should not be removed.

EDIT

I've heard the advice against parsing HTML with regex, but I think it should be possible in this case, and I don't want to add further dependencies to my program.

Upvotes: 0

Views: 1803

Answers (1)

hwnd

Reputation: 70732

I would consider using BeautifulSoup for this task.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
>>> for a in soup.find_all('a'):
...     a.unwrap()  # drop the <a> tag but keep its contents
...
>>> print(soup)
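
If an extra dependency is not acceptable, as the question's edit says, the following is a minimal regex sketch using only the standard library. The names `ANCHOR_TAG` and `strip_anchors` are made up for illustration. It assumes the anchor tags are simple, with no `>` character inside an attribute value, and it is not a general HTML parser.

```python
import re

# Matches an opening <a ...> tag (with any attributes) or a closing </a> tag.
# The \b after "a" keeps tags such as <abbr> or <b> from matching.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_anchors(html):
    """Remove <a ...> and </a> tags, keeping the text between them."""
    return ANCHOR_TAG.sub("", html)


sample = '<b>See</b> <a href="www.test.com">www.test.com</a> <i>now</i>'
print(strip_anchors(sample))  # <b>See</b> www.test.com <i>now</i>
```

Only the anchor tags themselves are removed. Any styling markup that toHtml() may place inside the link, as well as bold and italic tags, is left untouched.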

Upvotes: 2
