Removing HTML hyperlink anchors from text with regex (in Python, PyQt4)

Question

In my QTextBrowser I detect links like "www.test.com" with

re.compile(r"(\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])")

When further actions occur on the QTextBrowser, the text is retrieved again with text.toHtml() and then parsed again. Because the retrieved HTML already contains the links from the previous pass, this leads to cascaded hyperlinks.

So I want to remove the hyperlink HTML before parsing again. For example, the text looks like

<a href="...">www.test.com</a>

after first parsing and should look like

www.test.com

before the second parsing, to prevent cascading.

How do I remove the opening <a href="..."> tag and the closing </a> tag with a regex?

Other HTML tags, such as bold or italic, should not be removed.

EDIT

I've heard about not parsing HTML with regex, but I think here it should be possible and I don't want further dependencies in my program.

hwnd · Accepted Answer

I would consider using BeautifulSoup for this task.

>>> from bs4 import BeautifulSoup
>>> # html holds the markup returned by toHtml()
>>> soup = BeautifulSoup(html, 'html.parser')
>>> for m in soup.find_all('a'):
...     m.unwrap()  # drop the <a> tag but keep its contents
...
>>> print(soup)
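
If an extra dependency really is not an option, as the question's edit says, a regex that targets only anchor tags might be enough for this narrow case. The following is a minimal sketch, not a general HTML parser: the names ANCHOR_TAG and strip_anchors and the sample string are made up for illustration, and it assumes that no attribute value inside an anchor tag contains a literal ">" character.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag, case-insensitively.
# The character after "a" must be whitespace or ">", so tags such as
# <abbr> or <address> are left alone.
# Assumption: attribute values contain no literal ">" character.
ANCHOR_TAG = re.compile(r"</?a(?:\s[^>]*)?>", re.IGNORECASE)


def strip_anchors(html):
    """Remove anchor tags but keep their inner text and all other tags."""
    return ANCHOR_TAG.sub("", html)


# Hypothetical sample input, not taken from the question.
sample = '<p><b>Visit</b> <a href="http://www.test.com">www.test.com</a></p>'
print(strip_anchors(sample))
```

Bold, italic, and other tags pass through untouched because the pattern only matches tags whose name is exactly "a". Note that toHtml() may also wrap link text in a styled span, which this sketch leaves in place.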
