Reputation: 1863

How to use regex to remove string within certain HTML tag and string must contain empty space

I am trying to clean some HTML data with regular expressions in Python. Given an input string with HTML tags, I want to remove a tag together with its content if the content contains a space. The requirement looks like this:

inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = regexProcess(inputString)
print outputString

>>I want to remove not sole <code>word</code>

The regex re.sub("<code>.+?</code>", " ", inputString) removes every <code> element, whether or not its content contains a space. How can I improve it, or is there another method?

Thanks in advance.

Upvotes: 1

Answers (3)

Jean-François Fabre

Reputation: 140256

It is a bad idea to parse HTML with regex. However, if your HTML is simple enough, you could do this:

re.sub(r"<code>[^<]*\s[^<]*</code>", " ", inputString)

We're looking for at least one whitespace character somewhere in the content. To make it work with several code tags on the same line, I've excluded the < character from the content (it cannot appear in the text itself, since an escaped < is written as &lt;).

OK, it's still a hack; a proper HTML parser is preferred.

Small test:

inputString = "<code>hello </code>  <code>world</code> <code>hello world</code> <code>helloworld</code>"

I get:

  <code>world</code>   <code>helloworld</code>
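For anyone who wants to run it, here is a minimal sketch of the test above as a complete script, also applied to the string from the question. The names SPACED_CODE, samples and results are just illustrative, not part of the original answer.

```python
import re

# Pattern from the answer above: a <code> element whose content contains
# no "<" and at least one whitespace character.
SPACED_CODE = re.compile(r"<code>[^<]*\s[^<]*</code>")

samples = [
    "I want to remove <code>tag with space</code> not sole <code>word</code>",
    "<code>hello </code>  <code>world</code> <code>hello world</code> <code>helloworld</code>",
]

# Each matching element is replaced by a single space.
results = [SPACED_CODE.sub(" ", s) for s in samples]
for r in results:
    print(repr(r))  # repr() makes the leftover spaces visible
```

Because the content may not contain <, a code tag with nested markup inside it is left untouched even if it contains spaces, which is one reason this remains a hack.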

Upvotes: 1

Wiktor Stribiżew

Reputation: 627103

Using regex with HTML is fraught with issues, so you should be aware of the possible consequences. Your <code>.+?</code> regex will only work if the <code> and </code> tags are on one line and there are no nested <code> tags inside them.

Assuming there are no nested code tags, you might extend your current approach:

import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)

The re.S flag lets . match line breaks, and the lambda checks each match: any code tag that contains a space character in its content is replaced with a single space, otherwise it is kept unchanged.
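The same idea can be wrapped in a reusable function. This is only a sketch with two deliberate changes that are my assumptions about what is wanted: it tests for any whitespace with \s instead of a literal " ", and it uses .*? so that an empty <code></code> cannot swallow the element that follows it. The names CODE_RE and remove_spaced_code are hypothetical.

```python
import re

# re.S lets "." match newlines; ".*?" (not ".+?") keeps an empty
# <code></code> from matching across into the next element.
CODE_RE = re.compile(r"<code>(.*?)</code>", flags=re.S)


def remove_spaced_code(html):
    """Replace each <code> element whose content has whitespace with one space."""
    def replace(match):
        # \s also catches tabs and newlines, unlike a literal " " check.
        if re.search(r"\s", match.group(1)):
            return " "
        return match.group()  # keep the element unchanged

    return CODE_RE.sub(replace, html)


print(remove_spaced_code(
    "I want to remove <code>tag with space</code> not sole <code>word</code>"
))
```

Like the one-liner, this still assumes there are no nested code tags.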


A more common way to parse HTML in Python is to use BeautifulSoup. First parse the HTML, then get all the code tags, and replace each code tag whose content contains a space:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
...
>>> print(soup)
I want to remove   not sole <code>word</code>
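If installing BeautifulSoup is not an option, roughly the same thing can be sketched with the standard library's html.parser. This is a swapped-in technique, not the BeautifulSoup approach itself: the class SpacedCodeRemover and the function remove_spaced_code_parsed are hypothetical names, and the sketch assumes no nested code tags and drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class SpacedCodeRemover(HTMLParser):
    """Rebuilds the input, replacing <code> elements that contain whitespace."""

    def __init__(self):
        # Keep entities such as &lt; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []       # pieces of the final output
        self.buffer = None  # raw pieces of the open <code> element, if any
        self.text = []      # text seen inside the open <code> element

    def _emit(self, raw):
        target = self.out if self.buffer is None else self.buffer
        target.append(raw)

    def handle_starttag(self, tag, attrs):
        if tag == "code" and self.buffer is None:
            self.buffer, self.text = [], []
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "code" and self.buffer is not None:
            element = "".join(self.buffer)
            has_space = any(ch.isspace() for ch in "".join(self.text))
            self.buffer = None
            self.out.append(" " if has_space else element)

    def handle_data(self, data):
        if self.buffer is not None:
            self.text.append(data)
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def remove_spaced_code_parsed(html):
    parser = SpacedCodeRemover()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(remove_spaced_code_parsed(
    "I want to remove <code>tag with space</code> not sole <code>word</code>"
))
```

Unlike the p.string check, which is None when a code tag has child tags, this version looks at all text inside the element, so a code tag containing nested markup and a space is also replaced.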

Upvotes: 4

rofelia09

Reputation: 51

You can also strip the tags themselves by matching everything between < and >:

inputString = re.sub(r"<.*?>", " ", inputString)

It works in my case.
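A quick check of what this pattern does on the string from the question may help. It is only a sketch; the variable name stripped is illustrative.

```python
import re

inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"

# "<.*?>" matches each individual tag, not the element as a whole,
# so the text between the tags is kept.
stripped = re.sub(r"<.*?>", " ", inputString)
print(repr(stripped))
```

Every tag is replaced with a space while the content stays in place, so this fits when all markup should go, not when only code tags with spaces in their content should be removed.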

Upvotes: 0
