Reputation: 1863
I am trying to clean some HTML data with a regular expression in Python. Given an input string with HTML tags, I want to remove a tag together with its content if the content contains a space. The requirement looks like this:
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = regexProcess(inputString)
print outputString
>>I want to remove not sole <code>word</code>
The regex re.sub("<code>.+?</code>", " ", inputString) removes every code tag, including the ones whose content has no space. How can I improve it, or are there other methods?
Thanks in advance.
Upvotes: 1
Views: 646
Reputation: 140256
It is a bad idea to parse HTML with regex. However, if your HTML is simple enough, you could do this:
re.sub(r"<code>[^<]*\s[^<]*</code>", " ", inputString)
We're looking for at least one whitespace character somewhere inside the tag. To make it work with several code tags on the same line, I've excluded the < character from the content (it cannot appear in the text of a tag, since its escaped form is &lt;).
OK, it's still a hack; a proper HTML parser is preferred.
Small test:
inputString = "<code>hello </code> <code>world</code> <code>hello world</code> <code>helloworld</code>"
I get:
<code>world</code> <code>helloworld</code>
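The small test above can be turned into a runnable check. This is only a sketch: the helper name drop_spaced_code and the compiled constant are made up for illustration, and the last line shows the limitation that comes with [^<]*, namely that a code tag containing other markup is left alone.

```python
import re

# Pattern from the answer: a <code> element whose content has no "<"
# and contains at least one whitespace character.
CODE_WITH_SPACE = re.compile(r"<code>[^<]*\s[^<]*</code>")


def drop_spaced_code(text):
    """Hypothetical helper: replace matching <code> elements with one space."""
    return CODE_WITH_SPACE.sub(" ", text)


samples = ["hello ", "world", "hello world", "helloworld"]
html = " ".join("<code>%s</code>" % s for s in samples)
result = drop_spaced_code(html)

# split() ignores the leftover runs of spaces and shows what survived
print(result.split())

# Limitation: nested markup contains "<", so this element is NOT removed
print(drop_spaced_code("<code>a <b>b</b></code>"))
```

The leftover spaces are not collapsed by the substitution itself, so the raw result has runs of several spaces where tags were removed.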
Upvotes: 1
Reputation: 627103
Using regex with HTML is fraught with issues, so you should be aware of the possible consequences. Your <code>.+?</code> regex will only work if the <code> and </code> tags are on one line and there are no nested <code> tags inside them.
Assuming there are no nested code tags, you might extend your current approach:
import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)
The re.S flag lets . match line breaks, and the lambda checks each match: any code tag that contains a space in its text is replaced with a single space, otherwise it is kept as is.
See this Python demo
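If tabs or newlines inside a code tag should count as well, the check in the lambda can be swapped for a \s search. A minimal sketch under the same no-nesting assumption; the function name remove_code_with_whitespace and its replacement parameter are made up for illustration:

```python
import re

# Same pattern as above, compiled once; re.S lets "." match newlines.
CODE_RE = re.compile(r"<code>(.+?)</code>", flags=re.S)


def remove_code_with_whitespace(html, replacement=" "):
    """Hypothetical helper: drop <code> elements whose text has any whitespace."""

    def decide(match):
        # \s matches spaces, tabs and line breaks, not only " "
        if re.search(r"\s", match.group(1)):
            return replacement
        return match.group(0)  # keep the element untouched

    return CODE_RE.sub(decide, html)


text = "I want to remove <code>tag with space</code> not sole <code>word</code>"
print(remove_code_with_whitespace(text))

multiline = "<code>two\nlines</code> and <code>one</code>"
print(remove_code_with_whitespace(multiline))
```

Note that the removed tag leaves its surrounding spaces behind, so the result contains a run of three spaces unless you normalise whitespace afterwards.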
A more common way to parse HTML in Python is to use BeautifulSoup. First parse the HTML, then get all the code tags, and replace each code tag whose text contains a space:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
...
>>> print(soup)
I want to remove not sole <code>word</code>
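One thing to keep in mind with the snippet above: p.string is None when a code tag has more than one child (for example text plus a <b> element), so such tags are skipped. If you cannot install BeautifulSoup, roughly the same filtering can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not what the answer uses; the class CodeFilter and the function filter_code are hypothetical names, nested code tags are assumed not to occur, and comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser


class CodeFilter(HTMLParser):
    """Rebuild the input, dropping <code> elements whose text has whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []       # pieces of the rebuilt document
        self.buffer = None  # raw pieces of the open <code> element, else None
        self.text = []      # text seen inside the open <code> element

    def _emit(self, raw):
        # Route raw markup to the open <code> buffer or to the output
        (self.out if self.buffer is None else self.buffer).append(raw)

    def handle_starttag(self, tag, attrs):
        if tag == "code" and self.buffer is None:
            self.buffer, self.text = [], []
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        raw = "</%s>" % tag
        if tag == "code" and self.buffer is not None:
            self.buffer.append(raw)
            element = "".join(self.buffer)
            has_space = any(ch.isspace() for ch in "".join(self.text))
            self.buffer = None
            self.out.append(" " if has_space else element)
        else:
            self._emit(raw)

    def handle_data(self, data):
        if self.buffer is not None:
            self.text.append(data)
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)  # keep entities such as &lt; as written

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


def filter_code(html):
    parser = CodeFilter()
    parser.feed(html)
    parser.close()
    if parser.buffer is not None:  # unclosed <code>: keep what was read
        parser.out.extend(parser.buffer)
    return "".join(parser.out)


print(filter_code("I want to remove <code>tag with space</code> not sole <code>word</code>"))
print(repr(filter_code("<code>a <b>b</b></code>")))
```

Because the text is collected across child elements, the second call removes a code tag that mixes text and markup, which the p.string check would leave in place.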
Upvotes: 4
Reputation: 51
You can also remove tags by matching everything between an opening < and a closing >:
inputString = re.sub(r"<.*?>", " ", inputString)
This works in my case.
Upvotes: 0