DevEx

Reputation: 4571

Strip white-space from HTML before parsing

I have a Python dictionary containing HTML that I would like to parse later with BeautifulSoup, but before parsing I would like to remove any whitespace directly adjacent to tags.

For example:

>>> string = "text <tag>some texts</tag> <tag> text</tag> some text"
>>> remove_whitespace(string)
'text<tag>some texts</tag><tag>text</tag>some text'

Upvotes: 0

Views: 1399

Answers (1)

Tim Pietzcker

Reputation: 336408

Assuming that any tag name is allowed and that no tag contains angle brackets inside it (for example, in an attribute value), you can solve this quickly with a regex:

>>> import re
>>> string = "text <tag>some texts</tag> <tag> text</tag> some text"
>>> regex = re.compile(r"\s*(<[^<>]+>)\s*")
>>> regex.sub(r"\g<1>", string)
'text<tag>some texts</tag><tag>text</tag>some text'

Explanation:

\s*     # Match any number of whitespace characters
(       # Match and capture in group 1:
 <      # - an opening angle bracket
 [^<>]+ # - one or more characters except angle brackets
 >      # - a closing angle bracket
)       # End of group 1 (used to restore the matched text later)
\s*     # Match any number of whitespace characters
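
To tie this back to the question, here is a minimal sketch that wraps the regex in the remove_whitespace function from the example and applies it to every value of a dictionary. The names TAG_WITH_PADDING, pages and cleaned are made up for illustration, and the sketch assumes the dictionary values are plain HTML strings:

```python
import re

# Optional whitespace, one tag with no angle brackets inside it, optional whitespace.
TAG_WITH_PADDING = re.compile(r"\s*(<[^<>]+>)\s*")


def remove_whitespace(html):
    """Strip whitespace directly adjacent to tags, keeping the tags themselves."""
    # Raw string, so that \g<1> reaches the re module unchanged.
    return TAG_WITH_PADDING.sub(r"\g<1>", html)


# Hypothetical dictionary of HTML fragments (assumed shape, not from the question).
pages = {
    "a": "text <tag>some texts</tag> <tag> text</tag> some text",
    "b": "<p> hello </p>\n<p>world</p>",
}

# Build a new dictionary with the cleaned HTML, leaving the original untouched.
cleaned = {key: remove_whitespace(value) for key, value in pages.items()}
```

Note that this also strips whitespace inside elements such as <pre>, where it is significant, so it is only safe if that does not matter for your data.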

Upvotes: 1
