Strip white-space from HTML before parsing

Question

I have a Python dictionary containing HTML that I would later like to parse using BeautifulSoup, but before parsing I would like to remove whitespace directly adjacent to tags.

For example:

>>> string = "text some texts  text some text"
>>> remove_whitespace(string)
'textsome textstextsome text'

Tim Pietzcker · Accepted Answer

Assuming that you're allowing any kind of tag name, and that tags never contain angle brackets within them, you can quickly solve this with a regex:

>>> import re
>>> string = "text some texts  text some text"
>>> regex = re.compile(r"\s*(<[^<>]+>)\s*")
>>> regex.sub(r"\g<1>", string)
'textsome textstextsome text'

Explanation:

\s*     # Match any number of whitespace characters
(       # Match and capture in group 1:
 <      # - an opening angle bracket
 [^<>]+ # - one or more characters except angle brackets
 >      # - a closing angle bracket
)       # End of group 1 (used to restore the matched text later)
\s*     # Match any number of whitespace characters
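
For reference, here is a minimal sketch of the same pattern wrapped in a function and applied to every value of a dictionary, as the question describes. The name remove_whitespace comes from the question; the pages dictionary and its contents are made up for illustration. The pattern is written with re.VERBOSE so the comments above sit inside the regex itself, and the replacement is a raw string so that \g<1> is not treated as a string escape.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so the
# comments live next to the parts they describe.
TAG_WITH_PADDING = re.compile(
    r"""
    \s*          # any whitespace before the tag
    (            # group 1: the tag itself
      <[^<>]+>   #   "<", one or more non-bracket characters, ">"
    )
    \s*          # any whitespace after the tag
    """,
    re.VERBOSE,
)


def remove_whitespace(html):
    """Drop whitespace directly adjacent to tags; keep the tags themselves."""
    return TAG_WITH_PADDING.sub(r"\g<1>", html)


# Hypothetical sample data, not taken from the question.
pages = {
    "intro": "text <p> some text </p>  <b> more </b> text",
    "empty": "no tags here",
}
cleaned = {key: remove_whitespace(value) for key, value in pages.items()}
print(cleaned["intro"])  # text<p>some text</p><b>more</b>text
```

Note that this treats every tag the same way, so it will also strip whitespace next to tags inside elements like <pre>, where whitespace is significant. If that matters for your input, this approach is probably too blunt.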
