Reputation: 179
Is there any way I can control the depth of unwrapping? My HTML files sometimes contain CSS, and prettify() puts every tag on its own line. I would like to convert:
<html><body><h1>hello world</h1></body></html>
to:
<html>
<body><h1>hello world</h1></body>
</html>
My code:
from bs4 import BeautifulSoup
INPUT_FILE = "html_unformatted.txt"
OUTPUT_FILE = "index.html"
unicode_data = open(INPUT_FILE, "r", encoding='unicode_escape').read()
data = unicode_data.encode('iso-8859-1').decode('utf-8')
soup = BeautifulSoup(data, features="html.parser")
pretty_html = soup.prettify()
with open(OUTPUT_FILE, "w") as f:
    f.write(pretty_html)
print(f"Wrote to {OUTPUT_FILE}")
But I get:
<html>
<body>
<h1>
hello world
</h1>
</body>
</html>
Upvotes: 1
Views: 864
Reputation: 758
Unfortunately, according to the beautifulsoup docs, customising the prettify function is not an option. However, one could wrap soup.prettify in another function and replace the "pretty" text with its one-line version. That is what prettify_except below does, i.e. it prettifies everything except the contents of tag_name:
from bs4 import BeautifulSoup
import re
html = "<html><body><h1>hello world</h1></body></html>"
soup = BeautifulSoup(html, features="html.parser")
print(soup.prettify())
def prettify_except(soup_obj: BeautifulSoup, tag_name: str) -> str:
    # Match the entire <tag_name>...</tag_name> block in the prettified output
    regex_string = r"<{0}>.*</{0}>".format(tag_name)
    regex = re.compile(regex_string, re.DOTALL)
    # str() renders the tag and its children on a single line
    replacing_txt = str(getattr(soup_obj, tag_name))
    return re.sub(regex, replacing_txt, soup_obj.prettify())
print(prettify_except(soup, 'body'))
# original prettified
# <html>
# <body>
# <h1>
# hello world
# </h1>
# </body>
# </html>
# prettified, except body
# <html>
# <body><h1>hello world</h1></body>
# </html>
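To plug this into the file-based script from the question, call the helper just before writing the output. A minimal sketch, reusing the question's INPUT_FILE / OUTPUT_FILE names, assuming the input file is plain UTF-8 and that prettify_except is defined as above:
from bs4 import BeautifulSoup

INPUT_FILE = "html_unformatted.txt"
OUTPUT_FILE = "index.html"

# read and parse the unformatted HTML
with open(INPUT_FILE, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), features="html.parser")

# prettify everything except the body, which stays on one line
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    f.write(prettify_except(soup, "body"))
Note that the regex assumes the opening tag has no attributes; if your real markup contains e.g. <body class="...">, the pattern would need to allow for them (for instance <{0}[^>]*> instead of <{0}>).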
Upvotes: 3