Reputation: 13352
I am cleaning up arbitrary HTML for printing. I don't need to preserve the structure because I control the CSS selectors, and a simpler tree seems to cause fewer errors.
Is there an idiomatic way in BeautifulSoup to reduce nesting, or do I just need to do the hard yards and manage the tree myself?
As a very simplified example, can I make this:
from bs4 import BeautifulSoup
doc = """
<article>
<div>
<div>
<section>
<div>
<div>
<p>Hello</p>
</div>
</div>
</section>
<section>
<div>
<div>
<p>World!</p>
</div>
</div>
</section>
</div>
</div>
</article>
"""
soup = BeautifulSoup(doc, "html.parser")
print(soup.prettify())
return this:
<article>
<p>Hello</p>
<p>World!</p>
</article>
I'm open to non-bs4 methods too; bs4 just seems to be the cleanest way to deal with HTML.
Upvotes: 1
Views: 542
Reputation: 5531
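This should help you (it uses the html5lib parser, which has to be installed separately, e.g. with pip install html5lib):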
from bs4 import BeautifulSoup
doc = """
<article>
<div>
<div>
<section>
<div>
<div>
<p>Hello</p>
</div>
</div>
</section>
<section>
<div>
<div>
<p>World!</p>
</div>
</div>
</section>
</div>
</div>
</article>
"""
# html5lib wraps the fragment in <html><head></head><body>...</body></html>
soup = BeautifulSoup(doc, "html5lib")

main_elements = []  # leaf tags, i.e. tags that contain no other tags
parents = []        # ancestor tag names of each leaf, innermost first
for elem in soup.find_all():
    if not elem.find_all():
        ancestors = [p.name for p in elem.parents]
        # Skip leaves outside <body>, such as the empty <head> html5lib adds.
        # Filtering here avoids removing items from a list while iterating over it.
        if "body" in ancestors:
            main_elements.append(elem)
            parents.append(ancestors)

# The name just before 'body' is the leaf's outermost ancestor inside <body>
final_parents = [names[names.index("body") - 1] for names in parents]

final = ""
for tag, elem in zip(final_parents, main_elements):
    final += f"<{tag}>{elem}</{tag}>"
print(final)
Output (line breaks added for readability):
<article>
<p>Hello</p>
</article>
<article>
<p>World!</p>
</article>
If you want the output in the format specified in your question, replace the last for loop with this:
# Open the outermost tag once, append every leaf, then close it
final += f"<{final_parents[0]}>"
for elem in main_elements:
    final += str(elem)
final += f"</{final_parents[0]}>"
print(final)
Output (line breaks added for readability):
<article>
<p>Hello</p>
<p>World!</p>
</article>
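For comparison, here is a shorter sketch of the same flattening done in place with bs4's Tag.unwrap(), which replaces a tag with its own children. It assumes that every div and section is a pure wrapper that can be dropped; the flatten helper and its wrappers parameter are made-up names for illustration, not part of bs4.

```python
from bs4 import BeautifulSoup

# Compact version of the question's document (no whitespace between tags)
doc = (
    "<article><div><div>"
    "<section><div><div><p>Hello</p></div></div></section>"
    "<section><div><div><p>World!</p></div></div></section>"
    "</div></div></article>"
)


def flatten(html, wrappers=("div", "section")):
    """Hypothetical helper: remove wrapper tags but keep their contents."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list-like snapshot, so modifying the tree
    # while looping over the result is safe.
    for tag in soup.find_all(list(wrappers)):
        tag.unwrap()  # replace the tag with its children
    return soup


flat = flatten(doc)
print(flat)
```

With the indented document from the question, the newlines between the wrappers are text nodes and stay in the tree, so the result will contain extra blank lines unless the whitespace is stripped separately.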
Upvotes: 1