Ben

Reputation: 13352

Simplify nested HTML with BeautifulSoup

I am cleaning up arbitrary HTML for printing. I don't need to preserve the structure because I control the CSS selectors, and a simpler tree seems to cause fewer errors.

Is there an idiomatic way in BeautifulSoup to reduce nesting, or do I just need to do the hard yards and manage the tree myself?

As a very simplified example, can I make this:

from bs4 import BeautifulSoup

doc = """
<article>
    <div>
        <div>
            <section>
                <div>
                    <div>
                        <p>Hello</p>
                    </div>
                </div>
            </section>
            <section>
                <div>
                    <div>
                        <p>World!</p>
                    </div>
                </div>
            </section>
        </div>
    </div>
</article>
"""

soup = BeautifulSoup(doc, "html.parser")

print(soup.prettify())

produce this?

<article>
  <p>Hello</p>
  <p>World!</p>
</article>

I'm open to non-bs4 methods too; bs4 just seems to be the cleanest way to deal with HTML.

Upvotes: 1

Views: 542

Answers (1)

Sushil

Reputation: 5531

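This should help you. It collects the leaf elements (tags with no child tags) and wraps each one in its top-level ancestor inside <body>: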

from bs4 import BeautifulSoup

doc = """
<article>
    <div>
        <div>
            <section>
                <div>
                    <div>
                        <p>Hello</p>
                    </div>
                </div>
            </section>
            <section>
                <div>
                    <div>
                        <p>World!</p>
                    </div>
                </div>
            </section>
        </div>
    </div>
</article>
"""
# html5lib wraps the fragment in <html>, <head> and <body>
soup = BeautifulSoup(doc, 'html5lib')

# Leaf elements are tags that contain no child tags
leaves = [elem for elem in soup.find_all() if not elem.find_all()]

main_elements = []
final_parents = []

for elem in leaves:
    names = [p.name for p in elem.parents]
    # Skip leaves outside <body>, such as the empty <head>
    if 'body' not in names:
        continue
    main_elements.append(elem)
    # The ancestor just below <body> is the top-level wrapper
    # (assumes each leaf is nested in at least one element inside <body>)
    final_parents.append(names[names.index('body') - 1])

final = ""

for tag, elem in zip(final_parents, main_elements):
    final += f"<{tag}>"
    final += str(elem)
    final += f"</{tag}>"

print(final)

Output (line breaks added for readability):

<article>
<p>Hello</p>
</article>
<article>
<p>World!</p>
</article>

If you want the output in the format that you specified in your question, you can replace the last for loop with this:

# Open the wrapper once, add every leaf, then close it
final += f"<{final_parents[0]}>"

for elem in main_elements:
    final += str(elem)

final += f"</{final_parents[0]}>"

print(final)

Output (line breaks added for readability):

<article>
<p>Hello</p>
<p>World!</p>
</article>
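
As an alternative that stays closer to the "idiomatic way" asked about, BeautifulSoup's Tag.unwrap() replaces a tag with its own children, so removing the wrapper tags in place may be enough. The following is only a minimal sketch: the flatten helper and its default wrappers list are hypothetical names chosen for illustration, and treating only <div> and <section> as disposable wrappers is an assumption taken from the example in the question.

```python
from bs4 import BeautifulSoup

doc = """
<article>
    <div>
        <div>
            <section>
                <div><div><p>Hello</p></div></div>
            </section>
            <section>
                <div><div><p>World!</p></div></div>
            </section>
        </div>
    </div>
</article>
"""


def flatten(html, wrappers=("div", "section")):
    """Hypothetical helper: remove wrapper tags but keep their contents."""
    soup = BeautifulSoup(html, "html.parser")

    # find_all() returns a finished list, so unwrapping while looping is safe.
    for tag in soup.find_all(list(wrappers)):
        tag.unwrap()  # replace the tag with its children

    # Drop the whitespace-only strings left behind by the removed wrappers.
    for text in soup.find_all(string=True):
        if not text.strip():
            text.extract()

    return soup


print(str(flatten(doc)))
# <article><p>Hello</p><p>World!</p></article>
```

Dropping every whitespace-only string is fine for this example, but in real documents it can also remove meaningful spaces between inline elements, so that step may need to be narrowed. Calling prettify() on the result gives an indented version for inspection.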

Upvotes: 1
