DaveTheAl

Reputation: 2155

Remove redundant beautifulsoup html tags

How can I remove "redundant" html tags inside a beautifulsoup object?

In the example of

<html>
 <body>
  <div>
   <div>
    <div>
     <div>
      <div>
       <div>
        Close
       </div>
      </div>
     </div>
    </div>
   </div>
   <div>
    <div>
     <div style="width:80px">
      <div>
      </div>
      <div>
       <button>
        Close
       </button>
      </div>
     </div>
    </div>
   </div>
  </div>
  <div>
  </div>
 </body>
</html>

how can I remove the redundant <div> tags (redundant in the sense that they only add depth and carry no additional information or attributes) to get the following structure:

<html>
 <body>
       <div>
        Close
       </div>
     <div style="width:80px">
       <button>
        Close
       </button>
     </div>
 </body>
</html>

In graph terms, I am trying to merge nodes in the BeautifulSoup tree that contain neither strings nor attributes.

Upvotes: 1

Views: 281

Answers (2)

DaveTheAl

Reputation: 2155

I just created a code-snippet that seems to do the job:

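# soup() is shorthand for soup.find_all(); walk the tags in reverse document order
for x in reversed(soup()):
    # unwrap tags with no string, no attributes and at most one direct child tag
    if not x.string and not x.attrs and len(x.find_all(recursive=False)) <= 1:
        x.unwrap()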

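The reversed() is needed so that inner tags are handled before their ancestors. Otherwise empty tags are still present and counted as children when the parent is checked, which blocks the unwrapping.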
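
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same idea. It is restricted to div tags and assumes the built-in html.parser. The helper names has_own_text and collapse_divs are my own (hypothetical, not part of BeautifulSoup), and the text check looks only at a tag's direct text nodes instead of using .string, so treat it as one possible variant rather than the definitive rule.

```python
from bs4 import BeautifulSoup, NavigableString

# Compact version of the HTML from the question.
HTML = """<html><body>
<div><div><div>Close</div></div></div>
<div><div style="width:80px"><div></div><div><button>Close</button></div></div></div>
<div></div>
</body></html>"""


def has_own_text(tag):
    """Return True if the tag directly contains non-whitespace text."""
    return any(
        isinstance(child, NavigableString) and child.strip()
        for child in tag.children
    )


def collapse_divs(soup):
    """Unwrap divs that carry no attributes, no text of their own,
    and at most one direct child tag (hypothetical helper)."""
    # Reverse document order: descendants are handled before ancestors,
    # so empty divs are already gone when their parent is checked.
    for div in reversed(soup.find_all("div")):
        if div.attrs or has_own_text(div):
            continue
        if len(div.find_all(True, recursive=False)) <= 1:
            div.unwrap()


soup = BeautifulSoup(HTML, "html.parser")
collapse_divs(soup)
print(len(soup.find_all("div")))  # 2
```

The two divs that remain are the one holding the text "Close" and the one with the style attribute, which still wraps the button.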

Upvotes: 0

Stef

Reputation: 30619

You can use unwrap() to replace any divs without attributes (i.e. div.attrs == {}) with their children:

for div in soup.find_all('div'):
    if not div.attrs:
        div.unwrap()

Output of print(soup.prettify()):

<html>
 <body>
  <button>
   Close
  </button>
  <div style="width:80px">
   <button>
    Close
   </button>
  </div>
 </body>
</html>

For the updated example (see comment) it would be:

for div in soup.find_all('div'):
    if not div.attrs and div.div:
        div.unwrap()

i.e. unwrap a div if it has no attributes and contains another div (div.div is the first div nested inside it, or None if there is none)
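
A quick check of what div.div evaluates to, since the condition depends on it: tag.div is shorthand for tag.find('div'), which searches the tag's descendants and not its siblings. The markup and the ids a, b and c below are made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "b" is nested inside "a", "c" is a sibling of "a".
html = '<div id="a"><span><div id="b"></div></span></div><div id="c"></div>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find(id="a")
c = soup.find(id="c")

inner = a.div                          # first descendant div of "a"
sibling = a.find_next_sibling("div")   # next div on the same level as "a"

print(inner["id"])    # b
print(sibling["id"])  # c
print(c.div)          # None
```

So a div that is merely followed by a sibling div, but has no div nested inside it, is left alone by the loop.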

Upvotes: 2
