Reputation: 2155
How can I remove "redundant" HTML tags inside a BeautifulSoup object?
Given the following example:
<html>
<body>
<div>
<div>
<div>
<div>
<div>
<div>
Close
</div>
</div>
</div>
</div>
</div>
<div>
<div>
<div style="width:80px">
<div>
</div>
<div>
<button>
Close
</button>
</div>
</div>
</div>
</div>
</div>
<div>
</div>
</body>
</html>
how can I remove the redundant <div> tags (redundant in that they only add depth and carry no additional information or attributes) to obtain the following structure:
<html>
<body>
<div>
Close
</div>
<div style="width:80px">
<button>
Close
</button>
</div>
</body>
</html>
In graph terms, I am trying to merge nodes of the BeautifulSoup tree that contain neither strings nor attributes into their parents.
Upvotes: 1
Views: 281
Reputation: 2155
I just created a code-snippet that seems to do the job:
for x in reversed(soup()):
    if not x.string and not x.attrs and len(x.findChildren(recursive=False)) <= 1:
        x.unwrap()
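The reversed() call is needed so that the innermost tags are handled first. Otherwise empty tags still count as siblings of the remaining child, so the parent has more than one child and is not unwrapped.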
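One thing to watch with the snippet above: x.string looks through a single child tag, so on markup without whitespace between tags a wrapper such as <div><div>Close</div></div> reports the string "Close" and is skipped. A minimal sketch of a variant that checks only a tag's own text follows. The helper names has_own_text and flatten_divs are made up for illustration, and limiting the loop to div tags (so that html and body are never unwrapped) is an assumption based on the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def has_own_text(tag):
    """Return True if the tag directly contains non-whitespace text."""
    return any(
        isinstance(child, NavigableString) and child.strip()
        for child in tag.children
    )


def flatten_divs(soup):
    """Unwrap divs that have no attributes, no text of their own,
    and at most one child tag (hypothetical helper)."""
    # Reverse document order: children are visited before their parents,
    # so empty divs are already gone when the parent is checked.
    for tag in reversed(soup.find_all("div")):
        child_tags = tag.find_all(True, recursive=False)
        if not tag.attrs and not has_own_text(tag) and len(child_tags) <= 1:
            tag.unwrap()
    return soup


html = (
    "<html><body>"
    "<div><div><div>Close</div></div></div>"
    "<div><div style='width:80px'><div></div>"
    "<div><button>Close</button></div></div></div>"
    "<div></div>"
    "</body></html>"
)
soup = flatten_divs(BeautifulSoup(html, "html.parser"))
print(soup.prettify())
```

On this input two divs should remain: the one holding the text "Close" and the styled one, which now contains the button directly.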
Upvotes: 0
Reputation: 30619
You can use unwrap() to replace any divs without attributes (i.e. div.attrs == {}) with their children:
for div in soup.find_all('div'):
    if not div.attrs:
        div.unwrap()
Output of print(soup.prettify()):
<html>
<body>
<button>
Close
</button>
<div style="width:80px">
<button>
Close
</button>
</div>
</body>
</html>
For the updated example (see comment) it would be:
for div in soup.find_all('div'):
    if not div.attrs and div.div:
        div.unwrap()
i.e. remove a div if it has no attributes and contains another div (div.div returns the first nested div, or None if there is none).
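As a runnable sketch of the updated loop: div.div is shorthand for div.find('div'), so the condition tests for a div nested anywhere inside the current one. The function name unwrap_wrapper_divs and the short sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def unwrap_wrapper_divs(soup):
    """Unwrap every attribute-free div that contains another div
    (hypothetical helper wrapping the loop above)."""
    for div in soup.find_all("div"):
        # div.div is None when no div is nested inside this one.
        if not div.attrs and div.div:
            div.unwrap()
    return soup


html = (
    "<body>"
    "<div><div><div>Close</div></div></div>"
    "<div><button>Close</button></div>"
    "</body>"
)
soup = unwrap_wrapper_divs(BeautifulSoup(html, "html.parser"))
print(soup.prettify())
```

Here the two outer wrappers around the text are removed, while the div around the button stays because it contains no nested div.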
Upvotes: 2