Reputation: 484
I have an HTML source with multiple <sup></sup> tags, some of which also contain <a></a> tags. My goal is to delete all superscripts that contain a link and to wrap the superscripts without a link in square brackets. My code works to a certain extent: it decomposes the correct <a> tags, but it also wraps the now-empty <sup> tags in brackets. I tried to find a way to use find_all() with multiple AND conditions for the <sup> and <a> tags, but to no avail. P.S.: SO has a thread regarding multiple OR conditions.
from bs4 import BeautifulSoup
html = '''<h1>
<b> Heading </b>
<sup>
<a href="LINK_1">1</a>
</sup>
</h1>
<p>
<sup>2</sup>
This is a paragraph
<sup><a href="LINK_2">3</a></sup>
</p>'''
soup = BeautifulSoup(html, "html.parser")
# remove superscripts with links
for superscript in soup.find_all("sup"):
    for suplink in superscript.find_all("a"):
        suplink.decompose()
    # wrap remaining superscripts in brackets
    superscript.insert(0, "[")
    superscript.insert(len(superscript.contents), "]")
print(soup)
Result:
<h1>
<b> Heading </b>
<sup>[
]</sup>
</h1>
<p>
<sup>[2]</sup>
This is a paragraph
<sup>[]</sup>
</p>
What it should look like:
<h1>
<b> Heading </b>
</h1>
<p>
<sup>[2]</sup>
This is a paragraph
</p>
Upvotes: 2
Views: 454
Reputation: 71451
You can recursively traverse the source and update the <sup> tags accordingly:
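import bs4

def update_sup(d):
    if d.name == 'sup':
        # a <sup> with any non-text child (e.g. an <a>) is removed entirely
        if any(not isinstance(i, bs4.element.NavigableString) for i in d.contents):
            d.extract()
        else:
            # a text-only <sup> gets its text wrapped in brackets
            d.string = f'[{d.get_text(strip=True)}]'
    # iterate over a snapshot, since extract() modifies d.contents during the loop
    for i in list(filter(lambda x: not isinstance(x, bs4.element.NavigableString), d.contents)):
        update_sup(i)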
html = '''<h1>
<b> Heading </b>
<sup>
<a href="LINK_1">1</a>
</sup>
</h1>
<p>
<sup>2</sup>
This is a paragraph
<sup><a href="LINK_2">3</a></sup>
</p>'''
d = bs4.BeautifulSoup(html, 'html.parser')
update_sup(d)
print(d)
Output:
<h1>
<b> Heading </b>
</h1>
<p>
<sup>[2]</sup>
This is a paragraph
</p>
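A non-recursive variant, offered only as a sketch alongside the approach above: find_all() also accepts a function that receives each tag and returns True or False, which is one way to express the AND condition asked about (tag is a <sup> and contains an <a>). The names has_link and rewrite_superscripts are illustrative and not from the original post, and the sample markup is a condensed version of the question's HTML.

```python
from bs4 import BeautifulSoup

html = '''<h1><b> Heading </b><sup><a href="LINK_1">1</a></sup></h1>
<p><sup>2</sup>This is a paragraph<sup><a href="LINK_2">3</a></sup></p>'''


def has_link(tag):
    # AND condition: the tag is a <sup> and it has an <a> descendant
    return tag.name == "sup" and tag.find("a") is not None


def rewrite_superscripts(markup):
    soup = BeautifulSoup(markup, "html.parser")
    # first pass: drop every <sup> that contains a link
    for sup in soup.find_all(has_link):
        sup.decompose()
    # second pass: only link-free <sup> tags are left, so wrap their text
    for sup in soup.find_all("sup"):
        sup.string = f"[{sup.get_text(strip=True)}]"
    return str(soup)


result = rewrite_superscripts(html)
print(result)
```

Removing the linked superscripts before the wrapping step means the second loop never sees an emptied <sup>, which is the situation that produced the stray [] in the question.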
Upvotes: 1