quadratecode

Reputation: 484

Beautifulsoup find_all() with multiple AND conditions

I have an HTML source with multiple <sup></sup> tags, some of which also contain <a></a> tags. My goal is to delete every superscript that contains a link and to wrap the superscripts without a link in square brackets. My code works to a certain extent: it decomposes the correct <a> tags, but it also wraps the now-empty <sup> tags in brackets. I tried to find a way to use find_all() with multiple AND conditions for the <sup> and <a> tags, but to no avail. P.S.: SO has a thread regarding multiple OR conditions.

from bs4 import BeautifulSoup

html = '''<h1>
        <b> Heading </b>
      <sup>
        <a href="LINK_1">1</a>
      </sup>
    </h1>
    <p>
      <sup>2</sup>
      This is a paragraph
      <sup><a href="LINK_2">3</a></sup>
    </p>'''

soup = BeautifulSoup(html, "html.parser")

# remove superscripts with links
for superscript in soup.find_all("sup"):
    for suplink in superscript.find_all("a"):
        suplink.decompose()

    # wrap remaining superscripts in brackets
    superscript.insert(0, "[")
    superscript.insert(len(superscript.contents), "]")

print(soup)

Result:

    <h1>
    <b> Heading </b>
    <sup>[
    
    ]</sup>
    </h1>
    <p>
    <sup>[2]</sup>
        This is a paragraph
        <sup>[]</sup>
    </p>

What it should look like:

    <h1>
    <b> Heading </b>
    </h1>
    <p>
    <sup>[2]</sup>
          This is a paragraph
    </p>

Upvotes: 2

Views: 454

Answers (1)

Ajax1234

Reputation: 71451

You can recursively traverse the source and update the sups accordingly:

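import bs4

def update_sup(d):
    if d.name == 'sup':
        # any non-string child (e.g. an <a> tag) means the superscript is dropped
        if any(not isinstance(i, bs4.element.NavigableString) for i in d.contents):
            d.extract()
        else:
            # text-only superscript: wrap its text in square brackets
            d.string = f'[{d.get_text(strip=True)}]'
    # iterate over a copy, since extract() mutates the parent's contents list
    for i in [c for c in d.contents if not isinstance(c, bs4.element.NavigableString)]:
        update_sup(i)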

html = '''<h1>
    <b> Heading </b>
  <sup>
    <a href="LINK_1">1</a>
  </sup>
</h1>
<p>
  <sup>2</sup>
  This is a paragraph
  <sup><a href="LINK_2">3</a></sup>
</p>'''
d = bs4.BeautifulSoup(html, 'html.parser')
update_sup(d)
print(d)

Output:

<h1>
<b> Heading </b>

</h1>
<p>
<sup>[2]</sup>
  This is a paragraph
  
</p>
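
For the "multiple AND conditions" part of the question, find_all() also accepts a function that receives each tag and returns True or False, so the conditions can be combined with a plain `and`. The following is only a minimal sketch of that idea, not the approach used above: the helper name `is_sup_with_link` is made up for illustration, and the HTML is a compacted version of the sample from the question.

```python
from bs4 import BeautifulSoup

html = """<h1><b> Heading </b><sup><a href="LINK_1">1</a></sup></h1>
<p><sup>2</sup> This is a paragraph <sup><a href="LINK_2">3</a></sup></p>"""

soup = BeautifulSoup(html, "html.parser")


def is_sup_with_link(tag):
    # hypothetical helper: AND condition, the tag is a <sup>
    # and it has at least one <a> descendant
    return tag.name == "sup" and tag.find("a") is not None


# find_all() returns a list of matches, so removing tags
# while looping over that list does not skip any of them
for sup in soup.find_all(is_sup_with_link):
    sup.decompose()

# every <sup> still in the tree has no link; wrap its text in brackets
for sup in soup.find_all("sup"):
    sup.string = f"[{sup.get_text(strip=True)}]"

result = str(soup)
print(result)
```

Doing the removal and the wrapping in two separate passes avoids the problem from the question, where the now-empty <sup> tags were still wrapped in brackets.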

Upvotes: 1
