gammapoint

Reputation: 1153

Elegant way to replace the tag on a sequence of bs4 soup tags and wrap them all in another tag

I have an HTML string like so:

test = """<body>
    <h1>A header</h1>
    <p> Just a normal paragraph <b>before</b> </p>
    <p>• test element</p>
    <p>• test element2</p>
    <p> Following <i>stuff</i></p>

    </body>
    """

The user has typed the bullet character (u'\u2022') explicitly instead of using a list. I would like to convert this to the following HTML:

<body>
    <h1>A header</h1>
    <p> Just a normal paragraph <b>before</b> </p>
    <ul>
    <li>test element</li>
    <li>test element2</li>
    </ul>
    <p> Following <i>stuff</i></p>

</body>

What is the most elegant way to approach this? I can identify when these bulleted items occur with a simple .find on the tag string. I can remove the bullets and wrap each item in <li> tags. But I don't know how to iterate through the soup and then wrap all the bullets in a single <ul> tag. If I could iterate through the soup like a normal list and build a modified list with new elements, I could do something like this pseudo-code:

new_soup = []
bullets = []
for tag in soup:
   if has_bullet(tag):
       #  start storing tags
       bullets.append(tag)
   else:
       if bullets: # if we have some bullets to dump
           new_soup.append(ul_tag_start)
           new_soup.extend(modify_text(bullets))
           new_soup.append(ul_tag_end)
           # clear bullets list
           bullets = []
       new_soup.append(tag)
# (a trailing run of bullets would need the same flush here)

but I don't know how to write a new soup element by element, and I wonder if there's a better way using bs4's various insert, insert_before, insert_after, etc. methods.

Upvotes: 1

Views: 110

Answers (1)

Ajax1234

Reputation: 71471

You can use recursion:

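import re
import bs4
from bs4 import BeautifulSoup as soup
test = """<h1>A header</h1><p> Just a normal paragraph <b>before</b> </p><p>• test element</p><p>• test element2<b>something else</b></p><p> Following <i>stuff</i></p>"""
def form_ul(d):
   # build a <ul> tag from a list of inner-HTML strings, one <li> per string
   return soup('<ul>{}</ul>'.format('\n'.join(f'<li>{i}</li>' for i in d)), 'html.parser').ul

def to_ul(d):
   # c: new contents for d; l: inner HTML of the current run of bulleted tags
   c,l = [],[]
   for i in d.contents:
      if isinstance(i, bs4.NavigableString):
         c.append(i)
      else:
         if str(i.get_text(strip=True)).startswith(u'\u2022'):
            # bulleted tag: keep its children as HTML, minus the bullet
            l.append('\n'.join(j.replace(u'\u2022 ', '') if isinstance(j, bs4.NavigableString) else str(j) for j in i.contents))
         else:
            if l:
               # the run of bullets has ended: emit one <ul> for it
               c.append(form_ul(l))
               l = []
            to_ul(i)  # recurse into non-bulleted tags
            c.append(i)
   if l:
      # flush a run of bullets that reaches the end of the contents
      c.append(form_ul(l))
   # drop newline-only strings and replace the contents in one go
   d.contents = [j for j in c if not re.findall('^\n+$', str(j))]

html = soup(test, 'html.parser')
to_ul(html)
print(html.prettify())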

Output:

<h1>
 A header
</h1>
<p>
 Just a normal paragraph
 <b>
  before
 </b>
</p>
<ul>
 <li>
  test element
 </li>
 <li>
  test element2
  <b>
   something else
  </b>
 </li>
</ul><p>
 Following
 <i>
  stuff
 </i>
</p>
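
As an alternative to the recursive version above, here is a minimal sketch that edits the tree in place with new_tag, insert_before and append, the kind of methods the question asks about. The helper names is_bullet and bullets_to_lists are made up for illustration. It assumes that every bullet is a <p> that is a direct child of the container and that the first non-blank text inside it starts with the bullet character. It does not recurse into nested tags.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

BULLET = "\u2022"

def is_bullet(node):
    # A bulleted item is a <p> tag whose text starts with the bullet character.
    return (isinstance(node, Tag) and node.name == "p"
            and node.get_text().lstrip().startswith(BULLET))

def bullets_to_lists(soup, container):
    """Wrap each run of consecutive bulleted <p> children of `container` in one <ul>."""
    ul = None  # the <ul> collecting the current run, or None outside a run
    for node in list(container.children):  # snapshot: the tree is modified below
        if is_bullet(node):
            if ul is None:
                # First bullet of a run: put a new <ul> where the run starts.
                ul = soup.new_tag("ul")
                node.insert_before(ul)
            # Strip the bullet from the first non-blank string inside the tag.
            first = next(s for s in node.find_all(string=True) if s.strip())
            first.replace_with(NavigableString(first.lstrip().lstrip(BULLET).lstrip()))
            node.name = "li"   # turn <p> into <li>, keeping its children
            ul.append(node)    # append moves the tag out of the container
        elif isinstance(node, Tag) or node.strip():
            # Any other tag, or non-blank text, ends the current run.
            ul = None
        # Whitespace-only strings between bullets do not end a run.

html = ("<body><h1>A header</h1><p> Just a normal paragraph <b>before</b> </p>"
        "<p>\u2022 test element</p>\n<p>\u2022 test element2<b>something else</b></p>"
        "<p> Following <i>stuff</i></p></body>")
soup = BeautifulSoup(html, "html.parser")
bullets_to_lists(soup, soup.body)
print(soup.prettify())
```

Because the tags are moved with bs4's own methods rather than by assigning to .contents, the sibling and parent links stay consistent after the change.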

Upvotes: 1
