Reputation: 1153
I have an HTML string like so:
test = """<body>
<h1>A header</h1>
<p> Just a normal paragraph <b>before</b> </p>
<p>• test element</p>
<p>• test element2</p>
<p> Following <i>stuff</i></p>
</body>
"""
The user has explicitly typed the u'\u2022' bullet character instead of using a list. I would like to convert it to the following HTML:
<body>
<h1>A header</h1>
<p> Just a normal paragraph <b>before</b> </p>
<ul>
<li>test element</li>
<li>test element2</li>
</ul>
<p> Following <i>stuff</i></p>
</body>
What is the most elegant way to approach this? I can identify when these bulleted items occur with a simple .find on the tag string, and I can remove the bullets and wrap the items in <li> tags. But I don't know how to iterate through the soup and then wrap all the bullets in a single <ul> tag. If I could iterate through the soup like a normal list and build a modified list with new elements, I could do something like this pseudo-code:
    new_soup = []
    bullets = []
    for tag in soup:
        if has_bullet(tag):
            # start storing tags
            bullets.append(tag)
        else:
            if bullets:  # if we have some bullets to dump
                new_soup.append(ul_tag_start)
                new_soup.extend(modify_text(bullets))
                new_soup.append(ul_tag_end)
                # clear bullets list
                bullets = []
            new_soup.append(tag)
But I don't know how to write a new soup element by element, and I wonder if there's a better way using bs4's various insert, insert_before, insert_after, etc. methods.
Upvotes: 1
Views: 110
Reputation: 71471
You can use recursion:
    import bs4, re
    from bs4 import BeautifulSoup as soup

    test = """<h1>A header</h1><p> Just a normal paragraph <b>before</b> </p><p>• test element</p><p>• test element2<b>something else</b></p><p> Following <i>stuff</i></p>"""

    def form_ul(d):
        # build a <ul> tag from a list of inner-HTML strings, one <li> per string
        return soup('<ul>{}</ul>'.format('\n'.join(f'<li>{i}</li>' for i in d)), 'html.parser').ul

    def to_ul(d):
        # c: new children of d, l: pending bullet items not yet wrapped in a <ul>
        c, l = [], []
        for i in d.contents:
            if isinstance(i, bs4.NavigableString):
                c.append(i)
            else:
                if str(i.get_text(strip=True)).startswith(u'\u2022'):
                    # bulleted tag: drop the bullet and keep its inner HTML
                    l.append('\n'.join(j.replace(u'\u2022 ', '') if isinstance(j, bs4.NavigableString) else str(j) for j in i.contents))
                else:
                    if l:
                        # a non-bullet tag ends the current run of bullets
                        c.append(form_ul(l))
                        l = []
                    to_ul(i)
                    c.append(i)
        if l:
            c.append(form_ul(l))
        # drop newline-only strings and replace the children in one go
        d.contents = [j for j in c if not re.findall('^\n+$', str(j))]

    html = soup(test, 'html.parser')
    to_ul(html)
    print(html.prettify())
Output:
<h1>
A header
</h1>
<p>
Just a normal paragraph
<b>
before
</b>
</p>
<ul>
<li>
test element
</li>
<li>
test element2
<b>
something else
</b>
</li>
</ul><p>
Following
<i>
stuff
</i>
</p>
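For comparison, here is a minimal sketch of the approach the question hints at, using bs4's tree-editing methods (new_tag, insert_before, extract, append) instead of rebuilding .contents. The helper names wrap_bullets and to_li are made up for illustration, and the sketch assumes the bulleted <p> tags are direct children of the container you pass in (it does not recurse):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

BULLET = "\u2022"


def to_li(p):
    """Detach a bulleted <p>, rename it to <li>, and strip the leading bullet."""
    li = p.extract()
    li.name = "li"
    for s in li.strings:
        if s.strip():
            # first non-blank text node: remove the bullet and surrounding spaces
            s.replace_with(s.lstrip()[len(BULLET):].lstrip())
            break
    return li


def wrap_bullets(container, soup):
    """Group each run of consecutive bulleted <p> children into one <ul>."""
    ul = None
    # iterate over a snapshot, because the loop moves children around
    for child in list(container.children):
        if isinstance(child, NavigableString):
            if child.strip():
                ul = None  # real text between paragraphs ends the run
            continue       # whitespace between bullets does not end it
        is_bullet = (
            isinstance(child, Tag)
            and child.name == "p"
            and child.get_text(strip=True).startswith(BULLET)
        )
        if not is_bullet:
            ul = None
            continue
        if ul is None:
            ul = soup.new_tag("ul")
            child.insert_before(ul)
        ul.append(to_li(child))


html = """<body>
<h1>A header</h1>
<p> Just a normal paragraph <b>before</b> </p>
<p>\u2022 test element</p>
<p>\u2022 test element2</p>
<p> Following <i>stuff</i></p>
</body>
"""
doc = BeautifulSoup(html, "html.parser")
wrap_bullets(doc.body, doc)
print(doc)
```

Because the tags are moved rather than re-parsed from strings, any nested markup inside a bullet (such as a <b>) is carried over as-is. The newline strings that sat between the original paragraphs are left behind after the <ul>, so the printed whitespace may not match the desired output exactly.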
Upvotes: 1