Reputation: 18705
I want to remove elements that wrap <img>
tags but keep it inside <article>
element:
So from this:
<article>
<p>...
<br>
<br>
<strong>...
<span>
<div>
<img ....
</div>
</span>
<p>...
Make this:
<article>
<p>...
<br>
<br>
<strong>...
<img ....
<p>...
Without knowing how many and what tags is the <img>
nested.
I was thinking about finding the first ancestor before <article>
, remove it and append the copy of <img>
tag.
The problem is that append
adds it at the bottom of the article.
Do you know how to do that?
Upvotes: 2
Views: 255
Reputation: 52858
Here's another lxml option using addnext()
Python
from lxml import html
html_string = """
<article>
<p></p>
<br>
<br>
<strong></strong>
<span>
<div>
<img src='http://something.com'>
</div>
</span>
<p></p>
</article>
"""
root = html.fromstring(html_string)
for ancestor in root.xpath("/html/body/article/*[.//img]"):
for img in ancestor.xpath(".//img"):
ancestor.addnext(img)
ancestor.getparent().remove(ancestor)
Printed Output
<article>
<p></p>
<br>
<br>
<strong></strong>
<img src="http://something.com">
<p></p>
</article>
Upvotes: 1
Reputation: 5281
Using xpath
to check if a node contains img
and tostring
could be interesting for your use-case:
import lxml.html
root = lxml.html.fromstring("""
<article>
<p></p>
<br>
<br>
<strong></strong>
<span>
<div>
<img src='http://something.com'>
</div>
</span>
<p></p>
</article>
""")
newroot = []
for _ in root:
imgs = _.xpath(".//img")
newroot.extend(imgs or [_])
sourcecode = "".join(lxml.html.tostring(_).decode() for _ in newroot)
"""
<p></p>
<br>
<br>
<strong></strong>
<img src="http://something.com">
<p></p>
"""
Upvotes: 1
Reputation: 71451
You can use soup.insert
with recursion:
import bs4
from bs4 import BeautifulSoup as soup
def has_img(d):
if d.name == 'img':
return d
return None if len((j:=[i for i in getattr(d, 'contents', []) if isinstance(i, bs4.element.Tag)])) != 1 \
else has_img(j[0])
def remove_wrapping(d):
for i, a in enumerate(getattr(d, 'contents', [])):
if a.name != 'img' and (img:=has_img(a)) is not None:
a.extract()
d.insert(i, img)
else:
remove_wrapping(a)
s = """
<article>
<p></p>
<br>
<br>
<strong></strong>
<span>
<div>
<img src='http://something.com'>
</div>
</span>
<p></p>
</article>
"""
d = soup(s, 'html.parser').article
remove_wrapping(d)
print(d)
Output:
<article>
<p></p>
<br/>
<br/>
<strong></strong>
<img src="http://something.com"/>
<p></p>
</article>
Upvotes: 0