Milano

Reputation: 18705

LXML - remove element's parents while keeping the element

I want to remove the elements that wrap <img> tags but keep the images themselves inside the <article> element.

So from this:

<article> 
<p>...
<br>
<br>
<strong>...
<span>
    <div>
        <img ....
    </div>
</span>
<p>...

Make this:

<article> 
<p>...
<br>
<br>
<strong>...
<img ....
<p>...

I don't know in advance how many tags, or which ones, the <img> is nested in.

I was thinking about finding the outermost ancestor below <article>, removing it, and appending a copy of the <img> tag.

The problem is that append adds the image at the bottom of the article instead of at its original position.

Do you know how to do that?

Upvotes: 2

Views: 255

Answers (3)

Daniel Haley

Reputation: 52858

Here's another lxml option, using addnext():


from lxml import html

html_string = """
<article> 
 <p></p>
 <br>
 <br>
 <strong></strong>
 <span>
   <div>
      <img src='http://something.com'>
   </div>
 </span>
 <p></p>
</article>
"""

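root = html.fromstring(html_string)

# Select each direct child of <article> that contains an <img> somewhere inside it.
for ancestor in root.xpath("/html/body/article/*[.//img]"):
    for img in ancestor.xpath(".//img"):
        # addnext() moves the <img> (and its tail text) to just after the wrapper.
        ancestor.addnext(img)
    # Remove the wrapper, which no longer holds any image.
    ancestor.getparent().remove(ancestor)

print(html.tostring(root).decode())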

Printed Output

<article> 
 <p></p>
 <br>
 <br>
 <strong></strong>
 <img src="http://something.com">
   
 <p></p>
</article>
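
One caveat with addnext(): each image is placed directly after the wrapper, so if a single wrapper holds several images they come out in reverse order. Below is a minimal sketch of a variant that keeps document order by using addprevious() instead. The function name unwrap_images and the sample markup are made up for illustration, and it assumes you are fine with discarding any text inside the wrapper.

```python
from lxml import html


def unwrap_images(article):
    """Replace every direct child of `article` that contains <img> with those images.

    Hypothetical helper for illustration; text inside the wrappers is discarded.
    """
    # "./*[.//img]" selects direct children that have an <img> descendant.
    for wrapper in article.xpath("./*[.//img]"):
        for img in wrapper.xpath(".//img"):
            img.tail = None           # drop whitespace that trailed the image
            wrapper.addprevious(img)  # move the image out, just before the wrapper
        article.remove(wrapper)       # the wrapper no longer holds any image
    return article


# Made-up sample: one wrapper holding two images at different depths.
article = html.fromstring(
    "<article><p>intro</p>"
    "<div><span><img src='a.png'><b><img src='b.png'></b></span></div>"
    "<p>outro</p></article>"
)
unwrap_images(article)
print(html.tostring(article).decode())
```

Calling article.insert(article.index(wrapper), img) should have the same effect as addprevious() here; it is the position-aware counterpart to the append() mentioned in the question.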

Upvotes: 1

niko

Reputation: 5281

You could use xpath to check whether each child of the article contains an <img>, then serialize the result with tostring:

import lxml.html


root = lxml.html.fromstring("""
 <article> 
 <p></p>
 <br>
 <br>
 <strong></strong>
 <span>
   <div>
      <img src='http://something.com'>
   </div>
 </span>
 <p></p>
 </article>
""")

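# For each child of <article>, keep its <img> descendants if it has any,
# otherwise keep the child itself.
newroot = []
for child in root:
    imgs = child.xpath(".//img")
    newroot.extend(imgs or [child])

# Serialize the kept elements (the <article> wrapper itself is not included).
sourcecode = "".join(lxml.html.tostring(el).decode() for el in newroot)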
"""
<p></p>
 <br>
 <br>
 <strong></strong>
 <img src="http://something.com">
   <p></p>
"""

Upvotes: 1

Ajax1234

Reputation: 71451

You can use BeautifulSoup's insert() with recursion:

import bs4
from bs4 import BeautifulSoup as soup


def has_img(d):
    # Return the <img> if d is one, or if d is a chain of single-tag wrappers around one.
    if d.name == 'img':
        return d
    children = [i for i in getattr(d, 'contents', []) if isinstance(i, bs4.element.Tag)]
    return has_img(children[0]) if len(children) == 1 else None


def remove_wrapping(d):
    for i, a in enumerate(getattr(d, 'contents', [])):
        if a.name != 'img' and (img := has_img(a)) is not None:
            # Swap the wrapper for its image at the same position.
            a.extract()
            d.insert(i, img)
        else:
            remove_wrapping(a)

s = """
 <article> 
 <p></p>
 <br>
 <br>
 <strong></strong>
 <span>
   <div>
      <img src='http://something.com'>
   </div>
 </span>
 <p></p>
 </article>
"""
d = soup(s, 'html.parser').article
remove_wrapping(d)
print(d)

Output:

<article>
<p></p>
<br/>
<br/>
<strong></strong>
<img src="http://something.com"/>
<p></p>
</article>
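
Note that has_img() only follows wrappers that contain exactly one tag, so an image that has a sibling tag inside its wrapper is left where it is. If that case matters, a rough sketch using find_all() and insert_before() could look like this. The function name hoist_images and the sample markup are assumptions for illustration, and everything else inside the wrapper is dropped.

```python
from bs4 import BeautifulSoup


def hoist_images(article):
    """Replace each direct child tag of `article` that contains <img> with those images.

    Hypothetical helper for illustration; other content of the wrapper is dropped.
    """
    # find_all() returns a list, so the tree can be edited while looping over it.
    for child in article.find_all(recursive=False):
        imgs = child.find_all("img")
        if not imgs:
            continue  # no nested image: leave this child untouched
        for img in imgs:
            # Detach the image and put it right before the wrapper, in document order.
            child.insert_before(img.extract())
        child.decompose()  # remove the wrapper and whatever is left inside it


# Made-up sample: the wrapper holds two images plus a sibling <em>.
markup = (
    "<article><p>intro</p>"
    "<span><div><img src='a.png'><em>caption</em><img src='b.png'></div></span>"
    "<p>outro</p></article>"
)
article = BeautifulSoup(markup, "html.parser").article
hoist_images(article)
print(article)
```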

Upvotes: 0
