Sasha Kucher

Reputation: 21

Python - How to Remove (Delete) Unclosed Tags

I am looking for a way to remove unpaired opening tags. Both BS4 and lxml handle unpaired closing tags well: they simply drop them. But when they find an unclosed opening tag, they close it automatically, and they only close it at the very end of the document.

Example

from bs4 import BeautifulSoup
import lxml.html

codeblock = '<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>'

soup = BeautifulSoup(codeblock, "html.parser").prettify()
print(soup)

root = lxml.html.fromstring(codeblock)
res = lxml.html.tostring(root)
print(res)

Output bs4:

<strong>
 Good
</strong>
Some text and bad closed strong
Some text and bad open strong PROBLEM HERE
<strong>
 Some text
 <h2>
  Some
 </h2>
 or
 <h3>
  Some
 </h3>
 <p>
  Some Some text
  <strong>
   Good2
  </strong>
 </p>
</strong>

Output lxml:

b'<div><strong>Good</strong> Some text and bad closed strong  Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p></strong></div>'

  1. I would be fine if the tag were closed before the first following tag, which in this example is the <h2>:
     PROBLEM HERE <strong> Some text </strong><h2>Some</h2>
  2. I would also be OK with removing this open <strong> tag entirely.

The problem is that the tag gets closed at the very end instead.

In the real code, the position of the unclosed <strong> tag is not known in advance.

What are the solutions?

I tried to do this with BS4 and lxml, but could not get it to work. If you know a solution, please help!

Upvotes: 2

Views: 394

Answers (2)

Sasha Kucher

Reputation: 21

As a temporary solution, I decided to unwrap <strong> tags that have child tags (the tag itself is removed, its contents are kept):

from bs4 import BeautifulSoup

codeblock = '<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>'

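soup = BeautifulSoup(codeblock, "html.parser")

# find_all returns a list, so it is safe to modify the tree while looping
for item in soup.find_all('strong'):
    # find(True) returns the first descendant tag, or None if there are only text nodes
    if item.find(True) is not None:
        # drop the <strong> wrapper but keep everything inside it
        item.unwrap()
print(soup)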

Print:

<strong>Good</strong> Some text and bad closed strong  Some text and bad open strong PROBLEM HERE  Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>

If you see a better solution, please post it.
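One limitation of the check above is that it also unwraps a correctly closed <strong> that merely contains another inline tag, such as <em>. A possible refinement, sketched here under the assumption that an unclosed <strong> can be recognized by the block-level tags it swallows, is to unwrap only when a block-level descendant is present. The function name unwrap_inline_around_blocks and the BLOCK_TAGS list are hypothetical and should be adjusted to the markup at hand.

```python
from bs4 import BeautifulSoup

# Assumption: tags treated as block-level for this heuristic (not exhaustive).
BLOCK_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6",
              "p", "div", "ul", "ol", "table", "blockquote"]


def unwrap_inline_around_blocks(html, inline_name="strong"):
    """Unwrap every `inline_name` tag that contains a block-level descendant."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(inline_name):
        # find() accepts a list of tag names and searches all descendants
        if tag.find(BLOCK_TAGS) is not None:
            tag.unwrap()
    return str(soup)


sample = "<strong>a <em>b</em></strong> <strong>x <h2>y</h2>"
print(unwrap_inline_around_blocks(sample))
```

The first <strong> is left alone because it only contains an inline <em>, while the unclosed one is unwrapped because the parser stretched it over the <h2>.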

Upvotes: 0

Andrej Kesely

Reputation: 195553

One possible solution is to .unwrap() the second <strong> tag:

from bs4 import BeautifulSoup

codeblock = "<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>"

soup = BeautifulSoup(codeblock, "html.parser")
# index 1 is the second <strong> in this sample, i.e. the unclosed one
soup.select("strong")[1].unwrap()

print(soup.prettify())

Prints:

<strong>
 Good
</strong>
Some text and bad closed strong
Some text and bad open strong PROBLEM HERE
Some text
<h2>
 Some
</h2>
or
<h3>
 Some
</h3>
<p>
 Some Some text
 <strong>
  Good2
 </strong>
</p>
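Since the question says the position of the unclosed tag is not known, the hard-coded index may not carry over to real input. The following is a rough sketch of the first option from the question (close the tag before the first following tag) that does not depend on an index. It assumes that a block-level tag appearing as a direct child of <strong> signals an unclosed <strong>; the function name close_before_first_block and the BLOCK_TAGS set are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: tags treated as block-level for this heuristic (not exhaustive).
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6",
              "p", "div", "ul", "ol", "table", "blockquote"}


def close_before_first_block(html, inline_name="strong"):
    """End each `inline_name` tag just before its first block-level child."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(inline_name):
        # first direct child that is a block-level tag, if any
        first_block = next(
            (child for child in tag.children
             if isinstance(child, Tag) and child.name in BLOCK_TAGS),
            None,
        )
        if first_block is None:
            continue  # looks properly closed, leave it alone
        # snapshot the block and everything after it before moving anything
        tail = [first_block] + list(first_block.next_siblings)
        anchor = tag
        for node in tail:
            # detach the node and re-attach it right after the inline tag,
            # preserving the original order
            anchor.insert_after(node.extract())
            anchor = node
    return str(soup)


codeblock = "<strong>Good</strong> Some text and bad closed strong </strong> Some text and bad open strong PROBLEM HERE <strong> Some text <h2>Some</h2> or <h3>Some</h3> <p>Some Some text <strong>Good2</strong></p>"
print(close_before_first_block(codeblock))
```

Only direct children are inspected, so a block tag nested deeper (for example inside an <em> within the <strong>) would not be detected by this sketch.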

Upvotes: 1
