Reputation: 767
I have started learning BeautifulSoup. I am trying to remove a line containing a stray </div> from an HTML document.
Most examples in the documentation deal with whole tags (both the opening and the closing part).
Is it possible to modify just one part of a tag?
For example:
</div>
<div >Hello</div>
<div data-foo="value">foo!</div>
How can I remove just the first line of this code?
Upvotes: 3
Views: 1199
Reputation: 19154
You don't need to do anything; the parser repairs it automatically:
from bs4 import BeautifulSoup
html_doc = '''</div>
<div>World</div>
<div data-foo="value">foo!''' # also invalid, no closing
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)
Output:
<div>World</div>
<div data-foo="value">foo!</div>
unwrap() is for removing a tag, not for repairing one.
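To illustrate that last point, here is a minimal sketch of what unwrap() does to a well-formed tag, next to decompose() for comparison. The sample markup and the variable names (unwrapped, decomposed) are made up for this illustration and are not taken from the question.

```python
from bs4 import BeautifulSoup

html_doc = '<div><b>Hello</b> world</div>'  # hypothetical sample markup

# unwrap() removes the <b> tag itself but keeps its contents in place.
soup = BeautifulSoup(html_doc, 'html.parser')
soup.b.unwrap()
unwrapped = str(soup)
print(unwrapped)   # <div>Hello world</div>

# decompose() removes the tag together with everything inside it.
soup = BeautifulSoup(html_doc, 'html.parser')
soup.b.decompose()
decomposed = str(soup)
print(decomposed)  # <div> world</div>
```

Neither call has anything to act on for a stray closing tag, since such a tag never becomes an element in the parsed tree.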
Upvotes: 1
Reputation: 8047
You can use BeautifulSoup's unwrap() on the invalid tag, which will remove only the extra tags that have no opening/closing counterpart while retaining the others:
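from bs4 import BeautifulSoup
# html_doc holds the markup from the question
soup = BeautifulSoup(html_doc, 'html.parser')
invalid_tags = ['</div>']
for tag in invalid_tags:
    # find_all() is the current name for findAll()
    for match in soup.find_all(tag):
        match.unwrap()
print(soup)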
Result:
<div>Hello</div>
<div data-foo="value">foo!</div>
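As a quick check on where that result comes from, the sketch below parses the markup from the question and counts what the search for '</div>' actually matches. It assumes the html.parser backend used in this thread; the names matches and cleaned are only for illustration. In this check the search appears to match nothing, which would mean the stray closing tag is dropped by the parser itself rather than by unwrap().

```python
from bs4 import BeautifulSoup

html_doc = '''</div>
<div>Hello</div>
<div data-foo="value">foo!</div>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# find_all() matches on tag names; no element is named '</div>',
# so there is nothing here for unwrap() to be called on.
matches = soup.find_all('</div>')
print(len(matches))

# The unmatched closing tag is already absent from the parsed tree.
cleaned = str(soup).strip()
print(cleaned)
```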
Upvotes: 3