Chris

Reputation: 767

python - beautifulsoup - removing a line of code

I have started learning BeautifulSoup. I am trying to remove a line containing a stray </div> from an HTML document.

Most examples in the documentation deal with whole tags (both the opening and the closing part).
Is it possible to modify just one part of a tag? For example:

</div>
<div >Hello</div>
<div data-foo="value">foo!</div>


How can I remove just the first line of this code?

Upvotes: 3

Views: 1199

Answers (2)

ewwink

Reputation: 19154

You don't need to do anything; the parser repairs it automatically:

from bs4 import BeautifulSoup

html_doc = '''</div> 
<div>World</div>
<div data-foo="value">foo!''' # also invalid: this <div> is never closed

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)

Output:

<div>World</div>
<div data-foo="value">foo!</div>

unwrap() is for removing a tag (while keeping its contents), not for repairing one.
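
To illustrate the difference, here is a minimal sketch (the sample markup is made up for the example, not taken from the question). It shows unwrap() removing a tag while keeping its contents, and html.parser dropping an unmatched closing tag on its own:

```python
from bs4 import BeautifulSoup

# unwrap() removes the <b> element itself but keeps the text inside it.
soup = BeautifulSoup('<p>Hello <b>bold</b> world</p>', 'html.parser')
soup.b.unwrap()
unwrapped = str(soup)
print(unwrapped)  # <p>Hello bold world</p>

# A closing tag with no matching opening tag is ignored during parsing,
# so there is nothing left to remove afterwards.
repaired = str(BeautifulSoup('</div><div>Hello</div>', 'html.parser'))
print(repaired)  # <div>Hello</div>
```

Other parsers (lxml, html5lib) may repair invalid markup differently, so the exact output can depend on which parser is passed to BeautifulSoup.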

Upvotes: 1

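You can try BeautifulSoup's unwrap(), which removes a tag while keeping its contents. Note, though, that find_all() matches tag names such as 'div', not raw markup such as '</div>', so the loop below finds nothing to unwrap; the stray closing tag is already dropped by html.parser when the document is parsed:

from bs4 import BeautifulSoup

# markup from the question
html_doc = '''</div>
<div >Hello</div>
<div data-foo="value">foo!</div>'''

soup = BeautifulSoup(html_doc, 'html.parser')

invalid_tags = ['</div>']

for tag in invalid_tags:
    # find_all() is the current name for the deprecated findAll() alias
    for match in soup.find_all(tag):
        match.unwrap()

print(soup)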

result:

<div>Hello</div>
<div data-foo="value">foo!</div>
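
A quick way to see what that loop actually matches is to inspect find_all() directly. This is only a sketch: html_doc is reconstructed here from the markup in the question (joined onto one line), since the answer does not define it.

```python
from bs4 import BeautifulSoup

# Assumed input: the question's markup, with the stray </div> first.
html_doc = '</div><div>Hello</div><div data-foo="value">foo!</div>'
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all() compares its argument against tag names, so the literal
# string '</div>' matches no element in the tree.
matches = soup.find_all('</div>')
print(len(matches))  # 0

# The two real <div> elements are found by their tag name.
div_count = len(soup.find_all('div'))
print(div_count)  # 2
```

Because the list of matches is empty, unwrap() is never called; the clean result comes from the parser discarding the unmatched closing tag.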

Upvotes: 3
