Reputation: 87
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<div><span id="company">Apple</span> Chats:</div>
<div>abcdefg<span>xvfdadsad</span>sdfsdfsdf</div>
<div>
<li>(<span>7</span>sadsafasf<span>vdvdsfdsfds</span></li>
<li>(<span>8</span>) <span>Reim</span></li>
</div>
<!-- Ad -->
<a href="#">
I want to remove all contents between the two comment lines using bs4 and make the file into something like:
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">
Upvotes: 2
Views: 463
Reputation: 56885
First of all, be careful with snippets of HTML taken out of context. If you print your soupified snippet, you'll get:
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<html>
<body>
<div>
<span id="company">
...
Whoops: BS added the comment above the <html> tag. That is pretty clearly not your intent, since an algorithm that removes the elements between the two comments would inevitably remove the entire document (that's why including your code is important...).
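One way to check how a parser treats such a fragment is to parse it and look at the first node. The sketch below assumes Python's built-in "html.parser", which does not wrap a fragment in <html>/<body>; the shortened `snippet` string is illustrative, not your full file.

```python
from bs4 import BeautifulSoup, Comment

# Shortened, illustrative fragment that starts with a comment.
snippet = "<!-- start --><div>Apple</div><!-- Ad --><a href='#'></a>"

# "html.parser" builds the tree from the fragment as written,
# without adding <html> or <body> wrappers.
soup = BeautifulSoup(snippet, "html.parser")

first_node = soup.contents[0]
print(type(first_node).__name__)  # Comment
print(soup.find("html"))          # None
```

With this parser the comment stays a sibling of the elements that follow it, so "everything between the two comments" is well defined.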
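As for the main task, element.decompose() or element.extract() will remove an element from the tree (a minor subtlety: extract() returns the removed element, decompose() destroys it). Elements to be removed during a walk need to be kept in a separate list and removed after the traversal ends, because removing them mid-walk changes the links the traversal is following.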
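A minimal sketch of the difference between the two calls, on a made-up one-line document (the markup and variable names are only for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>keep</p><span>a</span><b>b</b></div>", "html.parser")

# extract() detaches the element and returns it, so it can be reused.
span = soup.find("span").extract()
print(span)    # <span>a</span>

# decompose() detaches the element and destroys it; it returns None.
result = soup.find("b").decompose()
print(result)  # None

print(soup)    # <div><p>keep</p></div>
```

Use extract() if you still need the removed node afterwards, decompose() if you just want it gone.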
from bs4 import BeautifulSoup, Comment
html = """
<body>
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<div><span id="company">Apple</span> Chats:</div>
<div>abcdefg<span>xvfdadsad</span>sdfsdfsdf</div>
<div>
<li>(<span>7</span>sadsafasf<span>vdvdsfdsfds</span></li>
<li>(<span>8</span>) <span>Reim</span></li>
</div>
<!-- Ad -->
<a href="#">
"""
start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
to_extract = []
between_comments = False
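# .descendants is the current name for recursiveChildGenerator()
for x in soup.descendants:
    # Collect only tags; comments and text nodes are str subclasses
    if between_comments and not isinstance(x, str):
        to_extract.append(x)
    if isinstance(x, Comment):
        if start_comment == x:
            between_comments = True
        elif end_comment == x:
            break
# Remove only after the traversal has finished
for x in to_extract:
    x.decompose()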
print(soup.prettify())
Output:
<html>
<body>
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">
</a>
</body>
</html>
Note that if the ending comment isn't at the same level as the starting comment, this will destroy all parent elements of the ending comment. If you don't want that, you'll need to walk back up the parent chain until you reach the level of the starting comment.
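Here is a rough sketch of that parent-chain walk, assuming the ending comment sits somewhere under the starting comment's parent. The HTML, the `find_comment` helper, and the `boundary` name are made up for illustration. It deliberately leaves the ancestor that contains the ending comment untouched, which may keep more than you want.

```python
from bs4 import BeautifulSoup, Comment

html = """<body>
<!-- start -->
<div>remove me</div>
<section><p>before end</p><!-- end --><p>after end</p></section>
<a href="#">keep</a>
</body>"""

soup = BeautifulSoup(html, "html.parser")


def find_comment(root, text):
    """Hypothetical helper: first comment whose stripped text equals `text`."""
    return root.find(string=lambda s: isinstance(s, Comment) and s.strip() == text)


start = find_comment(soup, "start")
end = find_comment(soup, "end")

# Walk up from the ending comment until we reach the node that is
# a sibling of the starting comment (same parent).
boundary = end
while boundary is not None and boundary.parent is not start.parent:
    boundary = boundary.parent
if boundary is None:
    raise ValueError("ending comment is not under the starting comment's parent")

# Collect first, remove after the traversal ends.
to_remove = []
for sibling in start.next_siblings:
    if sibling is boundary:
        break
    to_remove.append(sibling)
for node in to_remove:
    node.extract()

print(soup)
```

Because it only removes siblings of the starting comment, the <section> holding the ending comment survives intact instead of being destroyed.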
Another solution using .find() and .next, an alias of .next_element (same imports, HTML string, and output as above):
start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
# string= is the current name of the older text= argument
el = soup.find(string=lambda x: isinstance(x, Comment) and start_comment == x)
end = el.find_next(string=lambda x: isinstance(x, Comment) and end_comment == x)
to_extract = []
while el and end and el is not end:
    if not isinstance(el, str):
        to_extract.append(el)
    el = el.next
for x in to_extract:
    x.decompose()
print(soup.prettify())
Upvotes: 1
Reputation: 20008
You can remove the divs using the .decompose() method. Since the comments are of type Comment rather than Tag, tag searches such as find_all() won't return them, so you only need to deal with the divs:
# Find all the elements after the tag with `id="company"`
for tag in soup.find("span", id="company").next_elements:
    # Break once we encounter an `a` since all the comments have finished
    if tag.name == "a":
        break
    else:
        try:
            tag.previous_sibling.decompose()
        except AttributeError:
            continue
print(soup.prettify())
Output:
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">
</a>
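If you do need the comments themselves, they can be found with a string filter. A small sketch on a made-up fragment (the markup is illustrative only):

```python
from bs4 import BeautifulSoup, Comment

html = "<!-- first --><div>text</div><!-- Ad --><a href='#'></a>"
soup = BeautifulSoup(html, "html.parser")

# A tag search returns only Tag objects, so the comments are skipped.
tags = [t.name for t in soup.find_all(True)]
print(tags)      # ['div', 'a']

# A string filter that checks for the Comment type returns the comments.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]
print(comment_texts)  # [' first ', ' Ad ']
```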
Upvotes: 0