Reputation: 109
I'm trying to use BeautifulSoup to extract the HTML tags and delete the text. For example, take this HTML:
html_page = """
<html>
<body>
<table>
<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
<tr><td>Vestibulum Auctor Dapibus neque</td></tr>
</table>
</body>
</html>
"""
The desired result is:
<html>
<body>
<table>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
</table>
</body>
</html>
Here's what I've got so far:
def get_tags(soup):
    copy_soup = soup
    for tag in copy_soup.findAll(True):
        tag.attrs = {}  # removes attributes of a tag
        tag.string = ''
    return copy_soup
print(get_tags(soup))
Using tag.attrs = {} works for removing all tag attributes. But when I try using tag.string or tag.clear() I'm just left with <html></html>. I understand that what is probably happening is that, on the first iteration, tag.string or tag.clear() removes all contents within the html tags.
I'm unsure how to remedy this. Perhaps recursively delete text from children first? Or is there a simpler approach I'm missing?
Upvotes: 2
Views: 3318
Reputation: 71
Actually, I was able to delete the text by recursively updating each tag's children. You can also update their attributes in the same recursion.
from bs4 import BeautifulSoup
from bs4.element import NavigableString
def delete_displayed_text(element):
    """
    delete displayed text from beautiful soup tag element object recursively
    :param element: beautiful soup tag element object
    :return: beautiful soup tag element object
    """
    new_children = []
    for child in element.contents:
        if not isinstance(child, NavigableString):
            new_children.append(delete_displayed_text(child))
    element.contents = new_children
    return element
if __name__ == '__main__':
    html_code_sample = '<div class="hello">I am not supposed to be displayed<a>me neither</a></div>'
    soup = BeautifulSoup(html_code_sample, 'html.parser')
    soup = delete_displayed_text(soup)
    cleaned_soup = BeautifulSoup(str(soup), 'html.parser')
    print(cleaned_soup.getText())
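Assigning a new list to .contents bypasses Beautiful Soup's own bookkeeping (parent, sibling and next-element links), so the result may depend on the installed version. As a hedged alternative, here is a minimal sketch of the same recursion that detaches text nodes with extract() instead. The name delete_text_nodes is made up for this illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def delete_text_nodes(element):
    """Recursively detach every text node below `element`, keeping the tags."""
    # Iterate over a snapshot, because extract() mutates element.contents.
    for child in list(element.contents):
        if isinstance(child, NavigableString):
            child.extract()           # remove the text node from the tree
        else:
            delete_text_nodes(child)  # descend into child tags
    return element


html_code_sample = '<div class="hello">I am not supposed to be displayed<a>me neither</a></div>'
soup = delete_text_nodes(BeautifulSoup(html_code_sample, 'html.parser'))
print(soup)
```

Attributes are left alone here; setting child.attrs = {} in the else branch would strip them in the same pass.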
Upvotes: 4
Reputation: 473763
You cannot simply reset .string to an empty string since, if an element has a single child with text, like the tr elements in your example, you would unintentionally remove the td elements from the tree.
You cannot use .clear() since it recursively removes all the child nodes as well.
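To make the two pitfalls concrete, here is a small sketch on a one-row table. The snippet and variable names are made up for illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

snippet = "<table><tr class=tb1><td>Lorem Ipsum</td></tr></table>"

# Assigning to .string replaces everything inside the tag, child tags included.
soup = BeautifulSoup(snippet, "html.parser")
soup.tr.string = ""
after_string = str(soup)

# clear() likewise drops every child node, tags and text alike.
soup = BeautifulSoup(snippet, "html.parser")
soup.tr.clear()
after_clear = str(soup)

print(after_string)
print(after_clear)
```

In both cases the tr survives but its td does not, which is why applying either call to every tag, starting from html, leaves an empty document.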
I don't recall a built-in way to get the HTML tree structure without the data in BeautifulSoup. I'd use the following approach:
for elm in soup.find_all():
    if not elm.find(recursive=False):  # if not children
        elm.string = ''
    elm.attrs = {}
Here we are resetting the .string only if there are no children.
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> html_page = """
... <html>
... <body>
... <table>
... <tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
... <tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
... <tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
... <tr><td>Vestibulum Auctor Dapibus neque</td></tr>
... </table>
... </body>
... </html>
... """
>>>
>>> soup = BeautifulSoup(html_page, "html.parser")
>>> for elm in soup.find_all():
...     if not elm.find(recursive=False):
...         elm.string = ''
...     elm.attrs = {}
...
>>> print(soup.prettify())
<html>
 <body>
  <table>
   <tr>
    <td>
    </td>
   </tr>
   <tr>
    <td>
    </td>
   </tr>
   <tr>
    <td>
    </td>
   </tr>
   <tr>
    <td>
    </td>
   </tr>
  </table>
 </body>
</html>
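Note that the loop modifies soup in place, and that copy_soup = soup in the question only binds a second name to the same tree. If the original document is still needed, one option is to re-parse its string form and strip that instead. The sketch below wraps the same loop in a function; the name strip_text_and_attrs is hypothetical, and html.parser is assumed.

```python
from bs4 import BeautifulSoup


def strip_text_and_attrs(soup):
    """Return a stripped re-parsed copy of `soup`; the input is left untouched."""
    # Round-tripping through str() yields an independent tree.
    skeleton = BeautifulSoup(str(soup), "html.parser")
    for elm in skeleton.find_all():
        if not elm.find(recursive=False):  # leaf tag: no child tags
            elm.string = ""
        elm.attrs = {}                     # drop attributes on every tag
    return skeleton


html_page = "<table><tr class=tb1><td>Lorem Ipsum</td></tr><tr><td>Aliquam</td></tr></table>"
original = BeautifulSoup(html_page, "html.parser")
skeleton = strip_text_and_attrs(original)
print(skeleton)
```

One limitation carries over from the loop: a tag that mixes text with child tags keeps its own text, because only leaf tags have .string reset.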
Upvotes: 3