hannahbanana2.0

Reputation: 109

beautiful soup extract tags delete text

I'm trying to use BeautifulSoup to extract the HTML tags and delete the text. For example, take this HTML:

html_page = """
<html>
<body>
<table>
<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
<tr><td>Vestibulum Auctor Dapibus neque</td></tr>
</table>
</body>
</html>
"""

The desired result is:

<html>
<body>
<table>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
</table>
</body>
</html>

Here's what I've got so far:

def get_tags(soup):
    copy_soup = soup
    for tag in copy_soup.findAll(True):
        tag.attrs = {} # removes attributes of a tag
        tag.string = ''

    return copy_soup

print get_tags(soup)

Using tag.attrs = {} works for removing all tag attributes. But when I try setting tag.string = '' or calling tag.clear(), I'm just left with <html></html>. I assume that the first iteration, which visits the html tag, removes everything inside it.

I'm unsure how to remedy this. Perhaps recursively delete text from children first? Or is there a simpler approach I'm missing?

Upvotes: 2

Views: 3318

Answers (2)

Slim Frikha

Reputation: 71

Actually, I was able to delete the text by recursively updating each tag's children. You can also update their attributes in the same recursion.

from bs4 import BeautifulSoup
from bs4.element import NavigableString

def delete_displayed_text(element):
    """
    Recursively delete the displayed text from a Beautiful Soup element.
    :param element: Beautiful Soup tag (or soup) object
    :return: the same object, with its text nodes removed
    """
    # iterate over a copy, because extract() mutates element.contents
    for child in list(element.contents):
        if isinstance(child, NavigableString):
            # extract() detaches the node and keeps the tree's links consistent;
            # assigning to element.contents directly bypasses that bookkeeping
            child.extract()
        else:
            delete_displayed_text(child)
    return element

if __name__ == '__main__':
    html_code_sample = '<div class="hello">I am not supposed to be displayed<a>me neither</a></div>'
    soup = BeautifulSoup(html_code_sample, 'html.parser')
    soup = delete_displayed_text(soup)
    print(soup)            # the tags are still there
    print(soup.getText())  # no text is left
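
The answer mentions that attributes can be updated in the same recursion but does not show it. Here is a minimal sketch of that idea, not the answerer's exact code: the function name strip_text_and_attrs is made up for illustration, and it assumes text nodes are detached with extract() so that no re-parse is needed afterwards.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def strip_text_and_attrs(element):
    """Hypothetical helper: drop text nodes and clear attributes, in place."""
    # Iterate over a copy, since extract() mutates element.contents.
    for child in list(element.contents):
        if isinstance(child, NavigableString):
            child.extract()              # detach the text node from the tree
        elif isinstance(child, Tag):
            child.attrs = {}             # clear the attributes of the child tag
            strip_text_and_attrs(child)  # then descend into it
    return element


sample = '<div class="hello">I am not supposed to be displayed<a>me neither</a></div>'
soup = strip_text_and_attrs(BeautifulSoup(sample, 'html.parser'))
print(soup)
```

Note that this removes every text node, including whitespace between tags, so the output comes out on a single line.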

Upvotes: 4

alecxe

Reputation: 473763

You cannot simply reset .string to an empty string: setting .string replaces all of an element's children, so for elements like the tr elements in your example you would unintentionally remove the td elements from the tree.

You cannot use .clear() since it recursively removes all the child nodes as well.
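
To see both effects in isolation, here is a small sketch on a one-row table. The shortened markup and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

markup = "<table><tr><td>Lorem</td></tr></table>"

# Setting .string on <tr> replaces all of its children, so <td> is lost.
soup = BeautifulSoup(markup, "html.parser")
soup.find("tr").string = ""
after_string = str(soup)

# .clear() on <table> removes everything nested inside it.
soup = BeautifulSoup(markup, "html.parser")
soup.find("table").clear()
after_clear = str(soup)

print(after_string)
print(after_clear)
```

The same thing happens in the question's loop: the html tag is visited first, so emptying it discards the whole tree below it.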

I don't recall a built-in way to get the HTML tree structure without the data in BeautifulSoup - I'd use the following approach:

for elm in soup.find_all():
    if not elm.find(recursive=False):  # no child elements
        elm.string = ''
    elm.attrs = {}

Here we are resetting the .string only if there are no children.
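
One case the loop above leaves alone is mixed content: an element that has both text and child tags keeps its text, since .string is only reset on childless elements. If that matters, a possible alternative is to detach the text nodes themselves. This is only a sketch under a few assumptions: the helper name keep_structure_only is made up, the second table row is invented to show mixed content, and whitespace-only text nodes are deliberately kept so the line layout of the markup survives.

```python
from bs4 import BeautifulSoup


def keep_structure_only(soup):
    """Hypothetical helper: remove visible text and attributes, keep the tags."""
    # find_all() builds its result list up front, so detaching nodes
    # while looping over it is safe.
    for text_node in soup.find_all(string=True):
        if text_node.strip():        # skip whitespace-only formatting nodes
            text_node.extract()
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup


html_page = """<table>
<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
<tr><td>Mixed <b>content</b> here</td></tr>
</table>"""

result = str(keep_structure_only(BeautifulSoup(html_page, "html.parser")))
print(result)
```

Comments and similar nodes are string nodes as well, so they are removed by the same loop.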

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> html_page = """
... <html>
... <body>
... <table>
... <tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
... <tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
... <tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
... <tr><td>Vestibulum Auctor Dapibus neque</td></tr>
... </table>
... </body>
... </html>
... """
>>> 
>>> soup = BeautifulSoup(html_page, "html.parser")
>>> for elm in soup.find_all():
...     if not elm.find(recursive=False):
...         elm.string = ''
...     elm.attrs = {}
... 
>>> print(soup.prettify())
<html>
 <body>
  <table>
   <tr>
    <td>
    </td>
   </tr>
   <tr>
    <td>
    </td>
   </tr>
   <tr>
    <td>
    </td>
   </tr>
   <tr>
    <td>
    </td>
   </tr>
  </table>
 </body>
</html>

Upvotes: 3
