Delete a certain tag with a certain id content from an HTML using python BeautifulSoup

Question

I got a suggestion to use BeautifulSoup to delete a tag with a certain id from an HTML. For example, deleting

...

Below is my code, but it doesn't seem to be working correctly:

import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an extension with HTML
Files = os.listdir(cwd)
print Files

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
                matches  = str(soup.find_all("div", id="jp-post-flair"))
                #The soup.find_all part should be correct as I tested it to             
                #print the matches and the result matches the texts I want to delete.
                f.write(f.read().replace(matches,''))
                #maybe the above line isn't correct
            f.close()
func(file)

Could you help me find which part of the code is wrong and suggest how I should approach it? Thank you very much!

Josh Crozier · Accepted Answer

You can use the .decompose() method to remove the element/tag:

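f = open(file, "r+")

soup = BeautifulSoup(f, 'html.parser')
elements = soup.find_all("div", id="jp-post-flair")
for element in elements:
    element.decompose()

# Parsing read the file to its end, so rewind before writing
# and truncate in case the new content is shorter than the old.
f.seek(0)
f.write(str(soup))
f.truncate()
f.close()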
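To illustrate why the string-replace approach in the question has no effect, and what .decompose() does instead, here is a small in-memory sketch. The sample HTML is made up for illustration and is not taken from the asker's files.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from the question's files.
html = (
    '<html><body><p>Keep me</p>'
    '<div id="jp-post-flair"><a href="#">Share</a></div>'
    '</body></html>'
)

soup = BeautifulSoup(html, "html.parser")

# str() of a find_all() result is the repr of a list, so it is wrapped in
# square brackets and does not appear verbatim in the original document.
matches = str(soup.find_all("div", id="jp-post-flair"))
unchanged = html.replace(matches, "")  # nothing is replaced

# decompose() removes the tag and its children from the parse tree in place.
for element in soup.find_all("div", id="jp-post-flair"):
    element.decompose()

result = str(soup)
print(result)
```

In the question's code there is a second problem as well: BeautifulSoup has already read the file to its end, so the later f.read() returns an empty string.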

It's also worth mentioning that you can probably just use the .find() method because an id attribute should be unique within a document (which means that there will likely only be one element in most cases):

f = open(file, "r+")

soup = BeautifulSoup(f, 'html.parser')
element = soup.find("div", id="jp-post-flair")
if element:
    element.decompose()

# Rewind before writing and truncate any leftover old content.
f.seek(0)
f.write(str(soup))
f.truncate()
f.close()
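Putting the pieces together, a self-contained version might look like the sketch below. The helper name remove_element_by_id, the UTF-8 encoding, and the sample file are assumptions made for this example, not part of the original answer.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def remove_element_by_id(path, element_id):
    """Remove the first element with the given id from an HTML file, in place.

    Hypothetical helper; returns True if an element was removed.
    """
    # Assumes the file is UTF-8 encoded.
    with open(path, "r+", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
        element = soup.find(id=element_id)
        if element is None:
            return False
        element.decompose()
        # Parsing left the file position at the end of the file, so rewind
        # before writing and truncate in case the new content is shorter.
        f.seek(0)
        f.write(str(soup))
        f.truncate()
    return True


# Demo on a throwaway file (the sample content is made up).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "post.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<p>Keep me</p><div id="jp-post-flair">Share</div>')
    removed = remove_element_by_id(path, "jp-post-flair")
    with open(path, encoding="utf-8") as f:
        cleaned = f.read()
    print(removed, cleaned)
```

Without the seek(0) and truncate() calls, a file opened in "r+" mode would end up with the modified document appended after the original one.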

As an alternative, based on the comments below:

If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that allows you to selectively parse parts of the document.
You mentioned that the indentation and formatting of the HTML file were being changed. Rather than converting the soup object directly into a string, you can check the output formatting section of the documentation.

Depending on the desired output, here are a few potential options:
- soup.prettify(formatter="minimal")
- soup.prettify(formatter="html")
- soup.prettify(formatter=None)
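As a rough comparison of these options, the sketch below serializes one made-up snippet in several ways. Note that prettify() always re-indents the output; if only the escaping should change, passing the same formatter argument to soup.decode() is one way to avoid that.

```python
from bs4 import BeautifulSoup

# Made-up sample containing a non-ASCII character and an escaped ampersand.
html = '<div><p>caf\u00e9 &amp; tea</p></div>'
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)                            # no re-indentation
pretty = soup.prettify(formatter="minimal")    # one tag per line, indented
entities = soup.prettify(formatter="html")     # uses named HTML entities
raw = soup.prettify(formatter=None)            # no escaping at all

# decode() accepts the same formatter argument without re-indenting.
compact_entities = soup.decode(formatter="html")

for label, text in [("str", compact), ("minimal", pretty),
                    ("html", entities), ("None", raw)]:
    print(label, repr(text))
```

With formatter=None the ampersand is written out unescaped, so the result may no longer be valid HTML.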
