Penny

Reputation: 1280

Delete a tag with a certain id from an HTML file using Python BeautifulSoup

I got a suggestion to use BeautifulSoup to delete a tag with a certain id from an HTML file, for example <div id=needDelete>...</div>. Below is my code, but it doesn't seem to work correctly:

import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an extension with HTML
Files = os.listdir(cwd)
print Files

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
                matches  = str(soup.find_all("div", id="jp-post-flair"))
                #The soup.find_all part should be correct as I tested it to             
                #print the matches and the result matches the texts I want to delete.
                f.write(f.read().replace(matches,''))
                #maybe the above line isn't correct
            f.close()
func(file)

Could you help me find which part of the code is wrong and suggest how I should approach this? Thank you very much!

Upvotes: 0

Views: 2538

Answers (1)

Josh Crozier

Reputation: 240928

You can use the .decompose() method to remove the element/tag:

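f = open(file, "r+")

soup = BeautifulSoup(f, 'html.parser')
elements = soup.find_all("div", id="jp-post-flair")
for element in elements:
  element.decompose()

# Parsing leaves the file position at the end, so rewind before writing
# and truncate afterwards; otherwise the new HTML is appended to the old.
f.seek(0)
f.write(str(soup))
f.truncate()
f.close()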

Because an id attribute should be unique within a document, there is usually only one matching element, so you can probably just use the .find() method instead:

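f = open(file, "r+")

soup = BeautifulSoup(f, 'html.parser')
element = soup.find("div", id="jp-post-flair")
if element:
  element.decompose()

# Rewind and truncate so the file is overwritten rather than appended to.
f.seek(0)
f.write(str(soup))
f.truncate()
f.close()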
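
Putting this together with the directory loop from the question, a minimal sketch might look like the following. The function names remove_element_by_id and clean_directory are hypothetical, and the sketch assumes the files are UTF-8 encoded. It reads and writes in two separate steps, so the file position never has to be managed by hand:

```python
import os
from bs4 import BeautifulSoup


def remove_element_by_id(path, tag_name, element_id):
    """Remove every <tag_name id=element_id> from the file at path.

    Returns the number of elements removed. Assumes UTF-8 encoding.
    """
    # Read and parse first, then close the file.
    with open(path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    removed = 0
    for element in soup.find_all(tag_name, id=element_id):
        element.decompose()  # drops the tag and everything inside it
        removed += 1

    # Rewrite the file only if something actually changed.
    if removed:
        with open(path, "w", encoding="utf-8") as f:
            f.write(str(soup))
    return removed


def clean_directory(directory, tag_name="div", element_id="jp-post-flair"):
    """Apply remove_element_by_id to every .html file in directory."""
    for name in os.listdir(directory):
        if name.endswith(".html"):
            path = os.path.join(directory, name)
            count = remove_element_by_id(path, tag_name, element_id)
            print(f"{name}: removed {count} element(s)")
```

Calling clean_directory(os.getcwd()) would then process the current working directory, as the code in the question does.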

As an alternative, based on the comments below:

  • If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that allows you to selectively parse parts of the document.

  • You mentioned that the indentation and formatting of the HTML file were being changed. Rather than converting the soup object directly into a string, check the output formatting section of the documentation.

    Depending on the desired output, here are a few potential options:

    • soup.prettify(formatter="minimal")
    • soup.prettify(formatter="html")
    • soup.prettify(formatter=None)
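
As a rough illustration of the two points above, here is a small sketch. The html_doc string is made up for the example. Note that parse_only keeps only the matching parts of the document, so SoupStrainer is useful for inspecting the target element but not for rewriting the whole file:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample document for illustration only.
html_doc = (
    '<html><body><p>caf&eacute; &amp; bar</p>'
    '<div id="jp-post-flair"><span>share</span></div></body></html>'
)

# Full parse, then remove the unwanted element.
soup = BeautifulSoup(html_doc, "html.parser")
soup.find("div", id="jp-post-flair").decompose()

# No re-indentation; only the characters &, < and > are escaped.
compact = soup.decode(formatter="minimal")

# Re-indented output; characters are written as named HTML entities where possible.
pretty = soup.prettify(formatter="html")

# Selective parse: only the matching <div> and its contents are kept.
only_flair = SoupStrainer("div", id="jp-post-flair")
partial = BeautifulSoup(html_doc, "html.parser", parse_only=only_flair)
```

Since prettify() always re-indents the tree, str(soup) or soup.decode() is probably the closer fit when the goal is to disturb the original layout as little as possible.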

Upvotes: 3
