Reputation: 1280
I got a suggestion to use BeautifulSoup to delete a tag with a certain id from an HTML file, for example deleting <div id=needDelete>...</div>.
Below is my code, but it doesn't seem to be working correctly:
import os, re
from bs4 import BeautifulSoup

cwd = os.getcwd()
print ('Now you are at this directory: \n' + cwd)

# find files that have an extension with HTML
Files = os.listdir(cwd)
print Files

def func(file):
    for file in os.listdir(cwd):
        if file.endswith('.html'):
            print ('HTML files are \n' + file)
            f = open(file, "r+")
            soup = BeautifulSoup(f, 'html.parser')
            matches = str(soup.find_all("div", id="jp-post-flair"))
            #The soup.find_all part should be correct as I tested it to
            #print the matches and the result matches the texts I want to delete.
            f.write(f.read().replace(matches,''))
            #maybe the above line isn't correct
            f.close()

func(file)
Could you help me find which part of the code is wrong and suggest how I should approach it? Thank you very much!
Upvotes: 0
Views: 2538
Reputation: 240928
You can use the .decompose() method to remove the element/tag:
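f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')
elements = soup.find_all("div", id="jp-post-flair")
for element in elements:
    element.decompose()
# Parsing leaves the file position at the end, so rewind and truncate
# first; otherwise the output is appended after the old content.
f.seek(0)
f.truncate()
f.write(str(soup))
f.close()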
It's also worth mentioning that you can probably just use the .find() method, because an id attribute should be unique within a document (which means there will usually be only one matching element):
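f = open(file, "r+")
soup = BeautifulSoup(f, 'html.parser')
element = soup.find("div", id="jp-post-flair")
# .find() returns None when nothing matches
if element is not None:
    element.decompose()
# Rewind and truncate so the new markup replaces the old content
f.seek(0)
f.truncate()
f.write(str(soup))
f.close()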
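To tie this back to the loop in the question, here is a minimal sketch that separates the removal from the file handling. The helper names remove_div_by_id and clean_html_files are made up for illustration, and UTF-8 is an assumed file encoding. Reading the whole file first and then writing it back sidesteps the "r+" file-position issue entirely:

```python
from pathlib import Path

from bs4 import BeautifulSoup


def remove_div_by_id(html, div_id):
    """Return html with every <div id=div_id> removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all("div", id=div_id):
        element.decompose()
    return str(soup)


def clean_html_files(directory, div_id):
    """Rewrite each .html file in directory without the matching <div>."""
    for path in Path(directory).glob("*.html"):
        # Assumption: the files are UTF-8 encoded.
        original = path.read_text(encoding="utf-8")
        path.write_text(remove_div_by_id(original, div_id), encoding="utf-8")


sample = '<p>keep</p><div id="jp-post-flair"><span>drop</span></div><p>also keep</p>'
print(remove_div_by_id(sample, "jp-post-flair"))
```

Calling clean_html_files(".", "jp-post-flair") would then process every .html file in the current directory. It overwrites the files in place, so it is worth trying on a copy first.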
As an alternative, based on the comments below:
If you only want to parse and modify part of the document, BeautifulSoup has a SoupStrainer class that allows you to selectively parse parts of the document.
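A small sketch of what that looks like, using a made-up html_doc string in place of a real file. Note that the resulting soup contains only the matched parts, so it is useful for inspecting or extracting the element, but writing it back to the file would drop everything else:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample document for illustration.
html_doc = '<p>intro</p><div id="jp-post-flair">flair</div><p>outro</p>'

# Only tags matching the strainer are parsed; the rest is skipped.
only_flair = SoupStrainer("div", id="jp-post-flair")
partial = BeautifulSoup(html_doc, "html.parser", parse_only=only_flair)

print(partial)
```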
You mentioned that the indentation and formatting in the HTML file were being changed. Rather than just converting the soup object directly into a string, you can check out the relevant output formatting section in the documentation.
Depending on the desired output, here are a few potential options:
soup.prettify(formatter="minimal")
soup.prettify(formatter="html")
soup.prettify(formatter=None)
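As a rough illustration of how these differ, the sketch below runs each formatter on a tiny made-up snippet. Keep in mind that .prettify() re-indents the markup with one tag per line; if you would rather not re-indent, .decode() appears to accept the same formatter argument and serializes the tree without pretty-printing:

```python
from bs4 import BeautifulSoup

# Made-up snippet containing a non-ASCII character and an escaped ampersand.
soup = BeautifulSoup("<p>caf\u00e9 &amp; bar</p>", "html.parser")

minimal = soup.prettify(formatter="minimal")  # escapes only what valid HTML needs
as_html = soup.prettify(formatter="html")     # also converts to named entities where possible
raw = soup.prettify(formatter=None)           # no escaping at all; can yield invalid HTML

# Same formatter choice, but without the re-indentation that prettify() applies.
compact = soup.decode(formatter="minimal")

for label, text in [("minimal", minimal), ("html", as_html), ("None", raw), ("decode", compact)]:
    print(label, repr(text))
```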
Upvotes: 3