Reputation: 83
I'd like to know if there's a library or some method in Python to extract an element from an HTML document. For example:
I have this document:
<html>
<head>
...
</head>
<body>
<div>
...
</div>
</body>
</html>
I want to remove the <div></div> tag block, along with its contents, from the document, so that the result looks like this:
<html>
<head>
...
</head>
<body>
</body>
</html>
Upvotes: 3
Views: 4562
Reputation: 302
You don't need a library for this. Just use built-in string methods.
def removeOneTag(text, tag):
    # Cut from the first "<tag>" through the first "</tag>"; assumes both are present.
    return text[:text.find("<"+tag+">")] + text[text.find("</"+tag+">") + len(tag)+3:]
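This removes everything from the first opening tag through the first closing tag, including the tags themselves. It only matches a bare opening tag with no attributes, and it does not handle nested tags. Your input from the example would be something like...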
x = """<html>
<head>
...
</head>
<body>
<div>
...
</div>
</body>
</html>"""
print(removeOneTag(x, "div"))
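Then, if you wanted to remove all blocks of that tag...
tag = "div"
# Test for the actual tags: a bare `tag in x` also matches the word in plain text and can loop forever.
while "<"+tag+">" in x and "</"+tag+">" in x:
    x = removeOneTag(x, tag)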
Upvotes: 7
Reputation: 1442
Personally, I don't think you need a library for this.
You can simply write a Python script that reads the HTML file and uses a regex to match the tags you want, then do whatever you like with them (delete them, in your case).
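As a rough illustration of the regex idea, here is a minimal sketch. The function name remove_tag_blocks and the sample document are made up for this example, and it assumes the tag is never nested inside itself; a non-greedy regex stops at the first closing tag, so nested blocks would be cut in the wrong place.

```python
import re

def remove_tag_blocks(html, tag):
    # Match an opening tag (attributes allowed), then as little as possible,
    # up to the nearest matching closing tag. DOTALL lets "." span newlines.
    pattern = re.compile(
        r"<{0}\b[^>]*>.*?</{0}\s*>".format(re.escape(tag)),
        flags=re.DOTALL | re.IGNORECASE,
    )
    return pattern.sub("", html)

# Hypothetical sample input, not taken from the question.
doc = "<html><body><div class='x'>\n...\n</div><p>keep</p></body></html>"
print(remove_tag_blocks(doc, "div"))
```

Unlike plain string matching on "<div>", this also handles opening tags that carry attributes.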
That said, a library for this does exist.
See the official documentation -> https://docs.python.org/2/library/htmlparser.html
Also see this -> Extracting text from HTML file using Python
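For what it's worth, the standard-library parser linked above (html.parser in Python 3) can be used to drop a tag block, including nested ones. This is only a sketch under some assumptions: TagStripper and strip_tag are names invented here, comments and doctype declarations are silently dropped, and an unclosed target tag swallows the rest of the document.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Re-emits the document while skipping one tag and everything inside it."""

    def __init__(self, tag):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()   # HTMLParser reports tag names in lowercase
        self.depth = 0           # how many target tags are currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closed target tag is dropped.
        if tag != self.tag and self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append("&#%s;" % name)

def strip_tag(html, tag):
    parser = TagStripper(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

# Hypothetical sample input with a nested <div>.
sample = "<html><body><div id='a'><div>inner</div>text</div><p>keep &amp; this</p></body></html>"
print(strip_tag(sample, "div"))
```

The depth counter is what lets this cope with nested blocks, which the string and regex approaches cannot do reliably.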
Upvotes: 0
Reputation: 420
Try using an HTML parser such as BeautifulSoup to select the <div> DOM element. Then you can remove it from the parsed tree, for example with the element's decompose() method.
Upvotes: -2