JefersonM

Reputation: 83

Remove HTML block in Python

I'd like to know if there's a library or some method in Python to remove an element from an HTML document. For example, I have this document:

<html>
      <head>
          ...
      </head>
      <body>
          <div>
           ...
          </div>
      </body>
</html>

I want to remove the <div></div> block, along with its contents, so that the document ends up like this:

<html>
    <head>
     ...
    </head>
    <body>
    </body>
</html>

Upvotes: 3

Views: 4562

Answers (3)

Wso

Reputation: 302

For a simple case like this you don't need a library. Just use the built-in string methods.

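def removeOneTag(text, tag):
    start = text.find("<"+tag+">")
    end = text.find("</"+tag+">", start)  # first closing tag after the opening one
    return text if start == -1 or end == -1 else text[:start] + text[end + len(tag)+3:]  # unchanged if a tag is missing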

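This removes the first opening tag, its closing tag, and everything in between. Note that it only matches a bare <tag> with no attributes and does not handle nested blocks of the same tag. So your input in the example would be something like...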

x = """<html>
    <head>
      ...
    </head>
    <body>
       <div>
         ...
       </div>
    </body>
</html>"""
print(removeOneTag(x, "div"))

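Then, if you wanted to remove every such block...

tag = "div"
# Repeat while an opening tag is still followed by a closing tag.
while "</"+tag+">" in x.partition("<"+tag+">")[2]:
    x = removeOneTag(x, tag)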

Upvotes: 7

Ankush Raghuvanshi

Reputation: 1442

Personally, I don't think you need a library for this.

You can simply write a Python script that reads the HTML file and uses a regex to match the tags you want, and then do whatever you like with them (delete them, in your case).
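
A minimal sketch of the regex route, assuming Python 3. The function name remove_tag_blocks is made up for illustration. The pattern is a heuristic rather than a real parser: it tolerates attributes on the opening tag, but it does not cope with nested blocks of the same tag.

```python
import re


def remove_tag_blocks(html, tag):
    """Remove every <tag ...>...</tag> block, contents included.

    Hypothetical helper: a regex heuristic, not an HTML parser.
    Nested blocks of the same tag are NOT handled correctly.
    """
    pattern = re.compile(
        # opening tag (with optional attributes), shortest possible body,
        # then the matching closing tag
        r"<{0}\b[^>]*>.*?</{0}\s*>".format(re.escape(tag)),
        flags=re.DOTALL | re.IGNORECASE,
    )
    return pattern.sub("", html)


doc = "<html><body><div class='x'>\n  ...\n</div></body></html>"
print(remove_tag_blocks(doc, "div"))  # <html><body></body></html>
```

re.DOTALL lets the body span several lines, and the non-greedy .*? stops at the first closing tag, which is why nesting breaks it.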

That said, a library for this does exist.

See the official documentation -> https://docs.python.org/2/library/htmlparser.html

Also see this -> Extracting text from HTML file using Python
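
As a rough sketch of the library route, here is one way it could look with the standard-library parser (html.parser in Python 3, which is the renamed version of the Python 2 module linked above). TagBlockRemover and strip_tag_blocks are hypothetical names, not part of the library. The idea is to re-emit everything the parser reports except what falls inside the unwanted tag, keeping a depth counter so nested blocks go away together.

```python
from html.parser import HTMLParser


class TagBlockRemover(HTMLParser):
    """Re-emit a document while skipping one tag's blocks (illustrative only)."""

    def __init__(self, tag):
        # Keep entities such as &amp; as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()
        self.depth = 0   # > 0 while inside a block that is being removed
        self.out = []    # pieces of the output document

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing target is dropped.
        if self.depth == 0 and tag != self.tag:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
        elif self.depth == 0:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        if self.depth == 0:
            self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        if self.depth == 0:
            self.out.append("<!%s>" % decl)


def strip_tag_blocks(html, tag):
    parser = TagBlockRemover(tag)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


doc = "<html><body><div id='a'><div>inner</div>text</div><p>keep</p></body></html>"
print(strip_tag_blocks(doc, "div"))  # <html><body><p>keep</p></body></html>
```

One side effect of this approach: closing tags are rebuilt from the parsed name, so they come out lowercased even if the source used uppercase.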

Upvotes: 0

Frangipanes

Reputation: 420

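Try using an HTML parser such as BeautifulSoup to select the <div> DOM element. You can then remove it from the parsed tree, for example with the element's decompose() method, and write the document back out; no regex is needed at that point.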

Upvotes: -2
