Reputation: 797
I have raw HTML files from which I have removed the script tags.
I want to identify the block elements in the DOM (such as <h1>, <p>, <div>, etc., but not inline elements such as <a>, <em>, <b>) and enclose them in <div> tags.
Is there an easy way to do this in Python? Is there a Python library that can identify block elements?
Thanks
UPDATE
Actually, I want to extract the content of the HTML document. I have to identify the blocks that contain text: for each text element, I have to find its closest parent element that is displayed as a block. After that, for each block I will extract features such as the size and position of the block.
Upvotes: 3
Views: 537
Reputation: 135
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say 'data' is the raw content of a site; then you could do:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(data, "html.parser")  # you may need to add from_encoding="utf-8" or so
Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:
    def walker(soup):
        # Text nodes report a name of None, so only tags are descended into.
        if soup.name is not None:
            for child in soup.children:
                # do stuff with the node
                print(':'.join([str(child.name), str(type(child))]))
                walker(child)
Note: the code is from this great tutorial.
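For the UPDATE in the question (finding the closest block-level parent of each piece of text), here is a minimal sketch using the standard library's HTMLParser. The names BLOCK_TAGS, VOID_TAGS, BlockTextCollector and text_blocks are made up for this illustration, and the tag sets are a hand-picked assumption, not a complete list. An element's real display mode can also be changed by CSS, which this sketch ignores.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked subset of tags that browsers show as blocks by default.
BLOCK_TAGS = {
    "address", "article", "aside", "blockquote", "body", "div", "dl",
    "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5",
    "h6", "header", "li", "main", "nav", "ol", "p", "pre", "section",
    "table", "td", "th", "ul",
}
# Assumption: common tags that never have a closing tag, so they are not pushed.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class BlockTextCollector(HTMLParser):
    """Record (block tag, text) for every non-empty piece of text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.pairs = []  # (closest block tag, stripped text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Walk from the innermost open tag outwards to the first block tag.
        for tag in reversed(self.stack):
            if tag in BLOCK_TAGS:
                self.pairs.append((tag, text))
                return


def text_blocks(html):
    """Return a list of (closest block tag, text) pairs for an HTML string."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Calling text_blocks('<div><p>Hello <b>bold</b> world</p></div>') pairs each of the three pieces of text with 'p', because the <b> is skipped as inline. Text with no block ancestor at all is dropped. Size and position cannot be read from the parsed tree alone, since they depend on how a browser lays out the page.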
Upvotes: 3