Kim Hyesung

Reputation: 797

Python: how to identify block-level HTML elements that contain text?

I have raw HTML files from which I have already removed the script tags.

I want to identify in the DOM the block elements (like <h1> <p> <div> etc, not <a> <em> <b> etc) and enclose them in <div> tags.

Is there an easy way to do this in Python? Is there a Python library that can identify block elements?

Thanks

UPDATE

Actually, I want to extract content from the HTML document. I have to identify the blocks that contain text: for each text element, I have to find its closest parent element that is displayed as a block. After that, for each block I will extract features such as the size and position of the block.

Upvotes: 3

Views: 537

Answers (1)

dendragon

Reputation: 135

You should use something like Beautiful Soup or HTMLParser.

Have a look at their docs: Beautiful Soup or HTMLParser.

You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.

Here is a simple example of how you could go about this. Say 'data' is the raw content of a site, then you could:

from bs4 import BeautifulSoup
soup = BeautifulSoup(data, "html.parser")  # if data is bytes, you may need to add from_encoding="utf-8" or so

Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:

def walker(soup):
    # Text nodes have name None, so only tags are descended into
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node (here: print its tag name and type)
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)

Note: the code is from this great tutorial.
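
For the UPDATE part of the question (finding the closest block parent of each piece of text), here is a rough sketch that uses only HTMLParser from the standard library. Treat it as a starting point under a few assumptions: BLOCK_TAGS is a hand-picked set of tags that browsers usually display as blocks by default, while the real display type depends on the page's CSS, which a parser cannot see. The names BlockTextCollector and text_blocks are made up for this example. Size and position are not available from a parser at all; for those you would need something that actually renders the page.

```python
from html.parser import HTMLParser

# Assumption: tags that default browser styles usually render as blocks.
# CSS can change this, so adjust the set to your documents.
BLOCK_TAGS = {
    "address", "article", "aside", "blockquote", "body", "dd", "div", "dl",
    "dt", "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4",
    "h5", "h6", "header", "li", "main", "nav", "ol", "p", "pre", "section",
    "table", "td", "th", "ul",
}
# Void elements never have a closing tag, so they are not kept on the stack.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input", "link",
    "meta", "source", "track", "wbr",
}
# Text inside these tags is not visible page text.
SKIP_TAGS = {"script", "style"}


class BlockTextCollector(HTMLParser):
    """Record each non-empty text node with its nearest block-level ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # (block_tag_or_None, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        # Walk from the innermost open tag outwards to the first block tag.
        for tag in reversed(self.stack):
            if tag in BLOCK_TAGS:
                self.blocks.append((tag, text))
                return
        self.blocks.append((None, text))  # text with no block ancestor


def text_blocks(html):
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


sample = ("<div><h1>Title</h1><p>Some <b>bold</b> text</p>"
          "<script>var x = 1;</script></div>")
for tag, text in text_blocks(sample):
    print(tag, repr(text))
```

For the sample string, "Title" is reported under h1, and "Some", "bold" and "text" are all reported under p, because b is inline and the search continues up to its parent. The script content is skipped. The same idea works with Beautiful Soup by walking up from each text node through its parents until you hit a tag in the block set.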

Upvotes: 3
