sorin

Reputation: 170478

How to parse and extract a specific element from an HTML document in Python?

There are lots of XML and HTML parsers in Python, and I am looking for a simple way to extract a section of an HTML document, preferably using an XPath expression, though that is optional.

Here is an example:

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"

I want to extract the entire body of the element with id=content, so the result should be: <div id=content>AAA<B>BBB</B>CCC</div>

It would be ideal if I could do this without installing a new library.

I would also prefer to get the original content of the desired element (not reformatted).

Regular expressions are not an option, as they are not safe for parsing XML/HTML.

Upvotes: 0

Views: 500

Answers (2)

citizen2191629

Reputation: 74

Yes, I have done this. It may not be the best way to do it, but it works along the lines of the code below. I didn't test this.

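import re

# re.search returns a single match object (re.finditer returns an iterator, which has no start()/end()).
# Find where the target div opens and drop everything before it.
match = re.search("<div id=content>", src)
src = src[match.start():]

# At this point the string starts with your div; everything preceding it has been stripped.
# The next part works only if the first </div> in the string is the one that closes your div.
match = re.search("</div>", src)
src = src[:match.end()]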

src now holds just the div you're after. If there are situations where another div is nested inside the one you want, you will have to build a fancier search pattern for the re.search calls.
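
For the nested case, one possible approach is to count opening and closing div tags instead of stopping at the first </div>. The sketch below is only an illustration: extract_div is a hypothetical helper name, and it assumes the opening tag appears literally as "<div id=content>", that div tags are balanced, and that no attribute value or comment contains a ">" character.

```python
import re

def extract_div(src, open_tag="<div id=content>"):
    """Return the div that starts at open_tag, including any nested divs."""
    start = src.find(open_tag)
    if start == -1:
        return None
    depth = 0
    # Visit every opening or closing div tag from the target div onward.
    for m in re.finditer(r"<(/?)div\b[^>]*>", src[start:], re.IGNORECASE):
        if m.group(1):   # a closing </div>
            depth -= 1
        else:            # an opening <div ...>
            depth += 1
        if depth == 0:
            # m.end() is relative to the slice, so add start back.
            return src[start:start + m.end()]
    return None  # the div tags were not balanced

nested = "<html><body>...<div id=content>AAA<div>inner</div>CCC</div>...</body></html>"
print(extract_div(nested))
```

This is still pattern matching rather than real parsing, so it shares the usual weaknesses of regexes on HTML.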

Upvotes: 0

Kalyan02

Reputation: 1434

If you are willing to use a library, BeautifulSoup is a good way to parse this. Here is a snippet showing how it would work for you:

from bs4 import BeautifulSoup

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
soupy = BeautifulSoup(src, "html.parser")

# collect every element whose id attribute is "content"
content_divs = soupy.find_all(attrs={'id': 'content'})
if len(content_divs) > 0:
    # print the first one
    print(str(content_divs[0]))

    # to print the text contents
    print(content_divs[0].text)

    # or to print the HTML of every match
    for each in content_divs:
        print(each)
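
Since the question asks for something that needs no extra library and returns the element exactly as written, here is a rough sketch using only the standard library's html.parser. It is a swapped-in technique, not part of BeautifulSoup: the names ElementSlicer, extract_by_id and VOID_TAGS are made up for illustration, and the sketch assumes that every non-void tag inside the target element is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ElementSlicer(HTMLParser):
    """Record the untouched source text of the first element with a given id."""

    def __init__(self, src, target_id):
        super().__init__()
        self.src = src
        self.target_id = target_id
        # Absolute offset at which each line starts, to convert getpos() results.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(src) if ch == "\n"]
        self.start = None   # offset where the target element opens
        self.depth = 0      # nesting depth inside the target element
        self.result = None

    def _offset(self):
        # getpos() reports the (line, column) of the tag being handled.
        lineno, col = self.getpos()
        return self.line_starts[lineno - 1] + col

    def handle_starttag(self, tag, attrs):
        if self.result is not None or tag in VOID_TAGS:
            return
        if self.start is None:
            if dict(attrs).get("id") == self.target_id:
                self.start = self._offset()
                self.depth = 1
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.start is None or self.result is not None or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            # Slice up to and including the ">" of the closing tag.
            end = self.src.index(">", self._offset()) + 1
            self.result = self.src[self.start:end]

def extract_by_id(src, target_id):
    parser = ElementSlicer(src, target_id)
    parser.feed(src)
    parser.close()
    return parser.result

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
print(extract_by_id(src, "content"))
```

Because the result is sliced out of the original string, the tag case and the unquoted attribute are preserved, whereas printing a BeautifulSoup tag re-serializes the markup.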

Upvotes: 1
