Reputation: 170478
There are lots of XML and HTML parsers in Python, and I am looking for a simple way to extract a section of an HTML document, preferably using an XPath expression, though that is optional.
Here is an example:
src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
I want to extract the entire body of the element with id=content, so the result should be: <div id=content>AAA<B>BBB</B>CCC</div>
It would be nice if I could do this without installing a new library.
I would also prefer to get the original content of the desired element (not reformatted).
Regular expressions are not allowed, as they are not safe for parsing XML/HTML.
Upvotes: 0
Views: 500
Reputation: 74
Yes, I have done this. It may not be the best way to do it, but it works something like the code below. I didn't test this.
import re
# re.search returns the first match object, or None if the pattern is not found.
match = re.search("<div id=content>", src)
src = src[match.start():]
# At this point the string starts with your div; everything preceding it has been stripped.
# This next part works because the first </div> in the string is the end of your div section.
match = re.search("</div>", src)
src = src[:match.end()]
src now holds just the div you're after. If there are situations where another div is nested inside the one you want, you will have to build a fancier search pattern for the re.search calls.
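For the nested case, one alternative to a fancier pattern is to let the standard library's html.parser find the tag boundaries and then slice the original string, so the markup comes back unreformatted. The following is only a rough sketch: the names ElementSlicer and extract_element are made up for illustration, it assumes the whole document is passed in as one string, and it assumes the target element has a proper closing tag.

```python
from html.parser import HTMLParser


class ElementSlicer(HTMLParser):
    """Hypothetical helper: records where the element with a given id starts and ends."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.tag = None    # tag name of the matched element
        self.depth = 0     # nesting depth of same-named tags inside the match
        self.start = None  # (line, column) of the "<" of the opening tag
        self.end = None    # (line, column) of the "<" of the matching closing tag

    def handle_starttag(self, tag, attrs):
        if self.start is None:
            if dict(attrs).get("id") == self.target_id:
                self.tag, self.depth, self.start = tag, 1, self.getpos()
        elif self.end is None and tag == self.tag:
            self.depth += 1  # a nested element with the same tag name

    def handle_endtag(self, tag):
        if self.start is not None and self.end is None and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.end = self.getpos()


def extract_element(src, target_id):
    """Return the original text of the element with the given id, or None."""
    parser = ElementSlicer(target_id)
    parser.feed(src)
    parser.close()
    if parser.start is None or parser.end is None:
        return None

    # getpos() reports (line, column), so build a table of line start offsets.
    line_starts = [0]
    for line in src.split("\n")[:-1]:
        line_starts.append(line_starts[-1] + len(line) + 1)

    def to_offset(pos):
        line, col = pos
        return line_starts[line - 1] + col

    begin = to_offset(parser.start)
    # The recorded end points at the "<" of the closing tag; include through its ">".
    finish = src.index(">", to_offset(parser.end)) + 1
    return src[begin:finish]


src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
print(extract_element(src, "content"))
# <div id=content>AAA<B>BBB</B>CCC</div>
```

Because the result is a slice of src rather than a re-serialized tree, the unquoted attribute and the uppercase B tags survive as written. Badly broken markup (for example a missing closing tag) makes the function return None rather than guess.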
Upvotes: 0
Reputation: 1434
To parse using a library, the best way is BeautifulSoup. Here is a snippet showing how it would work for you:
from BeautifulSoup import BeautifulSoup
src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
soupy = BeautifulSoup(src)
content_divs = soupy.findAll(attrs={'id': 'content'})
if len(content_divs) > 0:
    # print the first one
    print str(content_divs[0])
    # to print the text contents
    print content_divs[0].text
# or to print all the raw html
for each in content_divs:
    print each
Upvotes: 1