How do I get all content between two html tags in Python?

Question

I am trying to extract all content (tags and text) from inside one main tag on an HTML page. For example:

`my_html_page = '''

    
       
          
             Some text
             another text
          
          hello world
          
              some text inside p
             

                one li
                second li
             
         
         some text 2
         
             text inside div
         
         some text 3
      
      
          text inside second main div
      
      
          third div
      
      
          four div
      
      
          other text
      
  
'''`

Using the XPath (//div[@class="post_body"])[1], I need to get:

`
       
          
             Some text
             another text
          
          hello world
          
              some text inside p
             

                one li
                second li
             
         
         some text 2
         
             text inside div
         
         some text 3
      
`

That is, everything inside that tag.

I read this topic, but it did not help.

I need to build the DOM with the BeautifulSoup parser in lxml.

 import lxml.html.soupparser
 import lxml.html
 text_inside_tag = lxml.html.soupparser.fromstring(my_html_page)
 text = text_inside_tag.xpath('(//div[@class="post_body"])[1]/text()')

This way I can extract only the text inside the tag, but I need the text together with the tags.

If I try this:

for elem in text_inside_tag.xpath('(//div[@class="post_body"])[1]/text()'):
    print lxml.html.tostring(elem, pretty_print=True)

I get the error: TypeError: Type '_ElementStringResult' cannot be serialized.

Help, please.

har07 · Accepted Answer

You can try it this way:

import lxml.html.soupparser
import lxml.html

my_html_page = '''...some html markup here...'''
root = lxml.html.soupparser.fromstring(my_html_page)

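for elem in root.xpath("//div[@class='post_body']"):
    # elem.text is None when the div starts directly with a child element, so fall back to ''.
    # encoding='unicode' makes tostring() return text rather than bytes, so the pieces can be joined.
    result = (elem.text or '') + ''.join(
        lxml.html.tostring(e, pretty_print=True, encoding='unicode') for e in elem)
    print(result)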

The result variable is constructed by combining the leading text node of the parent with the markup of all of its child nodes (tostring() includes each child's trailing text by default).
