Extract text with lxml

Question

I have this text:

INTRODUCTION
This is a test document for xml.
I need to extract this sentence.

Conclusion
It should hopefully..

The line "I need to extract this sentence." is in italics. The XML of the file looks like:

[XML markup lost in extraction; only the text nodes survive, in document order:]

INTRODUCTION
This is a test document for xml.
I need to extract this sentence.
Conclusion
It should hopefully
..

I tried:

tree = ET.parse(doc_xml)  
[b.tag for b in tree.iterfind(".//i")]

The above returns an empty list.

I've searched a lot but wasn't able to figure out how to do that as the text is contained within . I have seen this question where this was done easily using BeautifulSoup.

Edit: This isn't exactly related, but here is an ElementTree approach that extracts all the text:

w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
for p in source.findall('.//{' + w + '}p'):
    print(''.join(t.text for t in p.findall('.//{' + w + '}t')))

falsetru · Accepted Answer

Slightly modifying the code from your edit, you will get what you want:

>>> w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
>>> for t in tree.findall('.//{%(ns)s}i/../..//{%(ns)s}t' % {'ns': w}):
...     print(t.text)
... 
I need to extract this sentence.
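
For anyone who wants to try this without a .docx at hand, here is a minimal, self-contained sketch. The XML fragment is hand-written and hypothetical (a real document.xml carries many more attributes and elements), and it uses the standard-library ElementTree with a prefix map rather than the `{uri}tag` spelling. It also shows why the unqualified `.//i` search comes back empty.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}  # prefix map, so paths can be written as w:tag

# Hypothetical, hand-written fragment; not the asker's actual file.
DOC = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:rPr><w:i/></w:rPr><w:t>I need to extract this sentence.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

root = ET.fromstring(DOC)

# Unqualified name: matches nothing, because every element lives in the w namespace.
unqualified = root.findall(".//i")

# Qualified path: italic marker -> rPr -> run, then the text nodes under that run.
italic = [t.text for t in root.findall(".//w:i/../..//w:t", NS)]

print(unqualified)  # []
print(italic)       # ['I need to extract this sentence.']
```

The prefix `w` in the path is only a local alias taken from the `NS` dictionary; what has to match is the namespace URI.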

BTW, if you use local-name(), you don't need to specify the namespace (this requires the xpath method, which lxml provides):

>>> for t in tree.xpath('.//*[local-name()="i"]/../..//*[local-name()="t"]'):
...     print(t.text)
... 
I need to extract this sentence.
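
As a possible alternative to local-name(), lxml's xpath method also accepts a prefix map through its `namespaces` argument, which allows a predicate that selects the italic runs directly instead of climbing with `..`. The sketch below assumes lxml is installed and uses a hand-written, hypothetical fragment rather than the asker's file.

```python
from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, hand-written fragment; a real document.xml is more verbose.
DOC = b"""<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:rPr><w:i/></w:rPr><w:t>I need to extract this sentence.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

root = etree.fromstring(DOC)

# Select every run (w:r) whose run properties (w:rPr) contain w:i,
# then take the text of its w:t children.
italic = root.xpath("//w:r[w:rPr/w:i]/w:t/text()", namespaces={"w": W})

print(italic)
```

This only looks for a `w:i` element in the run's own properties. It does not account for italics inherited from a style or switched off with a `w:val` attribute, so treat it as a starting point.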

UPDATE

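.. in the expression selects the parent of the current node, so {...}i/../.. selects the grandparent of the i node. In WordprocessingML that grandparent is the run element (r): i sits inside rPr, which sits inside r. The trailing //{...}t then finds the t element under that run.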
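
To make the two `..` steps concrete, the sketch below walks them by hand on a single hypothetical run. Standard-library ElementTree elements do not keep a reference to their parent, so it builds a child-to-parent map first. The names `parent_of` and `local` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical one-run fragment, written by hand for illustration.
RUN = (
    '<w:r xmlns:w="%s">'
    "<w:rPr><w:i/></w:rPr>"
    "<w:t>I need to extract this sentence.</w:t>"
    "</w:r>" % W
)

run = ET.fromstring(RUN)

# Build a child -> parent lookup, since elements do not store their parent.
parent_of = {child: parent for parent in run.iter() for child in parent}


def local(el):
    """Strip the '{namespace}' part of a tag for readable output."""
    return el.tag.rsplit("}", 1)[-1]


i_node = run.find(".//{%s}i" % W)
first_up = parent_of[i_node]      # what the first '..' selects
second_up = parent_of[first_up]   # what the second '..' selects

steps = [local(i_node), local(first_up), local(second_up)]
texts = [t.text for t in second_up.iter("{%s}t" % W)]

print(steps)  # ['i', 'rPr', 'r']
print(texts)  # ['I need to extract this sentence.']
```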
