Python: extract text from tag inside tag in XML Tree

Question

I am currently parsing a Wikipedia dump, trying to extract some useful information. The dump is in XML, and I want to extract only the text content of each page. Now I'm wondering how to find all the text inside a tag that is itself inside another tag. I searched for similar questions, but only found ones dealing with a single tag. Here is an example of what I want to achieve:

  
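  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>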

  

How can I extract the text inside the text tag, but only when it is included in the revision tree?

willeM_ Van Onsem · Accepted Answer

You can use the xml.etree.ElementTree module for that, with an XPath-style query (ElementTree supports a limited subset of XPath):

import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)

(where the_xml_string is a string containing the XML code).

Or collect the text of every matching element with a list comprehension:

import xml.etree.ElementTree as ET

texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]

The .text attribute holds the element's inner text. Note that you will have to replace othertag with the tag you actually want (for instance text). If that tag can be nested arbitrarily deep inside the revision tag, use .//revision//othertag as the query instead.
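To make the difference between the two queries concrete, here is a minimal sketch. The SAMPLE_XML string is a hypothetical document modeled on the question's example: the page wrapper, the title, and the stray text element outside the revision are assumptions added only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the question's example; the <page> wrapper,
# the <title>, and the <text> outside <revision> are assumptions.
SAMPLE_XML = """<page>
  <title>Example</title>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
  </revision>
  <text>Not inside a revision.</text>
</page>"""

root = ET.fromstring(SAMPLE_XML)

# '/' matches only direct children of <revision>, so the <text> element
# that sits outside <revision> is skipped.
direct = [el.text for el in root.findall('.//revision/text')]

# <username> is one level deeper (inside <contributor>):
# '/' does not reach it, '//' does.
child_only = root.findall('.//revision/username')
any_depth = [el.text for el in root.findall('.//revision//username')]

print(direct)      # ['A bunch of [[text]] here.']
print(child_only)  # []
print(any_depth)   # ['Foobar']
```

Keep in mind that .//revision searches the descendants of the element you call findall on, so if root is itself the revision element, the query would be ./text (or .//text) instead.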
