J. Williams
J. Williams

Reputation: 135

Python: extract text from tag inside tag in XML Tree

I am currently parsing a Wikipedia dump, trying to extract some useful information. The parsing takes place in XML, and I want to extract only the text / content for each page. Now I'm wondering how you can find all text inside a tag that is inside another tag. I searched for similar questions, but only found the ones having problems with a singular tag. Here is an example of what I want to achieve:

  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>

  <example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </example_tag>

How can I extract the text inside the text tag, but only when it is included in the revision tree?

Upvotes: 2

Views: 3584

Answers (1)

willeM_ Van Onsem
willeM_ Van Onsem

Reputation: 477794

You can use the xml.etree.elementtree package for that and use an XPath query:

import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)

(where the_xml_string is a string containing the XML code).

Or obtain a list of the text elements with list comprehension:

import xml.etree.ElementTree as ET

texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]

So the .text has the inner text. Note that you will have to replace othertag with the tag (for instance text). If that tag can be arbitrary deep in the revision tag, you should use .//revision//othertag as XPath query.

Upvotes: 3

Related Questions