Reputation: 135
I am currently parsing a Wikipedia dump to extract some useful information. The dump is in XML, and I want to extract only the text content of each page. How can I find all the text inside a tag that is itself nested inside another tag? I searched for similar questions, but only found ones dealing with a single tag. Here is an example of what I want to achieve:
<revision>
<timestamp>2001-01-15T13:15:00Z</timestamp>
<contributor>
<username>Foobar</username>
<id>65536</id>
</contributor>
<comment>I have just one thing to say!</comment>
<text>A bunch of [[text]] here.</text>
<minor />
</revision>
<example_tag>
<timestamp>2001-01-15T13:15:00Z</timestamp>
<contributor>
<username>Foobar</username>
<id>65536</id>
</contributor>
<comment>I have just one thing to say!</comment>
<text>A bunch of [[text]] here.</text>
<minor />
</example_tag>
How can I extract the text inside the text tag, but only when it is nested inside the revision tag?
Upvotes: 2
Views: 3584
Reputation: 477794
You can use the xml.etree.ElementTree module for that, with an XPath query (ElementTree supports a limited subset of XPath):
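import xml.etree.ElementTree as ET
# Parse the XML string and get its root element
root = ET.fromstring(the_xml_string)
# Match every othertag that is a direct child of a revision element
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)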
(where the_xml_string is a string containing the XML code).
Or obtain a list of the text contents with a list comprehension:
import xml.etree.ElementTree as ET
texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]
The .text attribute holds the element's inner text. Note that you will have to replace othertag with the actual tag name (for instance text). If that tag can be nested arbitrarily deep inside the revision tag, use .//revision//othertag as the XPath query instead.
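To make the difference between the two queries concrete, here is a minimal sketch run against a trimmed version of the sample from the question. It rests on a few assumptions that are not in the original: the two fragments are wrapped in a single page root so that the string is well-formed XML, a hypothetical wrapper element is added to show the nested case, and the tags carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the question's fragments wrapped in one <page> root element.
# The <wrapper> element is hypothetical, added only to show deeper nesting.
the_xml_string = """
<page>
  <revision>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <wrapper><text>Nested text.</text></wrapper>
  </revision>
  <example_tag>
    <text>Not inside a revision.</text>
  </example_tag>
</page>
"""

root = ET.fromstring(the_xml_string)

# Only <text> elements that are direct children of a <revision>
direct = [el.text for el in root.findall('.//revision/text')]

# <text> elements at any depth below a <revision>
nested = [el.text for el in root.findall('.//revision//text')]

print(direct)  # ['A bunch of [[text]] here.']
print(nested)  # ['A bunch of [[text]] here.', 'Nested text.']
```

In both cases the text element under example_tag is skipped, because it has no revision ancestor.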
Upvotes: 3