Reputation: 11321
I'm trying to teach myself how to parse XML. I've read the lxml tutorials, but they're hard to understand. So far, I can do:
>>> from lxml import etree
>>> xml=etree.parse('ham.xml')
>>> xml
<lxml.etree._ElementTree object at 0x118de60>
But how can I get data from this object? It can't be indexed like xml[0], and it can't be iterated over.
More specifically, I'm using this XML file and I'm trying to extract, say, everything between the <l> tags that's surrounded by <sp> tags that contain, say, the Barnardo attribute.
Upvotes: 0
Views: 2132
Reputation: 2051
One way to query the parsed XML is with XPath. You can call the xpath() method of an ElementTree, in your case xml.
As an example, the following prints the XML for all the <l> elements (lines of the play):
subtrees = xml.xpath('//l', namespaces={'prefix': 'http://www.tei-c.org/ns/1.0'})
for l in subtrees:
    print(etree.tostring(l))
The lxml docs detail the xpath functionality.
As pointed out below, this doesn't work unless a namespace is specified. Unfortunately, lxml's XPath does not support an empty namespace prefix, but you can change the root node to declare the namespace under a prefix named prefix, which is also the name used above.
<TEI xmlns:prefix="http://www.tei-c.org/ns/1.0" xml:id="sha-ham">
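If you would rather not edit the file, here is a minimal sketch of why the unprefixed query comes back empty and two ways around it. SAMPLE is a made-up two-line stand-in for the real document, and the prefix t is arbitrary; the local-name() variant is an alternative I'm adding, not something from the original answer.

```python
from lxml import etree

# Hypothetical two-line sample using the same default namespace as the TEI file.
TEI_NS = 'http://www.tei-c.org/ns/1.0'
SAMPLE = ('<TEI xmlns="%s"><sp who="Barnardo"><l>Who\'s there?</l>'
          '<l>Long live the king!</l></sp></TEI>' % TEI_NS).encode('utf-8')
root = etree.fromstring(SAMPLE)

# An unprefixed name in XPath means "no namespace", so nothing matches here.
unprefixed = root.xpath('//l')

# Binding any non-empty prefix to the namespace URL makes the query match.
prefixed = root.xpath('//t:l', namespaces={'t': TEI_NS})

# Alternative: compare only the local part of the tag name, ignoring namespaces.
by_local_name = root.xpath('//*[local-name()="l"]')

print(len(unprefixed), len(prefixed), len(by_local_name))
```

The local-name() form is handy for quick exploration, but it will also match <l> elements from any other namespace, so the prefixed query is usually the safer choice.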
Upvotes: 2
Reputation: 1123460
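It is an ElementTree object, which wraps the whole document. Call xml.getroot() to get the root Element object, which can be indexed and iterated over.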
You can also look at the lxml API documentation, which has an lxml.etree._Element page. That page tells you about every single attribute and method on that class you could ever want to know about.
I'd start with reading the lxml.etree tutorial, however.
If the element cannot be indexed, however, it is an empty tag, and there are no child nodes to retrieve.
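As a rough illustration of what you can do once you have an Element, here is a small sketch. SAMPLE is a hypothetical, namespace-free stand-in for ham.xml so the snippet runs on its own; with the real file you would use etree.parse('ham.xml') instead.

```python
from lxml import etree

# Hypothetical stand-in for ham.xml, kept tiny so the example is self-contained.
SAMPLE = b"""<play>
  <sp who="Barnardo"><l>Who's there?</l></sp>
  <sp who="Francisco"><l>Nay, answer me: stand, and unfold yourself.</l></sp>
</play>"""

# Wrap the parsed root so we hold the same kind of object etree.parse() returns.
tree = etree.ElementTree(etree.fromstring(SAMPLE))
root = tree.getroot()                        # the root Element

first = root[0]                              # Elements support indexing
child_tags = [child.tag for child in root]   # and iteration over child elements
who = first.get('who')                       # attribute lookup
text = first[0].text                         # text of the first <l> in the first <sp>

print(root.tag, len(root), child_tags, who, text)
```

Note that in the real file every tag lives in the TEI namespace, so child.tag would look like {http://www.tei-c.org/ns/1.0}sp rather than plain sp.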
To find all lines by Barnardo, an XPath expression is needed, with a namespace map. It doesn't matter what prefix you use; as long as it is a non-empty string, lxml will map it to the correct namespace URL:
nsmap = {'s': 'http://www.tei-c.org/ns/1.0'}
# tree is the parsed document, e.g. tree = etree.parse('ham.xml')
for line in tree.xpath('.//s:sp[@who="Barnardo"]/s:l/text()', namespaces=nsmap):
    print(line.strip())
This extracts all text in <l> elements that are contained in <sp who="Barnardo"> tags. Note the s: prefixes on the tag names; the nsmap dictionary tells lxml what namespace to use. I printed these without the surrounding extra whitespace.
For your sample document, that gives:
>>> for line in tree.xpath('.//s:sp[@who="Barnardo"]/s:l/text()', namespaces=nsmap):
...     print(line.strip())
...
Who's there?
Long live the king!
He.
'Tis now struck twelve; get thee to bed, Francisco.
Have you had quiet guard?
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
Say,
What, is Horatio there?
Welcome, Horatio: welcome, good Marcellus.
I have seen nothing.
Sit down awhile;
And let us once again assail your ears,
That are so fortified against our story
What we have two nights seen.
Last night of all,
When yond same star that's westward from the pole
Had made his course to illume that part of heaven
Where now it burns, Marcellus and myself,
The bell then beating one,
In the same figure, like the king that's dead.
Looks 'a not like the king? mark it, Horatio.
It would be spoke to.
See, it stalks away!
How now, Horatio! you tremble and look pale:
Is not this something more than fantasy?
What think you on't?
I think it be no other but e'en so:
Well may it sort that this portentous figure
Comes armed through our watch; so like the king
That was and is the question of these wars.
'Tis here!
It was about to speak, when the cock crew.
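If you want to experiment without the full file, the same query can be wrapped in a small helper and run against an inline document. This is only a sketch: SAMPLE is a made-up miniature of the play, and lines_by with its speaker and prefix parameters are names I invented for illustration. It passes the speaker as an XPath variable ($who) rather than pasting it into the expression.

```python
from lxml import etree

# Hypothetical miniature of the play; the real file is much larger.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="Barnardo"><l> Who's there? </l></sp>
  <sp who="Francisco"><l>Nay, answer me: stand, and unfold yourself.</l></sp>
  <sp who="Barnardo"><l>Long live the king!</l></sp>
</TEI>"""
tree = etree.ElementTree(etree.fromstring(SAMPLE))


def lines_by(tree, speaker, prefix='s'):
    """Return the stripped text of every <l> inside <sp who=speaker>."""
    # Any non-empty prefix works, as long as the expression uses the same one.
    nsmap = {prefix: 'http://www.tei-c.org/ns/1.0'}
    query = './/{0}:sp[@who=$who]/{0}:l/text()'.format(prefix)
    # Keyword arguments to xpath() become XPath variables ($who here).
    return [t.strip() for t in tree.xpath(query, namespaces=nsmap, who=speaker)]


barnardo = lines_by(tree, 'Barnardo')
same_with_other_prefix = lines_by(tree, 'Barnardo', prefix='tei')
print(barnardo)
```

Calling it with a different prefix returns the same list, which shows that the prefix is just a local label for the namespace URL.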
Upvotes: 2