astromax

Reputation: 6331

Parsing XML data within tags using lxml in Python

My question is about how to get information stored in the attributes of a self-closing tag (one with no separate closing tag). Here's the relevant XML:

<?xml version="1.0" encoding="UTF-8"?>
<uws:job>  
<uws:results>
    <uws:result id="2014-03-03T15:42:31:1337" xlink:href="http://www.cosmosim.org/query/index/stream/table/2014-03-03T15%3A42%3A31%3A1337/format/csv" xlink:type="simple"/>
</uws:results>
</uws:job>

I'm looking to extract the xlink:href URL here. As you can see, the uws:result tag is self-closing. Additionally, the 'uws:' prefix makes these tags a bit tricky to handle in Python. Here's what I've tried so far:

from lxml import etree
root = etree.fromstring(xmlresponse.content)
url = root.find('{*}results').text

Here xmlresponse.content is the XML data to be parsed. This returns

'\n    '

which indicates that it's only finding the whitespace (a newline plus indentation), since what I'm really after is contained within a tag inside the results tag. Any ideas would be greatly appreciated.

Upvotes: 3

Views: 415

Answers (1)

Corley Brigman

Reputation: 12401

You were close: you found the uws:results node, but the URL is stored as an attribute of its child uws:result element, not as element text. Instead of

url = root.find('{*}results').text

you really want

url = root.find('{*}results/{*}result').get('attribname', 'value_to_return_if_not_present')

or

url = root.find('{*}results/{*}result').attrib['attribname']

(which will raise a KeyError if the attribute is not present).

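Because the attribute itself is namespaced (xlink:href), you will probably need the {ns}attrib syntax to look it up too, where ns is the full namespace URI that the document binds to the xlink prefix, not the prefix itself.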
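To make the namespaced lookup concrete, here is a minimal sketch. The excerpt in the question omits the xmlns declarations, so the two namespace URIs below are assumptions (the usual IVOA UWS 1.0 and W3C XLink URIs); check the real response for the actual values. xml_bytes is a stand-in for xmlresponse.content, and the path descends to uws:result because that is the element carrying the attribute.

```python
# lxml is what the question uses; the standard library's ElementTree
# (Python 3.8+) accepts the same calls, so it serves as a fallback here.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# ASSUMPTION: these URIs are not shown in the question's excerpt.
UWS_NS = "http://www.ivoa.net/xml/UWS/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Stand-in for xmlresponse.content (bytes), with xmlns declarations added.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    f'<uws:job xmlns:uws="{UWS_NS}" xmlns:xlink="{XLINK_NS}">\n'
    '<uws:results>\n'
    '    <uws:result id="2014-03-03T15:42:31:1337"'
    ' xlink:href="http://www.cosmosim.org/query/index/stream/table/'
    '2014-03-03T15%3A42%3A31%3A1337/format/csv"'
    ' xlink:type="simple"/>\n'
    '</uws:results>\n'
    '</uws:job>\n'
).encode("utf-8")

root = etree.fromstring(xml_bytes)

# The attribute sits on the child <uws:result>, so descend one level.
result = root.find("{*}results/{*}result")

# Attribute names use the full URI in {uri}name form, not the xlink: prefix.
href_key = "{%s}href" % XLINK_NS
url = result.get(href_key, None)       # None if the attribute is absent
same_url = result.attrib[href_key]     # raises KeyError if absent

print(url)
```

The {*} wildcard works for element tags in find(), but not for attribute names, which is why the attribute key spells out the full URI.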

You can also dump out the attrib dictionary and just copy the fully qualified attribute name from it.
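A quick way to do that dump, as a rough sketch: the XML here is a trimmed, hypothetical stand-in (the id and example.org URL are made up, and the namespace URIs are assumed, since the question's excerpt does not show its xmlns declarations).

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # same calls work on Python 3.8+

# Hypothetical, shortened document; namespace URIs are assumptions.
xml_bytes = (
    b'<uws:job xmlns:uws="http://www.ivoa.net/xml/UWS/v1.0"'
    b' xmlns:xlink="http://www.w3.org/1999/xlink">'
    b'<uws:results>'
    b'<uws:result id="r1" xlink:href="http://example.org/data.csv"'
    b' xlink:type="simple"/>'
    b'</uws:results></uws:job>'
)

root = etree.fromstring(xml_bytes)
result = root.find("{*}results/{*}result")

# Copy the attributes into a plain dict and print each qualified name.
attrs = dict(result.attrib)
for name, value in attrs.items():
    print(repr(name), "->", value)

# Namespaced keys appear as '{uri}localname'; unqualified ones as plain names.
href_keys = [name for name in attrs if name.endswith("}href")]
```

Whatever key shows up for href in the printout is the exact string to pass to get() or attrib[...].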

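text is the content between an element's opening tag and its first child (or its closing tag). For a container element like uws:results that is just the whitespace between elements, which is why you got '\n    '. It is not normally used there, but it is preserved both for spacing (like etreeindent) and for some special cases.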
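A small sketch of where that whitespace ends up, using a stripped-down document without namespaces (an assumption made only to keep the example short):

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Same layout as the question's XML, minus prefixes and attributes.
root = etree.fromstring(b"<job><results>\n    <result/>\n</results></job>")
results = root.find("results")
result = results.find("result")

# .text: whatever sits between <results> and its first child.
print(repr(results.text))   # '\n    '

# A self-closing element has no content at all, so .text is None.
print(repr(result.text))    # None

# .tail: whatever follows the element, up to the next tag.
print(repr(result.tail))    # '\n'
```

So the '\n    ' from the question is only the indentation before the child element; the data itself lives in that child's attributes.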

Upvotes: 2
