user2565150

Reputation: 83

LXML Xpath does not seem to return full path

OK, I'll be the first to admit that it does return a path; it's just not the path I want, and I don't know how to get the one I need.

I'm using Python 3.3 in Eclipse with the PyDev plugin, on Windows 7 at work and Ubuntu 13.04 at home. I'm new to Python and have limited programming experience.

I'm trying to write a script that takes in an XML Lloyd's market insurance message, finds all the tags and dumps them into a .csv file where we can easily update them, and then reimports them to create an updated XML file.

I have managed all of that, except that when I collect the tags I only get each tag's name and not the names of the tags above it.

<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>

That is a fragment of the XML. What I want is to find all the tags and their paths. For example, I want to show Count as ItemsInGroupTotal/Count but can only get it as Count.

Here is my code:

xml = etree.parse(fullpath)
print( xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)

This gives:

'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',

As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13

Can anyone point me towards what I need?

Thanks (hope my first post is OK)

Adam

EDIT:

This is my code now:

with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()

fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)


xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)

every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)

But this returns an error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

I removed the first two characters because they are b' and the parser complained that the input didn't start with a tag.

Update:

I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.

Any help on how I can achieve this?

Upvotes: 2

Views: 2630

Answers (2)

Brecht Machiels

Reputation: 3410

getpath() does indeed return an XPath that is not suited for human consumption. You can, however, build a more readable one from it, for example with this quick-and-dirty approach:

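from lxml import etree


def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    # getpath() emits prefixed steps (e.g. xis:Tag) for prefixed elements, so
    # pass the in-scope prefixes along; XPath cannot use the default (None) one.
    prefixes = {p: uri for p, uri in element.nsmap.items() if p}
    xpath = ''
    readable = ''
    for node in full_xpath.split('/')[1:]:
        xpath += '/' + node
        element = element.xpath(xpath, namespaces=prefixes)[0]
        # QName also copes with tags that have no namespace at all.
        tag = etree.QName(element).localname
        parent = element.getparent()
        if parent is not None:
            # Siblings with the same qualified name decide whether an index is needed.
            same_name = parent.findall(element.tag)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        readable += '/' + tag
    return readable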
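
If the namespace prefixes need to stay visible (the question's update mentions xis: tags), a variant could keep each element's own prefix instead of dropping the namespace. The following is only a minimal sketch and not part of the approach above: the function name prefixed_path, the sample document, and its URIs are all made up for illustration.

```python
from lxml import etree


def prefixed_path(element):
    """Build a readable absolute path that keeps each element's prefix."""
    steps = []
    while element is not None:
        step = etree.QName(element).localname
        # element.prefix is None for un-prefixed (default or no namespace) tags.
        if element.prefix:
            step = element.prefix + ':' + step
        parent = element.getparent()
        if parent is not None:
            # Same-named direct children of the parent, in document order.
            same_name = parent.findall(element.tag)
            if len(same_name) > 1:
                # XPath positions are 1-based.
                step += '[{}]'.format(same_name.index(element) + 1)
        steps.append(step)
        element = parent
    return '/' + '/'.join(reversed(steps))


# Hypothetical sample: the URIs and the xis:Note element are illustrative only.
SAMPLE = b"""<TechAccount xmlns="urn:example:default" xmlns:xis="urn:example:xis">
  <ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal>
  <xis:Note>first</xis:Note>
  <xis:Note>second</xis:Note>
</TechAccount>"""

root = etree.fromstring(SAMPLE)
# iter('*') yields elements only, skipping comments and processing instructions.
paths = [prefixed_path(e) for e in root.iter('*')]
print(paths)
```

Walking up through getparent() avoids re-evaluating XPath expressions, at the cost of producing a display string rather than something guaranteed to be usable as a query.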

Upvotes: 2

alecxe

Reputation: 473753

ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element.

Calling getpath on each element in an iter() loop should work for you:

from pprint import pprint
from lxml import etree


text = """
<TechAccount Sender="broker" Receiver="insurer">
    <UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
    <BrokerReference>HOY123/456</BrokerReference>
    <ServiceProviderReference>2012080921401A1</ServiceProviderReference>
    <CreationDate>2012-08-10</CreationDate>
    <AccountTransactionType>premium</AccountTransactionType>
    <GroupReference>2012080921401A1</GroupReference>
    <ItemsInGroupTotal>
        <Count>1</Count>
    </ItemsInGroupTotal>
    <ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
    <ServiceProviderGroupItemsTotal>
        <Count>13</Count>
    </ServiceProviderGroupItemsTotal>
</TechAccount>
"""

xml = etree.fromstring(text)
tree = etree.ElementTree(xml)

every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)

prints:

['/TechAccount, \n',
 '/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
 '/TechAccount/BrokerReference, HOY123/456',
 '/TechAccount/ServiceProviderReference, 2012080921401A1',
 '/TechAccount/CreationDate, 2012-08-10',
 '/TechAccount/AccountTransactionType, premium',
 '/TechAccount/GroupReference, 2012080921401A1',
 '/TechAccount/ItemsInGroupTotal, \n',
 '/TechAccount/ItemsInGroupTotal/Count, 1',
 '/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
 '/TechAccount/ServiceProviderGroupItemsTotal, \n',
 '/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
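
Since the goal in the question is a .csv file, note that joining path and text with '%s, %s' can give ambiguous rows when the text itself contains a comma. Below is a minimal sketch using the standard csv module, which quotes such values. The sample XML is invented, and an in-memory buffer stands in for a real file; both are assumptions for illustration.

```python
import csv
import io

from lxml import etree

# Illustrative sample; note the comma inside the BrokerReference text.
SAMPLE = b"""<TechAccount>
  <BrokerReference>HOY123,456</BrokerReference>
  <ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal>
</TechAccount>"""

root = etree.fromstring(SAMPLE)
tree = root.getroottree()

# One (path, text) pair per element; missing or whitespace-only text becomes ''.
rows = [(tree.getpath(e), (e.text or '').strip()) for e in root.iter('*')]

buffer = io.StringIO()  # stands in for open('tags.csv', 'w', newline='')
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())

# Reading the rows back recovers the original pairs, commas included.
buffer.seek(0)
restored = [tuple(row) for row in csv.reader(buffer)]
```

Keeping the path in its own column also gives the reimport step a key for locating each element again.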

UPD: If your XML data is in the file test.xml, the code would look like this:

from pprint import pprint
from lxml import etree

xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)

every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
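
The ValueError quoted in the question's edit most likely arises because '%s' % xmlfile turns the bytes into a str that still carries the XML encoding declaration. A minimal sketch of passing the bytes straight to etree.fromstring instead; the sample document is invented, and a temporary file stands in for the question's fullpath.

```python
import os
import tempfile

from lxml import etree

# Invented sample with an encoding declaration, as real messages often have.
DATA = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b'<TechAccount><Count>13</Count></TechAccount>')

# A temporary file stands in for the question's fullpath.
with tempfile.NamedTemporaryFile(suffix='.xml', delete=False) as handle:
    handle.write(DATA)
    fullpath = handle.name

try:
    with open(fullpath, 'rb') as xml_file:
        raw = xml_file.read()  # bytes, not str

    # Bytes input lets the parser honour the declared encoding itself.
    root = etree.fromstring(raw)
    tree = root.getroottree()
    every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in root.iter('*')]
    print(every_tag)

    # Decoding to str first is expected to reproduce the error from the question.
    try:
        etree.fromstring(raw.decode('utf-8'))
    except ValueError as error:
        print('ValueError:', error)
finally:
    os.remove(fullpath)
```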

Hope that helps.

Upvotes: 2
