Jeff P

Reputation: 345

Basic Python Parsing XML with xml.etree - Issue

I am trying to parse XML and am having a hard time. I don't understand why the results keep printing [<Element 'Results' at 0x105fc6110>]. I am trying to extract Social from my example with the following code:

import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
results = root.findall("Results")
print results #[<Element 'Results' at 0x105fc6110>]
              # WHAT IS THIS??


for result in results:
    print result.find("Social") #None

The XML looks like this:

<?xml version="1.0"?>
<List1>
    <NextOffset>AAA</NextOffset>
    <Results>
        <R>
            <D>internet.com</D>
            <META>
                <Social>
                    <v>http://twitter.com/internet</v>
                    <v>http://facebook.com/internet</v>
                </Social>
                <Telephones>
                    <v>+1-555-555-6767</v>
                </Telephones>
            </META>
        </R>
    </Results>
</List1>

Upvotes: 2

Views: 636

Answers (2)

mikerose

Reputation: 76

results = root.findall("Results") is a list of xml.etree.ElementTree.Element objects.

type(results)
# list
type(results[0])
# xml.etree.ElementTree.Element

find and findall with a plain tag name only search the direct children of an element. The iter method iterates over matching descendants at any depth.
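To make the difference concrete, here is a minimal sketch using a trimmed copy of the XML from the question. The XML is inlined with ET.fromstring so it runs without test.xml; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, inlined so the sketch is self-contained.
XML = """<List1>
    <Results>
        <R>
            <META>
                <Social>
                    <v>http://twitter.com/internet</v>
                    <v>http://facebook.com/internet</v>
                </Social>
            </META>
        </R>
    </Results>
</List1>"""

root = ET.fromstring(XML)        # root is the <List1> element
results = root.find("Results")   # <Results> is a direct child, so it is found

direct = results.find("Social")          # plain tag name: direct children only
nested = results.find(".//Social")       # ".//" searches every descendant
via_iter = list(results.iter("Social"))  # iter() also walks the whole subtree

print(direct)         # None
print(nested.tag)     # Social
print(len(via_iter))  # 1
```

This is why result.find("Social") in the question prints None: <Social> sits two levels below <Results>, under <R> and <META>.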

Option 1

If <Results> could potentially have more than one <Social> element, you could use this:

for result in results:
    for soc in result.iter("Social"):   # every <Social> at any depth under this <Results>
        for link in soc.iter("v"):      # every <v> under that <Social>
            print(link.text)

That's the worst-case scenario. If you know there will be one <Social> per <Results>, it simplifies to:

for soc in root.iter("Social"):
    for link in soc.iter("v"):
        print(link.text)

Both print:

http://twitter.com/internet
http://facebook.com/internet

Option 2

Or use nested list comprehensions and do it with one line of code. Because Python...

socialLinks = [[v.text for v in soc] for soc in root.iter("Social")]

# socialLinks == [['http://twitter.com/internet', 'http://facebook.com/internet']]

socialLinks is a list of lists. The outer list has one entry per <Social> element (only one in this example).
Each inner list contains the text of the v elements within that particular <Social> element.
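If a single flat list is more convenient than a list of lists, one comprehension with two for clauses should do it. This is a rough sketch on a trimmed copy of the question's XML; the names nested_links and flat_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, inlined so the sketch is self-contained.
XML = """<List1>
    <Results>
        <R>
            <META>
                <Social>
                    <v>http://twitter.com/internet</v>
                    <v>http://facebook.com/internet</v>
                </Social>
            </META>
        </R>
    </Results>
</List1>"""

root = ET.fromstring(XML)

# One inner list per <Social> element (same shape as socialLinks above).
nested_links = [[v.text for v in soc] for soc in root.iter("Social")]

# Flattened: every <v> under every <Social>, in document order.
flat_links = [v.text for soc in root.iter("Social") for v in soc.iter("v")]

print(nested_links)  # [['http://twitter.com/internet', 'http://facebook.com/internet']]
print(flat_links)    # ['http://twitter.com/internet', 'http://facebook.com/internet']
```

The nested form keeps track of which <Social> block each link came from; the flat form drops that grouping.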

Upvotes: 2

Jean-François Fabre

Reputation: 140148

findall returns a list of xml.etree.ElementTree.Element objects. In your case there is only one Results node, so you could use find, which returns the first (here, the only) match.

Once you have it, call find with the .// syntax, which searches anywhere in the subtree rather than only among the direct children of Results.

Once you have found Social, call findall on the v tag and print the text of each match:

import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
result = root.find("Results")

social = result.find(".//Social")

for r in social.findall("v"):
    print(r.text)

This prints:

http://twitter.com/internet
http://facebook.com/internet

Note that I did not perform any validity checks on the XML file. You should check whether find returns None and handle the error accordingly.
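One possible way to add those checks is sketched below. The helper name extract_social_links and the choice to return an empty list when something is missing are assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET


def extract_social_links(xml_text):
    """Return the text of each <v> under the first <Social> inside <Results>.

    Hypothetical helper: returns an empty list when <Results> or <Social>
    is missing instead of raising AttributeError on None.
    """
    root = ET.fromstring(xml_text)

    result = root.find("Results")
    # Compare with None explicitly: an Element without children can
    # evaluate as false, so "if not result" would be misleading.
    if result is None:
        return []

    social = result.find(".//Social")
    if social is None:
        return []

    return [v.text for v in social.findall("v")]


ok = "<List1><Results><R><META><Social><v>a</v><v>b</v></Social></META></R></Results></List1>"
print(extract_social_links(ok))                           # ['a', 'b']
print(extract_social_links("<List1><Results/></List1>"))  # []
print(extract_social_links("<List1/>"))                   # []
```

Whether to return an empty list, raise an exception, or log a warning depends on how the caller should treat incomplete XML.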

Although I am not very confident with the XML format myself, I learned everything I know about parsing it by following this lxml tutorial.

Upvotes: 2
