Python XML parser

Question

I need to parse a complex XML document. I already know how to extract some of the important tags.

XML data


<staff>
    <task>Director</task>
    <person id="103045">Yōjirō Arai</person>
</staff>

XML full data







Taifū no Noruda
台風のノルダ

On a certain isolated island, at a certain middle school, on the eve of the culture festival, Shūichi Azuma quits baseball after playing his whole life. He has a fight with his best friend Kenta Saijō. Then they suddenly meet a mysterious, red-eyed girl named Noruda, and a huge typhoon hits the middle school.

2015-06-05
2015-06-05 (Japan)

"Arashi no Ato de" (嵐のあとで; After the Storm) by Galileo Galilei

「台風のノルダ」公式サイト

Studio Colorido Unveils Typhoon Noruda Anime Film


Studio Colorido's Taifū no Noruda Film Unveils Cast, More Staff, Theme Song Band


Director
Yōjirō Arai


Music
Masashi Hamauzu


Character Design
Hiroyasu Ishida


Art Director
Mika Nishimura


Animation Director
Hiroyasu Ishida


Sound Director
Satoshi Motoyama


Cgi Director
Norihiko Miyoshi


Director of Photography
Mitsuhiro Sato


Shūichi Azuma
Shūhei Nomura


Kenta Saijō
Daichi Kaneko


Noruda
Kaya Kiyohara


Animation Production
Studio Colorido

Python code

#!/usr/bin/env python3

# Import xml parser.
import xml.etree.ElementTree as ElementTree

# Import url library.
from urllib.request import urlopen

# Import sys library.
import sys

# XML to parse.
sampleUrl = "http://cdn.animenewsnetwork.com/encyclopedia/api.xml?anime="

# Get the number of params we have in our application.
params = len (sys.argv)

# Check the number of params we have.
if (params == 1):
    print ("We need at least 1 anime identifier.")
else:
    for aid in range (1, params):
        # Read the xml as a file.
        content = urlopen (sampleUrl + sys.argv[aid])

        # Read the response body and decode it; the XML text is stored here.
        xmlData = content.read().decode('utf-8')

        # Close the file.
        content.close()

        # Start parsing XML.
        root = ElementTree.fromstring (xmlData)

        # Extract classic data.
        for info in root.iter("anime"):
            print ("Id: " + info.get("id"))
            print ("Gid: " + info.get("gid"))
            print ("Name: " + info.get("name"))
            print ("Precision: " + info.get("precision"))
            print ("Type: " + info.get("type"))

        # Extract date and general poster.
        for info in root.iter ("info"):
            if ("Vintage" in info.get("type")):
                print ("Date: " + info.text)

            if ("Picture" in info.get("type")):
                print ("Poster: " + info.get("src"))

        # Extract additional posters.
        for img in root.iter ("img"):
            print ("Poster: " + img.get("src"))

        print ("")

        # Extract all the staff of this anime.
        result = {}
        for staff in root.iter ("staff"):
            # Initialize values.
            task = ""
            value = {}

            # Iterating an element yields its direct children.
            for elem in staff:
                if elem.tag == "task" :
                    task = elem.text
                elif elem.tag == "person" :
                    # The id is an attribute of <person>, not part of its text.
                    tmp = elem.attrib

                    if "id" in tmp:
                        value["id"] = tmp["id"]
                    value["name"] = elem.text
            if task :
                result[task] = value
        print (result)

I'm using xml.etree.ElementTree to parse the whole document, but I'm having trouble parsing this section as a single element. I need to store all of its data in another database as one field.

I need all of this data kept together to do that.

Sample: { "Director" : {"Name": "Yojiro Arai", "id" : "103045"} }

I don't know how to do this with the ElementTree library.

Thanks for the help.

Vivek Sable · Accepted Answer

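Parse the input XML with the xml.etree.ElementTree module.
Iterate over every staff tag from the parsed root with iter() (getiterator() in older Python versions; it was removed in Python 3.9).
Iterate over the child elements of each staff tag (getchildren() in older Python versions; also removed in Python 3.9).
Build a dictionary keyed by the task text, with the person's id and name as the value.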

Demo:

import xml.etree.ElementTree as PARSER

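# Sample in the shape of the question's XML; the <ann> wrapper name is assumed.
data = """
<ann>
    <staff>
        <task>Director</task>
        <person id="103045">ABC</person>
    </staff>
    <staff>
        <task>Director1</task>
        <person id="1030452">XYZ</person>
    </staff>
</ann>
"""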

root = PARSER.fromstring(data)
result = {}
# iter() replaces getiterator(), which was removed in Python 3.9.
for i in root.iter("staff"):
    key = ""
    value = {}
    # Iterating an element yields its children (getchildren() was removed in Python 3.9).
    for j in i:
        if j.tag == "task":
            key = j.text
        elif j.tag == "person":
            tmp = j.attrib
            if "id" in tmp:
                value["id"] = tmp["id"]
            value["name"] = j.text

    if key:
        result[key] = value

print(result)

Output:

{'Director': {'id': '103045', 'name': 'ABC'}, 'Director1': {'id': '1030452', 'name': 'XYZ'}}
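
Since the question asks to store the staff data in another database as one field, one possible follow-up is to serialize the resulting dictionary to a JSON string. The sketch below is only an illustration, not part of the original answer. It assumes a wrapper element called `<ann>` and a second `<staff>` entry without an `id` attribute (both made up for the example). The helper name `staff_to_dict` is hypothetical. It uses `findtext()` and `find()` instead of looping over the children by hand.

```python
import json
import xml.etree.ElementTree as ET

# Sample data in the shape described in the question. The <ann> wrapper
# name and the missing id on the second <person> are assumptions made
# only for this example.
SAMPLE = """
<ann>
    <staff>
        <task>Director</task>
        <person id="103045">Yōjirō Arai</person>
    </staff>
    <staff>
        <task>Music</task>
        <person>Masashi Hamauzu</person>
    </staff>
</ann>
"""


def staff_to_dict(root):
    """Map each <task> text to {"name": ..., "id": ...} taken from its <person>."""
    result = {}
    for staff in root.iter("staff"):
        task = staff.findtext("task")
        person = staff.find("person")
        if not task or person is None:
            continue  # skip incomplete <staff> entries
        entry = {"name": person.text}
        if "id" in person.attrib:
            entry["id"] = person.attrib["id"]
        result[task] = entry
    return result


staff = staff_to_dict(ET.fromstring(SAMPLE))

# Serialize to a single string so it fits in one database field.
# ensure_ascii=False keeps names such as "Yōjirō" readable.
staff_json = json.dumps(staff, ensure_ascii=False)
print(staff_json)
```

Note that a dictionary keyed by task keeps only the last person when the same task appears in more than one staff element. If that can happen in your data, storing a list of people per task would be a safer shape.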
