Ismael Moral

Reputation: 732

Python XML parser

I have a complex XML document that I need to parse. I already know how to extract some of the important tags.

XML data

<staff gid="2027930674">
    <task>Director</task>
    <person id="103045">Yōjirō Arai</person>
</staff>

XML full data

<ann>
<anime id="16989" gid="1524403706" type="movie" name="Taifū no Noruda" precision="movie" generated-on="2015-04-27T08:05:39Z">
<info gid="1917137337" type="Picture" src="http://cdn.animenewsnetwork.com/thumbnails/fit200x200/encyc/A16989-1917137337.1429892764.jpg" width="141" height="200">
<img src="http://cdn.animenewsnetwork.com/thumbnails/hotlink-fit200x200/encyc/A16989-1917137337.1429892764.jpg" width="141" height="200"/>
<img src="http://cdn.animenewsnetwork.com/thumbnails/hotlink-max500x600/encyc/A16989-1917137337.1429892764.jpg" width="353" height="500"/>
</info>
<info gid="1994323462" type="Main title" lang="JA">Taifū no Noruda</info>
<info gid="1715491679" type="Alternative title" lang="JA">台風のノルダ</info>
<info gid="898837990" type="Plot Summary">
On a certain isolated island, at a certain middle school, on the eve of the culture festival, Shūichi Azuma quits baseball after playing his whole life. He has a fight with his best friend Kenta Saijō. Then they suddenly meet a mysterious, red-eyed girl named Noruda, and a huge typhoon hits the middle school.
</info>
<info type="Vintage">2015-06-05</info>
<info gid="2492283870" type="Premiere date">2015-06-05 (Japan)</info>
<info gid="2453949568" type="Ending Theme">
"Arashi no Ato de" (嵐のあとで; After the Storm) by Galileo Galilei
</info>
<info gid="3199882585" type="Official website" lang="JA" href="http://typhoon-noruda.com/">「台風のノルダ」公式サイト</info>
<news datetime="2015-04-09T17:20:00Z" href="http://www.animenewsnetwork.com:/news/2015-04-09/studio-colorido-unveils-typhoon-noruda-anime-film/.86937">
Studio Colorido Unveils <cite>Typhoon Noruda</cite> Anime Film
</news>
<news datetime="2015-04-24T08:00:00Z" href="http://www.animenewsnetwork.com:/news/2015-04-24/studio-colorido-taifu-no-noruda-film-unveils-cast-more-staff-theme-song-band/.87470">
Studio Colorido's <i>Taifū no Noruda</i> Film Unveils Cast, More Staff, Theme Song Band
</news>
<staff gid="2027930674">
<task>Director</task>
<person id="103045">Yōjirō Arai</person>
</staff>
<staff gid="3870106504">
<task>Music</task>
<person id="110581">Masashi Hamauzu</person>
</staff>
<staff gid="2732633345">
<task>Character Design</task>
<person id="135767">Hiroyasu Ishida</person>
</staff>
<staff gid="1532205853">
<task>Art Director</task>
<person id="52564">Mika Nishimura</person>
</staff>
<staff gid="1006708772">
<task>Animation Director</task>
<person id="135767">Hiroyasu Ishida</person>
</staff>
<staff gid="934584477">
<task>Sound Director</task>
<person id="8849">Satoshi Motoyama</person>
</staff>
<staff gid="1138447906">
<task>Cgi Director</task>
<person id="42135">Norihiko Miyoshi</person>
</staff>
<staff gid="3178797981">
<task>Director of Photography</task>
<person id="24382">Mitsuhiro Sato</person>
</staff>
<cast gid="2645091588" lang="JA">
<role>Shūichi Azuma</role>
<person id="135769">Shūhei Nomura</person>
</cast>
<cast gid="2397297323" lang="JA">
<role>Kenta Saijō</role>
<person id="135770">Daichi Kaneko</person>
</cast>
<cast gid="2417172290" lang="JA">
<role>Noruda</role>
<person id="135771">Kaya Kiyohara</person>
</cast>
<credit gid="2574178211">
<task>Animation Production</task>
<company id="13518">Studio Colorido</company>
</credit>
</anime>
</ann>

Python code

#!/usr/bin/env python3

# Import xml parser.
import xml.etree.ElementTree as ElementTree

# Import url library.
from urllib.request import urlopen

# Import sys library.
import sys

# XML to parse.
sampleUrl = "http://cdn.animenewsnetwork.com/encyclopedia/api.xml?anime="

# Get the number of params we have in our application.
params = len (sys.argv)

# Check the number of params we have.
if (params == 1):
    print ("We need at least 1 anime identifier.")
else:
    for aid in range (1, params):
        # Read the xml as a file.
        content = urlopen (sampleUrl + sys.argv[aid])

        # XML content is stored here to start working on it.
        xmlData = content.read().decode('utf-8')

        # Close the file.
        content.close()

        # Start parsing XML.
        root = ElementTree.fromstring (xmlData)

        # Extract classic data.
        for info in root.iter("anime"):
            print ("Id: " + info.get("id"))
            print ("Gid: " + info.get("gid"))
            print ("Name: " + info.get("name"))
            print ("Precision: " + info.get("precision"))
            print ("Type: " + info.get("type"))

        # Extract date and general poster.
        for info in root.iter ("info"):
            if ("Vintage" in info.get("type")):
                print ("Date: " + info.text)

            if ("Picture" in info.get("type")):
                print ("Poster: " + info.get("src"))

        # Extract additional posters.
        for img in root.iter ("img"):
            print ("Poster: " + img.get("src"))

        print ("")

        # Extract all the staff of this anime.
        result = {}
        for staff in root.iter ("staff"):
            # Initialize values.
            task = ""
            value = {}

            # Iterating an element yields its direct children.
            for elem in staff:
                if elem.tag == "task" :
                    task = elem.text
                elif elem.tag == "person" :
                    # Attributes live in elem.attrib; elem.text is only the name.
                    tmp = elem.attrib

                    if "id" in tmp:
                        value["id"] = tmp["id"]
                    value["name"] = elem.text
            if task :
                result[task] = value
        print (result)

I'm using xml.etree.ElementTree to parse the entire XML, but I have trouble parsing each staff section as a single element. I need to store all of its data in another database as one field.

To do that, I need the task and the person data grouped together.

Sample: { "Director" : {"Name": "Yojiro Arai", "id" : "103045"} }

I don't know how to do this with the ElementTree library.

Thanks for the help.

Upvotes: 1

Views: 934

Answers (1)

Vivek Sable

Reputation: 10223

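  1. Parse the input XML with the xml.etree.ElementTree module.
  2. Iterate over every staff element under the parsed root with iter() (getiterator() in older Python versions; it was removed in Python 3.9).
  3. Iterate over the child elements of each staff element, either by looping over the element directly or with getchildren() in older Python versions (also removed in Python 3.9).
  4. Build the dictionary, using the task text as the key and the person's id attribute and name as the value.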

Demo:

import xml.etree.ElementTree as PARSER

data = """
<xml>
    <staff gid="2027930674">
        <task>Director</task>
        <person id="103045">ABC</person>
    </staff>
    <staff gid="2027930674">
        <task>Director1</task>
        <person id="1030452">XYZ</person>
    </staff>
</xml>    
    """

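root = PARSER.fromstring(data)
result = {}
# Element.iter() replaces getiterator(), which was removed in Python 3.9.
for staff in root.iter("staff"):
    key = ""
    value = {}
    # Looping over an element yields its children (getchildren() was removed in Python 3.9).
    for child in staff:
        if child.tag == "task":
            key = child.text
        elif child.tag == "person":
            attrs = child.attrib
            if "id" in attrs:
                value["id"] = attrs["id"]
            value["name"] = child.text

    # Only store entries that actually have a task name.
    if key:
        result[key] = value

print(result)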

Output:

{'Director': {'id': '103045', 'name': 'ABC'}, 'Director1': {'id': '1030452', 'name': 'XYZ'}}
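
Since the question mentions storing the result in another database as one field, the same idea can be sketched with find()/findtext() and then serialized to a JSON string. This is only a minimal sketch: the helper names staff_to_dict and staff_to_json and the SAMPLE snippet are assumptions made for illustration, and JSON is just one possible format for a single database field.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample, cut down from the XML in the question.
SAMPLE = """
<anime>
    <staff gid="2027930674">
        <task>Director</task>
        <person id="103045">Yōjirō Arai</person>
    </staff>
    <staff gid="1006708772">
        <task>Animation Director</task>
        <person id="135767">Hiroyasu Ishida</person>
    </staff>
</anime>
"""


def staff_to_dict(root):
    """Map each <task> text to {"id": ..., "name": ...} taken from <person>."""
    result = {}
    for staff in root.iter("staff"):
        task = staff.findtext("task")    # text of the first <task> child, or None
        person = staff.find("person")    # first <person> child, or None
        if not task or person is None:
            continue                     # skip incomplete <staff> entries
        result[task] = {"id": person.get("id"), "name": person.text}
    return result


def staff_to_json(root):
    """Serialize the staff mapping to one string, e.g. for a single text column."""
    # ensure_ascii=False keeps characters such as "ō" readable in the stored string.
    return json.dumps(staff_to_dict(root), ensure_ascii=False)


root = ET.fromstring(SAMPLE)
print(staff_to_json(root))
```

Note that a dictionary keyed by task keeps only the last person when the same task appears more than once; if that can happen in the real data, storing a list of people per task would be safer.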

Upvotes: 6
