Blue-Banana

Reputation: 13

Python: parse xml file built up with dicts

[Python 3.4][Windows 7]

If there is an easy way to read a whole .xml file as one string, like a .txt file, that would be enough. But to describe the problem precisely:

This is the first time I have dealt with a .xml file. I have a .xml file that mainly contains dictionaries (of further dictionaries). I want to get certain keys and values out of these dictionaries and write them to a .txt file, so a dict (or something similar) in Python would be enough.

For example:

This is the xml file (library.xml):

<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Version<\key><integer>1</integer>
    <key>Tracks</key>
    <dict>
        <key>0001</key>
        <dict>
            <key>Name</key><string>spam</string>
            <key>Detail</key><string>spam spam</string>
        </dict>
        <key>0002</key>
        <dict>
            <key>Name</key><string>ham</string>
            <key>Detail</key><string>ham ham</string>
        </dict>
    </dict>
</dict>
</plist>

I did some research and thought I could do it with the xml.etree.ElementTree module. So I tried this:

import xml.etree.ElementTree as ET

tree = ET.parse('library.xml')
root = tree.getroot()

I only get this message:

(Unicode Error) 'unicodeescape' codec can't decode bytes…

What I want is something like this (or the same data as a dict, it doesn't matter):

[['Name: spam', 'Detail: spam spam'], ['Name: ham', 'Detail: ham ham']]

EDIT: The XML code was incorrect, sorry.
EDIT: Added the last paragraph.

Upvotes: 0

Views: 1296

Answers (3)

Blue-Banana

Reputation: 13

I just wanted to let you know that I solved it this way:

with open('library.xml',
          'r', encoding='UTF-8') as file:
    text = file.read()

(and some regular expressions to get the dicts I want)

This is probably very inefficient, since I read the complete file as text, but I don't care about efficiency here because the function is called only once in my program ;)
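The regular expression itself is not shown above, so the following is only a sketch of what such an extraction might look like. The pattern, the names TRACK_PATTERN and rows, and the inline sample text (standing in for the contents of library.xml) are all assumptions for illustration.

```python
import re

# Stand-in for the text read from library.xml (sample data from the question).
text = """
<dict>
    <key>0001</key>
    <dict>
        <key>Name</key><string>spam</string>
        <key>Detail</key><string>spam spam</string>
    </dict>
    <key>0002</key>
    <dict>
        <key>Name</key><string>ham</string>
        <key>Detail</key><string>ham ham</string>
    </dict>
</dict>
"""

# Hypothetical pattern: a Name entry directly followed by a Detail entry.
TRACK_PATTERN = re.compile(
    r"<key>Name</key>\s*<string>(.*?)</string>\s*"
    r"<key>Detail</key>\s*<string>(.*?)</string>"
)

# Build one ['Name: ...', 'Detail: ...'] pair per match.
rows = [
    ['Name: ' + name, 'Detail: ' + detail]
    for name, detail in TRACK_PATTERN.findall(text)
]
print(rows)
```

A pattern like this depends on the exact order and layout of the keys, so it would need adjusting if the real file lists the entries differently.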

Upvotes: 0

chthonicdaemon

Reputation: 19760

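The Python standard library contains a module that reads plist files: plistlib. Once the <\key> typo in library.xml is corrected to </key>, you can solve your problem with an import and a single call. On Python 3.4 and later, use plistlib.load (readPlist is deprecated since 3.4 and was removed in 3.9):

import plistlib

with open('library.xml', 'rb') as f:
    print(plistlib.load(f))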

Output:

{'Tracks': {'0001': {'Detail': 'spam spam', 'Name': 'spam'},
  '0002': {'Detail': 'ham ham', 'Name': 'ham'}},
 'Version': 1}
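If the plist data is already in memory, plistlib.loads (available since Python 3.4) accepts it as bytes. The following is a minimal sketch using an inline copy of the corrected sample file; the names PLIST_BYTES and names_by_track are assumptions made for illustration.

```python
import plistlib

# Inline copy of the corrected library.xml from the question, as bytes.
PLIST_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Version</key><integer>1</integer>
    <key>Tracks</key>
    <dict>
        <key>0001</key>
        <dict>
            <key>Name</key><string>spam</string>
            <key>Detail</key><string>spam spam</string>
        </dict>
        <key>0002</key>
        <dict>
            <key>Name</key><string>ham</string>
            <key>Detail</key><string>ham ham</string>
        </dict>
    </dict>
</dict>
</plist>
"""

library = plistlib.loads(PLIST_BYTES)

# Pick out just the values of interest: track id -> track name.
names_by_track = {
    track_id: track['Name']
    for track_id, track in library['Tracks'].items()
}
print(names_by_track)
```

Since the result is an ordinary nested dict, any further filtering can be done with normal dictionary access.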

Upvotes: 1

Vivek Sable

Reputation: 10213

I updated the input from <\key> to </key> and removed a dict tag because no key was defined for it.

  1. Parse the XML data with the lxml.html module.
  2. Get the main target dict tag with the xpath() method.
  3. Call the XMLtoDict() function.
  4. Iterate over the children of the input tag with the getchildren() method and a for loop.
  5. Check whether the tag name is key with an if statement.
  6. If it is, get the tag that follows the current tag with the getnext() method.
  7. If the next tag is an integer tag, the value type is int.
  8. If the next tag is a string tag, the value type is string.
  9. If the next tag is a dict tag, the value type is dict, so call the function again (a recursive call).
  10. Add the key and value to the result dictionary.
  11. Return the result dictionary.
  12. Print the result dictionary.

code:

data = """<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
    <dict>
        <key>Version</key>
        <integer>1</integer>
        <key>Tracks</key>
        <dict>
            <key>0001</key>
            <dict>
                <key>Name</key><string>spam</string>
                <key>Detail</key><string>spam spam</string>
            </dict>
            <key>0002</key>
            <dict>
                <key>Name</key><string>ham</string>
                <key>Detail</key><string>ham ham</string>
            </dict>
        </dict>
    </dict>
</plist>
"""

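import pprint

import lxml.html as ET


def XMLtoDict(root):
    """Convert a plist <dict> element into a nested Python dict."""
    result = {}
    for i in root.getchildren():
        if i.tag == "key":
            key = i.text
            # The value is the element that directly follows the <key>.
            next_tag = i.getnext()
            next_tag_name = next_tag.tag
            if next_tag_name == "integer":
                value = int(next_tag.text)
            elif next_tag_name == "string":
                value = next_tag.text
            elif next_tag_name == "dict":
                # Nested dictionary: recurse.
                value = XMLtoDict(next_tag)
            else:
                value = None
            result[key] = value

    return result


# Pass bytes: lxml can reject unicode strings that carry an encoding declaration.
root = ET.fromstring(data.encode("utf-8"))
result = XMLtoDict(root.xpath("//plist/dict")[0])
pprint.pprint(result)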

Output:

vivek@vivek:~/Desktop/stackoverflow$ python 12.py 
{'Tracks': {'0001': {'Detail': 'spam spam', 'Name': 'spam'},
            '0002': {'Detail': 'ham ham', 'Name': 'ham'}},
 'Version': 1}

  1. I cannot reproduce this exception:

    (Unicode Error) 'unicodeescape' codec can't decode bytes…
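One possible cause, which is only a guess because the question shows just 'library.xml': on Windows with Python 3, a full path written as a normal string literal, such as 'C:\Users\...', fails to compile because \U starts a unicode escape. The sketch below reproduces that message with a hypothetical path and shows two ways to write the path safely. The path and all variable names are assumptions.

```python
# Hypothetical source line with a Windows path in a normal string literal.
# In Python 3, \U begins a unicode escape, so this does not compile.
bad_source = r"path = 'C:\Users\me\library.xml'"

try:
    compile(bad_source, '<example>', 'exec')
    error_message = None
except SyntaxError as exc:
    error_message = str(exc)

print(error_message)

# Two safe ways to write the same (hypothetical) path:
raw_path = r'C:\Users\me\library.xml'     # raw string: backslashes stay literal
forward_path = 'C:/Users/me/library.xml'  # forward slashes also work on Windows
```

If the real code used a path like this, switching to a raw string or forward slashes would remove the error before ET.parse is ever reached.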

  2. The tagging in library.xml is not correct. Running

    import xml.etree.ElementTree as ET
    tree = ET.parse('library.xml')

on the original input raises the following exception:

vivek@vivek:~/Desktop/stackoverflow$ python 12.py 
Traceback (most recent call last):
  File "12.py", line 46, in <module>
    tree = ET.parse('library.xml')
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 4, column 15

This exception is caused by the invalid closing tag. To fix it, change <key>Version<\key> to <key>Version</key>.

  3. Using the xml.etree.ElementTree module:

code:

def XMLtoDict(root):
    """Convert a plist <dict> element into a nested Python dict."""
    result = {}
    # list(root) gives the child elements; Element.getchildren() was removed in Python 3.9.
    children_tags = list(root)
    for j, i in enumerate(children_tags):
        if i.tag == "key":
            key = i.text
            # The value is the element that directly follows the <key>.
            next_tag = children_tags[j + 1]
            next_tag_name = next_tag.tag
            if next_tag_name == "integer":
                value = int(next_tag.text)
            elif next_tag_name == "string":
                value = next_tag.text
            elif next_tag_name == "dict":
                value = XMLtoDict(next_tag)
            else:
                value = None
            result[key] = value

    return result


def XMLtoList(root):
    """Convert a plist <dict> element into nested [key, value] lists."""
    result = []
    children_tags = list(root)
    for j, i in enumerate(children_tags):
        if i.tag == "key":
            key = i.text
            next_tag = children_tags[j + 1]
            next_tag_name = next_tag.tag
            if next_tag_name == "integer":
                value = int(next_tag.text)
            elif next_tag_name == "string":
                value = next_tag.text
            elif next_tag_name == "dict":
                value = XMLtoList(next_tag)
            else:
                value = None
            result.append([key, value])

    return result


import xml.etree.ElementTree as ET
import pprint

tree = ET.parse('library.xml')
root = tree.getroot()

dict_tag = root.find("dict")
if dict_tag is not None:
    result = XMLtoDict(dict_tag)
    print("Result in Dictionary:-")
    pprint.pprint(result)

    result = XMLtoList(dict_tag)
    print("\nResult in List:-")
    pprint.pprint(result)

Output:

vivek@vivek:~/Desktop/stackoverflow$ python 12.py
Result in Dictionary:-
{'Tracks': {'0001': {'Detail': 'spam spam', 'Name': 'spam'},
            '0002': {'Detail': 'ham ham', 'Name': 'ham'}},
 'Version': 1}

Result in List:-
[['Version', 1],
 ['Tracks',
  [['0001', [['Name', 'spam'], ['Detail', 'spam spam']]],
   ['0002', [['Name', 'ham'], ['Detail', 'ham ham']]]]]]
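To get from the nested dictionary to the exact list format requested in the question, something like the following sketch could work. It starts from the dictionary printed above; the helper name track_rows, its fields parameter, and the lines variable are assumptions for illustration.

```python
# Nested dictionary as printed above (the XMLtoDict result).
result = {
    'Tracks': {
        '0001': {'Detail': 'spam spam', 'Name': 'spam'},
        '0002': {'Detail': 'ham ham', 'Name': 'ham'},
    },
    'Version': 1,
}


def track_rows(library, fields=('Name', 'Detail')):
    """Return one ['Field: value', ...] list per track, ordered by track id."""
    rows = []
    for track_id in sorted(library['Tracks']):
        track = library['Tracks'][track_id]
        rows.append(['{}: {}'.format(field, track[field]) for field in fields])
    return rows


rows = track_rows(result)
print(rows)

# One text line per track, ready to be written to a .txt file.
lines = [', '.join(row) for row in rows]
print(lines)
```

Sorting by track id keeps the order stable, since plain dictionaries do not guarantee ordering on older Python versions.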

Upvotes: 0
