Reputation: 21
I have an XML string with the following DOCTYPE syntax. How do I parse it? I need to get each of the filenames given in the SYSTEM declarations of the entities.
'''<xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE config SYSTEM "ncfg_config.dtd"
[
<!ENTITY vlan_map_type SYSTEM "types/a.xml">
<!ENTITY oui_type SYSTEM "types/b.xml">
<!ENTITY provisioning_profile SYSTEM "c.xml">
<!ENTITY vlan_name_or_list SYSTEM "types/d.xml">
<!ENTITY vlan_name_or_num SYSTEM "types/e.xml">
<!ENTITY interface_list SYSTEM "types/f.xml">
<!ENTITY mac_limit_type SYSTEM "types/g.xml">
]>'''
Upvotes: 0
Views: 345
Reputation: 4487
If the input always follows the format in your example, a regex is the simplest approach:
import re
xml = '''<xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE config SYSTEM "ncfg_config.dtd"
[
<!ENTITY vlan_map_type SYSTEM "types/a.xml">
<!ENTITY oui_type SYSTEM "types/b.xml">
<!ENTITY provisioning_profile SYSTEM "c.xml">
<!ENTITY vlan_name_or_list SYSTEM "types/d.xml">
<!ENTITY vlan_name_or_num SYSTEM "types/e.xml">
<!ENTITY interface_list SYSTEM "types/f.xml">
<!ENTITY mac_limit_type SYSTEM "types/g.xml">
]>'''
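# Capture the quoted system identifier of every <!ENTITY name SYSTEM "..."> declaration.
# \S+ matches the entity name only, so two declarations on one line are both found.
file_names = re.findall(r'<!ENTITY\s+\S+\s+SYSTEM\s+"(.*?)">', xml)
for name in file_names:
    print(name)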
Output:
types/a.xml
types/b.xml
c.xml
types/d.xml
types/e.xml
types/f.xml
types/g.xml
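If you also need to know which entity each file belongs to, the same regex idea can be extended a little. The following is only a sketch, under the same assumption that the input looks like your example: `system_entities` and `ENTITY_RE` are names I made up for illustration, and the pattern additionally accepts single-quoted identifiers and extra whitespace. It is not a real DTD parser and will not handle comments, parameter entities, or PUBLIC identifiers.

```python
import re

# Hypothetical helper names (not from the original answer).
# Matches: <!ENTITY name SYSTEM "path"> or <!ENTITY name SYSTEM 'path'>
ENTITY_RE = re.compile(
    r'<!ENTITY\s+(?P<name>[^\s%]+)\s+SYSTEM\s+'
    r'(?P<quote>["\'])(?P<path>.*?)(?P=quote)\s*>'
)

def system_entities(doctype_text):
    """Return a dict mapping each entity name to its SYSTEM file name."""
    return {m.group('name'): m.group('path')
            for m in ENTITY_RE.finditer(doctype_text)}

# Shortened sample based on the question; the second entry uses single
# quotes to show that both quoting styles are accepted.
sample = '''<!DOCTYPE config SYSTEM "ncfg_config.dtd"
[
<!ENTITY vlan_map_type SYSTEM "types/a.xml">
<!ENTITY provisioning_profile SYSTEM 'c.xml'>
]>'''

entities = system_entities(sample)
for name, path in entities.items():
    print(name, path)
```

As in the original regex, the DOCTYPE's own system identifier (ncfg_config.dtd) is deliberately not included, because the pattern only looks at `<!ENTITY ...>` declarations.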
Upvotes: 1