Reputation: 48490
I have many rows in XML and I'm trying to get instances of a particular node attribute.
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
How do I access the values of the attribute foobar
? In this example, I want "1"
and "2"
.
Upvotes: 1165
Views: 1581670
Reputation: 3581
Use pandas. Pandas have a function read_xml()
, what is perfect for such flat XML structures.
import pandas as pd
xml = """<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>"""
df = pd.read_xml(xml, xpath=".//type")
print(df)
Output:
foobar
0 1
1 2
Upvotes: 2
Reputation: 331
simplified_scrapy
: a new library. I fell in love with it after I used it. I recommend it to you.
from simplified_scrapy import SimplifiedDoc
xml = '''
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
'''
doc = SimplifiedDoc(xml)
types = doc.selects('bar>type')
print (len(types)) # 2
print (types.foobar) # ['1', '2']
print (doc.selects('bar>type>foobar()')) # ['1', '2']
Here are more examples. This library is easy to use.
Upvotes: 4
Reputation: 2313
If you don't want to use any external libraries or third-party tools, please try the below code.
dictionary
<tag/>
and tags with only attributes like <tag var=val/>
import re
def getdict(content):
res=re.findall("<(?P<var>\S*)(?P<attr>[^/>]*)(?:(?:>(?P<val>.*?)</(?P=var)>)|(?:/>))",content)
if len(res)>=1:
attreg="(?P<avr>\S+?)(?:(?:=(?P<quote>['\"])(?P<avl>.*?)(?P=quote))|(?:=(?P<avl1>.*?)(?:\s|$))|(?P<avl2>[\s]+)|$)"
if len(res)>1:
return [{i[0]:[{"@attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,i[1].strip())]},{"$values":getdict(i[2])}]} for i in res]
else:
return {res[0]:[{"@attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,res[1].strip())]},{"$values":getdict(res[2])}]}
else:
return content
with open("test.xml","r") as f:
print(getdict(f.read().replace('\n','')))
<details class="4b" count=1 boy>
<name type="firstname">John</name>
<age>13</age>
<hobby>Coin collection</hobby>
<hobby>Stamp collection</hobby>
<address>
<country>USA</country>
<state>CA</state>
</address>
</details>
<details empty="True"/>
<details/>
<details class="4a" count=2 girl>
<name type="firstname">Samantha</name>
<age>13</age>
<hobby>Fishing</hobby>
<hobby>Chess</hobby>
<address current="no">
<country>Australia</country>
<state>NSW</state>
</address>
</details>
[
{
"details": [
{
"@attributes": [
{
"class": "4b"
},
{
"count": "1"
},
{
"boy": ""
}
]
},
{
"$values": [
{
"name": [
{
"@attributes": [
{
"type": "firstname"
}
]
},
{
"$values": "John"
}
]
},
{
"age": [
{
"@attributes": []
},
{
"$values": "13"
}
]
},
{
"hobby": [
{
"@attributes": []
},
{
"$values": "Coin collection"
}
]
},
{
"hobby": [
{
"@attributes": []
},
{
"$values": "Stamp collection"
}
]
},
{
"address": [
{
"@attributes": []
},
{
"$values": [
{
"country": [
{
"@attributes": []
},
{
"$values": "USA"
}
]
},
{
"state": [
{
"@attributes": []
},
{
"$values": "CA"
}
]
}
]
}
]
}
]
}
]
},
{
"details": [
{
"@attributes": [
{
"empty": "True"
}
]
},
{
"$values": ""
}
]
},
{
"details": [
{
"@attributes": []
},
{
"$values": ""
}
]
},
{
"details": [
{
"@attributes": [
{
"class": "4a"
},
{
"count": "2"
},
{
"girl": ""
}
]
},
{
"$values": [
{
"name": [
{
"@attributes": [
{
"type": "firstname"
}
]
},
{
"$values": "Samantha"
}
]
},
{
"age": [
{
"@attributes": []
},
{
"$values": "13"
}
]
},
{
"hobby": [
{
"@attributes": []
},
{
"$values": "Fishing"
}
]
},
{
"hobby": [
{
"@attributes": []
},
{
"$values": "Chess"
}
]
},
{
"address": [
{
"@attributes": [
{
"current": "no"
}
]
},
{
"$values": [
{
"country": [
{
"@attributes": []
},
{
"$values": "Australia"
}
]
},
{
"state": [
{
"@attributes": []
},
{
"$values": "NSW"
}
]
}
]
}
]
}
]
}
]
}
]
Upvotes: 1
Reputation: 69
If the source is an XML file, say like this sample,
<pa:Process xmlns:pa="http://sssss">
<pa:firsttag>SAMPLE</pa:firsttag>
</pa:Process>
you may try the following code:
from lxml import etree, objectify
# This is a sample XML file. The contents is shown above
metadata = 'C:\\Users\\PROCS.xml'
# This line removes the name space from the XML content. In
# this sample, the name space is --> http://sssss
parser = etree.XMLParser(remove_blank_text=True)
# This line parses the XML file, which is PROCS.xml
tree = etree.parse(metadata, parser)
# We get the root of XML content which is processed
# and iterated using a 'for' loop
root = tree.getroot()
for elem in root.getiterator():
if not hasattr(elem.tag, 'find'): continue # (1)
i = elem.tag.find('}')
if i >= 0:
elem.tag = elem.tag[i+1:]
dict={} # A Python dictionary is declared
for elem in tree.iter(): # Iterating through the XML tree using a 'for' loop
if elem.tag =="firsttag": # If the tag name matches the name that
# is equated then the text in the tag
# is stored into the dictionary
dict["FIRST_TAG"]=str(elem.text)
print(dict)
The output would be
{'FIRST_TAG': 'SAMPLE'}
Upvotes: 0
Reputation: 1971
There's no need to use a library-specific API if you use python-benedict
. Just initialize a new instance from your XML content and manage it easily since it is a dict
subclass.
Installation is easy: pip install python-benedict
from benedict import benedict as bdict
# 'data-source' can be a URL, a filepath or data-string (as in this example)
data_source = """
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>"""
data = bdict.from_xml(data_source)
t_list = data['foo.bar'] # yes, keypath supported
for t in t_list:
print(t['@foobar'])
It supports and normalizes I/O operations with many formats: Base64, CSV, JSON, TOML, XML, YAML and query string.
It is well tested and open source on GitHub. Disclosure: I am the author.
Upvotes: 11
Reputation: 123889
You can use Beautiful Soup:
from bs4 import BeautifulSoup
x="""<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>"""
y=BeautifulSoup(x)
>>> y.foo.bar.type["foobar"]
u'1'
>>> y.foo.bar.findAll("type")
[<type foobar="1"></type>, <type foobar="2"></type>]
>>> y.foo.bar.findAll("type")[0]["foobar"]
u'1'
>>> y.foo.bar.findAll("type")[1]["foobar"]
u'2'
Upvotes: 272
Reputation: 2428
I suggest xmltodict for simplicity.
It parses your XML to an OrderedDict;
>>> e = '<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo> '
>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result
OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])
>>> result['foo']
OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])
>>> result['foo']['bar']
OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])
Upvotes: 55
Reputation: 1276
There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines
.
The relevant metrics can be found in the table below, copied from the cElementTree website:
library time space
xml.dom.minidom (Python 2.1) 6.3 s 80000K
gnosis.objectify 2.0 s 22000k
xml.dom.minidom (Python 2.4) 1.4 s 53000k
ElementTree 1.2 1.6 s 14500k
ElementTree 1.2.4/1.3 1.1 s 14500k
cDomlette (C extension) 0.540 s 20500k
PyRXPU (C extension) 0.175 s 10850k
libxml2 (C extension) 0.098 s 16000k
readlines (read as utf-8) 0.093 s 8850k
cElementTree (C extension) --> 0.047 s 4900K <--
readlines (read as ascii) 0.032 s 5050k
As pointed out by @jfs, cElementTree
comes bundled with Python:
from xml.etree import cElementTree as ElementTree
.from xml.etree import ElementTree
(the accelerated C version is used automatically).Upvotes: 113
Reputation: 3581
The classic SAX parser solution could work like:
from xml.sax import make_parser, handler
from io import StringIO
xml = """<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
"""
file = StringIO(xml)
class MyParser(handler.ContentHandler):
def startElement(self, name, attrs):
if name == "type":
print(attrs['foobar'])
parser = make_parser()
b = MyParser()
parser.setContentHandler(b)
parser.parse(file)
Output:
1
2
Upvotes: 0
Reputation: 3581
With iterparse() you can catch the tag attribute dictionary value:
import xml.etree.ElementTree as ET
from io import StringIO
xml = """<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
"""
file = StringIO(xml)
for event, elem in ET.iterparse(file, ("end",)):
if event == "end" and elem.tag == "type":
print(elem.attrib["foobar"])
Upvotes: 0
Reputation: 7933
minidom
is the quickest and pretty straight forward.
XML:
<data>
<items>
<item name="item1"></item>
<item name="item2"></item>
<item name="item3"></item>
<item name="item4"></item>
</items>
</data>
Python:
from xml.dom import minidom
dom = minidom.parse('items.xml')
elements = dom.getElementsByTagName('item')
print(f"There are {len(elements)} items:")
for element in elements:
print(element.attributes['name'].value)
Output:
There are 4 items:
item1
item2
item3
item4
Upvotes: 475
Reputation: 882481
I suggest ElementTree
. There are other compatible implementations of the same API, such as lxml
, and cElementTree
in the Python standard library itself; but, in this context, what they chiefly add is even more speed -- the ease of programming part depends on the API, which ElementTree
defines.
First build an Element instance root
from the XML, e.g. with the XML function, or by parsing a file with something like:
import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()
Or any of the many other ways shown at ElementTree
. Then do something like:
for type_tag in root.findall('bar/type'):
value = type_tag.get('foobar')
print(value)
Output:
1
2
Upvotes: 914
Reputation: 22490
These are some pros of the two most used libraries I would have benefit to know before choosing between them.
standalone="no"
? .node
.sourceline
allows to easily get the line of the XML element you are using.Upvotes: 11
Reputation: 69
#If the xml is in the form of a string as shown below then
from lxml import etree, objectify
'''sample xml as a string with a name space {http://xmlns.abc.com}'''
message =b'<?xml version="1.0" encoding="UTF-8"?>\r\n<pa:Process xmlns:pa="http://xmlns.abc.com">\r\n\t<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>\r\n' # this is a sample xml which is a string
print('************message coversion and parsing starts*************')
message=message.decode('utf-8')
message=message.replace('<?xml version="1.0" encoding="UTF-8"?>\r\n','') #replace is used to remove unwanted strings from the 'message'
message=message.replace('pa:Process>\r\n','pa:Process>')
print (message)
print ('******Parsing starts*************')
parser = etree.XMLParser(remove_blank_text=True) #the name space is removed here
root = etree.fromstring(message, parser) #parsing of xml happens here
print ('******Parsing completed************')
dict={}
for child in root: # parsed xml is iterated using a for loop and values are stored in a dictionary
print(child.tag,child.text)
print('****Derving from xml tree*****')
if child.tag =="{http://xmlns.abc.com}firsttag":
dict["FIRST_TAG"]=child.text
print(dict)
### output
'''************message coversion and parsing starts*************
<pa:Process xmlns:pa="http://xmlns.abc.com">
<pa:firsttag>SAMPLE</pa:firsttag></pa:Process>
******Parsing starts*************
******Parsing completed************
{http://xmlns.abc.com}firsttag SAMPLE
****Derving from xml tree*****
{'FIRST_TAG': 'SAMPLE'}'''
Upvotes: 1
Reputation: 353
XML:
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
Python code:
import xml.etree.cElementTree as ET
tree = ET.parse("foo.xml")
root = tree.getroot()
root_tag = root.tag
print(root_tag)
for form in root.findall("./bar/type"):
x=(form.attrib)
z=list(x)
for i in z:
print(x[i])
Output:
foo
1
2
Upvotes: 11
Reputation: 703
import xml.etree.ElementTree as ET
data = '''<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>'''
tree = ET.fromstring(data)
lst = tree.findall('bar/type')
for item in lst:
print item.get('foobar')
This will print the value of the foobar
attribute.
Upvotes: 9
Reputation: 169
Here a very simple but effective code using cElementTree
.
try:
import cElementTree as ET
except ImportError:
try:
# Python 2.5 need to import a different module
import xml.etree.cElementTree as ET
except ImportError:
exit_err("Failed to import cElementTree from any known place")
def find_in_tree(tree, node):
found = tree.find(node)
if found == None:
print "No %s in file" % node
found = []
return found
# Parse a xml file (specify the path)
def_file = "xml_file_name.xml"
try:
dom = ET.parse(open(def_file, "r"))
root = dom.getroot()
except:
exit_err("Unable to open and parse input definition file: " + def_file)
# Parse to find the child nodes list of node 'myNode'
fwdefs = find_in_tree(root,"myNode")
This is from "python xml parse".
Upvotes: 11
Reputation: 738
Just to add another possibility, you can use untangle, as it is a simple xml-to-python-object library. Here you have an example:
Installation:
pip install untangle
Usage:
Your XML file (a little bit changed):
<foo>
<bar name="bar_name">
<type foobar="1"/>
</bar>
</foo>
Accessing the attributes with untangle
:
import untangle
obj = untangle.parse('/path_to_xml_file/file.xml')
print obj.foo.bar['name']
print obj.foo.bar.type['foobar']
The output will be:
bar_name
1
More information about untangle can be found in "untangle".
Also, if you are curious, you can find a list of tools for working with XML and Python in "Python and XML". You will also see that the most common ones were mentioned by previous answers.
Upvotes: 19
Reputation: 33779
Python has an interface to the expat XML parser.
xml.parsers.expat
It's a non-validating parser, so bad XML will not be caught. But if you know your file is correct, then this is pretty good, and you'll probably get the exact info you want and you can discard the rest on the fly.
stringofxml = """<foo>
<bar>
<type arg="value" />
<type arg="value" />
<type arg="value" />
</bar>
<bar>
<type arg="value" />
</bar>
</foo>"""
count = 0
def start(name, attr):
global count
if name == 'type':
count += 1
p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml)
print count # prints 4
Upvotes: 21
Reputation: 561
I might suggest declxml.
Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.
With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used to for both serialization and parsing as well as for a basic level of validation.
Parsing into Python data structures is straightforward:
import declxml as xml
xml_string = """
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
"""
processor = xml.dictionary('foo', [
xml.dictionary('bar', [
xml.array(xml.integer('type', attribute='foobar'))
])
])
xml.parse_from_string(processor, xml_string)
Which produces the output:
{'bar': {'foobar': [1, 2]}}
You can also use the same processor to serialize data to XML
data = {'bar': {
'foobar': [7, 3, 21, 16, 11]
}}
xml.serialize_to_string(processor, data, indent=' ')
Which produces the following output
<?xml version="1.0" ?>
<foo>
<bar>
<type foobar="7"/>
<type foobar="3"/>
<type foobar="21"/>
<type foobar="16"/>
<type foobar="11"/>
</bar>
</foo>
If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.
import declxml as xml
class Bar:
def __init__(self):
self.foobars = []
def __repr__(self):
return 'Bar(foobars={})'.format(self.foobars)
xml_string = """
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
"""
processor = xml.dictionary('foo', [
xml.user_object('bar', Bar, [
xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
])
])
xml.parse_from_string(processor, xml_string)
Which produces the following output
{'bar': Bar(foobars=[1, 2])}
Upvotes: 16
Reputation: 14121
lxml.objectify is really simple.
Taking your sample text:
from lxml import objectify
from collections import defaultdict
count = defaultdict(int)
root = objectify.fromstring(text)
for item in root.bar.type:
count[item.attrib.get("foobar")] += 1
print dict(count)
Output:
{'1': 1, '2': 1}
Upvotes: 41