Reputation: 349
I am having difficult parsing the xml _file below using Ixml:
>>_file= "qv.xml"
file content:
<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="128" this_length="2503" source_reference="source-document00500.txt" source_offset="138339" source_length="2503"/>
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="8593" this_length="1582" source_reference="source-document00500.txt" source_offset="49473" source_length="1582"/>
</document>
Here is my attempt:
>>from lxml.etree import XMLParser, parse
>>parsefile = parse(_file)
>>print parsefile
Output: <lxml.etree._ElementTree object at 0x000000000642E788>
The output is the location of the ixml object, while I am after the actual file content ie
Desired output={'document reference'="suspicious-document00500.txt", 'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
Any ideas on how to get the desired output? thanks.
Upvotes: 0
Views: 194
Reputation: 2295
Here's one way of getting the desired outputs:
from lxml import etree
def main():
doc = etree.parse('qv.xml')
root = doc.getroot()
print root.attrib
for item in root:
print item.attrib
if __name__ == "__main__":
main()
Output:
{'reference': 'suspicious-document00500.txt'}
{'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
{'this_offset': '8593', 'obfuscation': 'none', 'source_length': '1582', 'name': 'plagiarism', 'this_length': '1582', 'source_reference': 'source-document00500.txt', 'source_offset': '49473', 'type': 'artificial'}
It works fine with the contents you gave.
You might want to read thisto see how etree
represents xml
objects.
Upvotes: 1