I am trying to parse an XML file located in the same folder as my Python script, but when I run the script it prints nothing in the terminal. I am using ElementTree; here is my code:
import xml.etree.ElementTree
f = xml.etree.ElementTree.parse('atom.xml').getroot()
for atype in f.findall('link'):
    print(atype.get('href'))
This is what I want to get from the XML: the href.
<?xml version='1.0' ?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">Gwern</title>
<id>https://www.gwern.net/</id>
<updated>2017-07-22T14:57:39Z</updated>
<link href="https://www.gwern.net/atom.xml" rel="self" />
<author>
<name>gwern</name>
</author>
<author>
<name>ujdRR</name>
</author>
<generator uri="http://github.com/jgm/gitit" version="HEAD">gitit</generator>
<entry>
<id>https://www.gwern.net/Mail%20delivery? utm_source=RSS&utm_medium=feed&utm_campaign=1</id>
<title type="text">Modified "Mail delivery.page", Modified "Mistakes.page", Modified "Nootropics.page", Modified "Touhou.page", Modified "Wikipedia resume.page", "Zeo.page", Modified "hakyll.hs", Modified "newsletter/2017/06.page", Modified "the-long-stagnation.page", Modified "wittgenstein-thesis.page"</title>
<updated>2017-06-25T04:00:06Z</updated>
<author>
<name>gwern</name>
</author>
<link href="https://www.gwern.net/Mail%20delivery?utm_source=RSS&utm_medium=feed&utm_campaign=1" rel="alternate" />
<summary type="text">record all minor pending edits</summary>
Upvotes: 0
Views: 908
Reputation: 15513
Question: ... what I want to get from the xml the href
Your XML declares a default namespace, <feed xmlns="http://www.w3.org/2005/Atom">, so you have to pass the namespaces parameter to findall and use a prefixed tag name in the path.
Second, the XML has two <link ...> tags, one of them inside an <entry> tag.
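To see why the original loop prints nothing, it can help to look at the tag names ElementTree actually stores. The following is a minimal sketch, assuming a trimmed-down stand-in for atom.xml (the `SAMPLE` string, the shortened entry href, and the `atom` prefix are illustrative choices, not part of the question):

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for atom.xml (assumption: same structure as the
# feed in the question, with the entry href shortened for readability).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Gwern</title>
  <link href="https://www.gwern.net/atom.xml" rel="self" />
  <entry>
    <link href="https://www.gwern.net/Mail%20delivery" rel="alternate" />
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# Every tag is stored with the namespace URI in braces in front of it.
child_tags = [child.tag for child in root]
print(child_tags)

# A bare tag name therefore matches nothing, so the loop body never runs.
unqualified = root.findall('link')
print(len(unqualified))

# A prefix mapped to the namespace URI does match the top-level <link>.
ns = {'atom': 'http://www.w3.org/2005/Atom'}
qualified_hrefs = [link.get('href') for link in root.findall('atom:link', ns)]
print(qualified_hrefs)
```

The prefix name in the mapping is arbitrary; only the URI has to match the one declared in the document.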
findall(self, path, namespaces=None)
Finds all elements matching the ElementPath expression. Same as getroot().findall(path).
The optional namespaces argument accepts a prefix-to-namespace mapping that allows the usage of XPath prefixes in the path expression.
import xml.etree.ElementTree as ET

tree = ET.parse('atom.xml')
root = tree.getroot()
# Map a prefix to the feed's namespace URI; the prefix name itself is arbitrary
namespaces = {'xmlns': 'http://www.w3.org/2005/Atom'}

# Get the first <link ...>, the one outside <entry>
link = root.findall('./xmlns:link', namespaces)[0]
print('link:{} {}'.format(link, link.get('href')))

# Find all <link ...> inside <entry>
for link in root.findall('./xmlns:entry/xmlns:link', namespaces):
    print(link.get('href'))
Output:
link:<Element {http://www.w3.org/2005/Atom}link at 0xf6a6d8ac> https://www.gwern.net/atom.xml
https://www.gwern.net/Mail%20delivery?utm_source=RSS&utm_medium=feed&utm_campaign=1
Tested with Python: 3.4.2
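As an alternative to a prefix mapping, the namespace URI can be written directly into the tag name in braces, and iter() can be used when every <link> is wanted regardless of nesting depth. A minimal sketch, again assuming a shortened stand-in for the feed (the `SAMPLE` string and the `ATOM` constant are illustrative names, not from the question):

```python
import xml.etree.ElementTree as ET

# Namespace URI in braces, the form ElementTree uses for qualified tag names.
ATOM = '{http://www.w3.org/2005/Atom}'

# Shortened stand-in for atom.xml (assumption: same structure as the question's feed).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://www.gwern.net/atom.xml" rel="self" />
  <entry>
    <link href="https://www.gwern.net/Mail%20delivery" rel="alternate" />
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# iter() walks the whole tree in document order, so it finds <link> at any depth.
all_hrefs = [link.get('href') for link in root.iter(ATOM + 'link')]
print(all_hrefs)

# The same brace form works inside a findall() path, without a namespaces dict.
entry_path = './{0}entry/{0}link'.format(ATOM)
entry_hrefs = [link.get('href') for link in root.findall(entry_path)]
print(entry_hrefs)
```

The prefix mapping tends to read better when a path has several steps; the brace form avoids passing an extra argument around.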
Upvotes: 3