user6003897

Parsing an XML file with Python

I am trying to parse an XML file located in the same folder as my Python script, but when I run the script nothing is printed in the terminal. I am using ElementTree. Here is my code:

import xml.etree.ElementTree

f = xml.etree.ElementTree.parse('atom.xml').getroot()
for atype in f.findall('link'):
   print(atype.get('href'))

This is the XML; what I want to get from it is the href:

<?xml version='1.0' ?>
 <feed xmlns="http://www.w3.org/2005/Atom">
 <title type="text">Gwern</title>
 <id>https://www.gwern.net/</id>
 <updated>2017-07-22T14:57:39Z</updated>
 <link href="https://www.gwern.net/atom.xml" rel="self" />
<author>
<name>gwern</name>
</author>
<author>
 <name>ujdRR</name>
</author>
 <generator uri="http://github.com/jgm/gitit"    version="HEAD">gitit</generator>
<entry>
<id>https://www.gwern.net/Mail%20delivery?   utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1</id>
  <title type="text">Modified &quot;Mail delivery.page&quot;, Modified   &quot;Mistakes.page&quot;, Modified &quot;Nootropics.page&quot;, Modified &quot;Touhou.page&quot;, Modified &quot;Wikipedia resume.page&quot;,         &quot;Zeo.page&quot;, Modified &quot;hakyll.hs&quot;, Modified &quot;newsletter/2017/06.page&quot;, Modified &quot;the-long-stagnation.page&quot;, Modified &quot;wittgenstein-thesis.page&quot;</title>
<updated>2017-06-25T04:00:06Z</updated>
<author>
  <name>gwern</name>
</author>
<link href="https://www.gwern.net/Mail%20delivery?utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1" rel="alternate" />
<summary type="text">record all minor pending edits</summary>

Upvotes: 0

Views: 908

Answers (1)

stovfl

Reputation: 15513

Question: ... what I want to get from the xml the href

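Your XML declares a default namespace, <feed xmlns="http://www.w3.org/2005/Atom">, so every element name is qualified with it and findall('link') matches nothing.
You therefore have to pass the namespaces parameter to findall.
Second, the XML has two <link ...> tags, one of them inside an <entry> tag.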
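
To see why the original call printed nothing, here is a minimal sketch. It assumes a cut-down, well-formed stand-in for atom.xml (the SAMPLE string is made up for illustration, not the real file) and compares an unqualified lookup with a namespace-qualified one:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for atom.xml (assumption: the real file is well-formed).
SAMPLE = """<?xml version='1.0' ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://www.gwern.net/atom.xml" rel="self" />
  <entry>
    <link href="https://www.gwern.net/Mail%20delivery" rel="alternate" />
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# ElementTree stores the namespace URI in braces in front of the local name.
print(root.tag)

# The bare name does not match the namespace-qualified tag.
unqualified = root.findall('link')

# The fully qualified name matches the direct <link> child of <feed>.
qualified = root.findall('{http://www.w3.org/2005/Atom}link')

print(len(unqualified), len(qualified))
```

The braces form is what a prefix in the namespaces mapping gets expanded to, so both spellings select the same elements.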

findall(self, path, namespaces=None)
Finds all elements matching the ElementPath expression. Same as getroot().findall(path).
The optional namespaces argument accepts a prefix-to-namespace mapping that allows the usage of XPath prefixes in the path expression.

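import xml.etree.ElementTree as ET

# Parse the file from the question and get the <feed> root element
tree = ET.parse('atom.xml')
root = tree.getroot()

# Map a prefix to the Atom namespace URI; the prefix name itself is arbitrary
namespaces = {
    'xmlns': "http://www.w3.org/2005/Atom"
}

# Get the first <link ...> outside <entry>
link = root.findall('./xmlns:link', namespaces)[0]
print('link:{} {}'.format(link, link.get('href')))

# Find all <link ...> inside <entry>
for link in root.findall('./xmlns:entry/xmlns:link', namespaces):
    print(link.get('href'))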

Output:

link:<Element {http://www.w3.org/2005/Atom}link at 0xf6a6d8ac> https://www.gwern.net/atom.xml
https://www.gwern.net/Mail%20delivery?utm_source=RSS&utm_medium=feed&utm_campaign=1

Tested with Python: 3.4.2
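
If you want every href regardless of nesting, a possible variation is to search all descendants with './/'. This is only a sketch: the link_hrefs helper, the 'atom' prefix, and the inline SAMPLE document are assumptions made for illustration, not part of the original file or answer.

```python
import xml.etree.ElementTree as ET

# The prefix 'atom' is an arbitrary choice; only the URI has to match the document.
ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}

# Hypothetical, shortened stand-in for atom.xml.
SAMPLE = """<?xml version='1.0' ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://www.gwern.net/atom.xml" rel="self" />
  <entry>
    <link href="https://www.gwern.net/Mail%20delivery?utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1" rel="alternate" />
  </entry>
</feed>"""


def link_hrefs(root):
    """Return the href of every Atom <link> below root, in document order."""
    return [link.get('href') for link in root.findall('.//atom:link', ATOM_NS)]


hrefs = link_hrefs(ET.fromstring(SAMPLE))
for href in hrefs:
    print(href)
```

With a file on disk you would pass ET.parse('atom.xml').getroot() instead of the parsed string.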

Upvotes: 3
