Phill Pafford
Phill Pafford

Reputation: 85378

Python (newbie) Parse XML from API call

I've look for some tutorials/other questions on stack/documentation and still can't figure it out. ugh!!!

Making the API request and the parsing out (want to assign to variables but that's a bonus to this question), This is what I'm trying. Why can't I list the title and link for the items?

#!/usr/bin/python

# Screen Scraper for Subs
import urllib
from xml.etree import ElementTree as ET

show = 'heroes'
season = '4'
language = 'en'
limit = '1'

requestURL = 'http://api.allsubs.org/index.php?' \
           + 'search=' + show \
           + '+season+' + season \
           + '&language=' + language \
           + '&limit=' + limit

root = ET.parse(urllib.urlopen(requestURL)).getroot()
print root
print '\n'

items = root.findall('items')
for item in items: 
     item.find('title').text # should print: <![CDATA[Heroes Season 4 Subtitles]]>
     item.find('link').text # Should print: http://www.allsubs.org/subs-download/heroes+season+4/1223435/

XML Response

        <AllSubsAPI> 
        <title>AllSubs API: Subtitles Search</title> 
        <link>http://www.allsubs.org</link> 
        <description><![CDATA[Subtitles Search for Heroes Season 4]]></description> 
        <language>en-us</language> 
        <results>1</results> 
        <found_results>24</found_results> 
<items> 
    <item> 
            <title><![CDATA[Heroes Season 4 Subtitles]]></title> 
            <link>http://www.allsubs.org/subs-download/heroes+season+4/1223435/</link> 
            <filename>heroes-season-4-english-heroes-season-4-en.zip</filename> 
            <files_in_archive>Heroes - 4x01-02 - Orientation.HDTV.FQM.en.srt|Heroes - 4x17 - The Art of Deception.HDTV.2HD.en.srt|Heroes - 4x07 - Strange Attractors.HDTV.LOL.en.srt|Heroes - 4x08 - Once Upon a Time in Texas.HDTV.2HD.en.srt|Heroes - 4x07 - Strange Attractors.720p HDTV.DIMENSION.en.srt|Heroes - 4x05 - Hysterical Blindness.720p HDTV.X264.en.srt|Heroes - 4x09 - Shadowboxing.HDTV.LOL.en.srt|Heroes - 4x16 - Pass Fail.HDTV.LOL.en.srt|Heroes - 4x04 - Acceptance.HDTV.en.srt|Heroes - 4x01-02 - Orientation.720p HDTV.DIMENSION.en.srt|Heroes - 4x06 - Tabula Rasa.HDTV.NoTV.en.srt|Heroes - 4x10 - Brother's Keeper.HDTV.FQM.en.srt|Heroes - 4x04 - Acceptance.HDTV.FQM.en.srt|Heroes - 4x14 - Let It Bleed.720p HDTV.DIMENSION.en.srt|Heroes - 4x06 - Tabula Rasa.720p HDTV.SiTV.en.srt|Heroes - 4x08 - Once Upon a Time in Texas.HDTV.NoTV.en.srt|Heroes - 4x12 - The Fifth Stage.HDTV.LOL.en.srt|Heroes - 4x19 - Brave New World.HDTV.LOL.en.srt|Heroes - 4x15 - Close to You.720p HDTV.DIMENSION.en.srt|Heroes - 4x03 - Ink.720p HDTV.DIMENSION.en.srt|Heroes - 4x11 - Thanksgiving.720p HDTV.DIMENSION.en.srt|Heroes - 4x13 - Upon This Rock.720p HDTV.DIMENSION.en.srt|Heroes - 4x13 - Upon This Rock.HDTV.LOL.en.srt|Heroes - 4x14 - Let It Bleed.HDTV.LOL.en.srt|Heroes - 4x15 - Close to You.HDTV.LOL.en.srt|Heroes - 4x12 - The Fifth Stage.720p HDTV.DIMENSION.en.srt|Heroes - 4x18 - The Wall.HDTV.LOL.en.srt|Heroes - 4x08 - Once Upon a Time in Texas.720p HDTV.CTU.en.srt|Heroes - 4x17 - The Art of Deception.HDTV.CTU.en.srt|Heroes - 4x09 - Shadowboxing.720p HDTV.DIMENSION.en.srt|Heroes - 4x10 - Brother's Keeper.720p HDTV.DIMENSION.en.srt|Heroes - 4x04 - Acceptance.720p HDTV.CTU.en.srt|Heroes - 4x11 - Thanksgiving.HDTV.FQM.en.srt|Heroes - 4x03 - Ink.HDTV.FQM.en.srt|Heroes - 4x05 - Hysterical Blindness.HDTV.XII.en.srt|</files_in_archive> 
            <languages>en</languages> 
            <added_on>2010-02-16</added_on> 
    </item> 

</items> 
</AllSubsAPI>

UPDATE:

This worked, thanks for the help and pointing out my typo

items = root.findall('items/item')
for item in items: 
     print item.find('title').text
     print item.find('link').text

Upvotes: 7

Views: 16619

Answers (3)

Spacedman
Spacedman

Reputation: 94307

This works for me. Note I'm using urllib2 to get through a proxy:

import urllib2
from xml.etree import ElementTree as ET

show = 'heroes'
season = '4'
language = 'en'
limit = '1'

requestURL = 'http://api.allsubs.org/index.php?' \
           + 'search=' + show \
           + '+season+' + season \
           + '&language=' + language \
           + '&limit=' + limit

root = ET.parse(urllib2.urlopen(requestURL)).getroot()
print root
print '\n'

items = root.findall('items')[0].findall('item')
for item in items: 
     print item.find('title').text # should print: <![CDATA[Heroes Season 4 Subtitles]]>
     print item.find('link').text # Should print: http://www.allsubs.org/subs-download/heroes+season+4/1223435/

note that findall('items') finds the "items" tag, what you want to loop over (I think) are the "item" tags within that, so we findall() of those. Also, you need to print to get anything out of python.

Also, if I do it with limit=2, I get a:

Traceback (most recent call last):
  File "heros.py", line 18, in <module>
    root = ET.parse(urllib2.urlopen(requestURL)).getroot()
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
    parser.feed(data)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 24, column 95

I'm not sure the XML coming back from this API is well-formed - there's no "xml" element at the start for starters. I wouldn't trust it...

Upvotes: 3

Mikeyg36
Mikeyg36

Reputation: 2838

You are not iterating over the 'item' elements, you are in fact iterating over the 'items' elements.

I think it should be:

items = root.findall('items')
childItems = items.findall('item')
for childItem in childItems: 
    childItem.find('title').text # should print: <![CDATA[Heroes Season 4 Subtitles]]>
    childItem.find('link').text # Should print: http://www.allsubs.org/subs-download/heroes+season+4/1223435

Upvotes: 2

Adam Vandenberg
Adam Vandenberg

Reputation: 20671

items = root.findall('items')

should be

items = root.findall('items/item')

Upvotes: 4

Related Questions