Reputation: 315
I am extracting image data from Flickr via their API and what I get printed is a few thousand xml objects that look like this:
<photo accuracy="15" context="0" dateupload="1398279194" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="320" id="13986079375" isfamily="0" isfriend="0" ispublic="1" latitude="41.828482" license="0" longitude="-87.624506" owner="100231432@N02" pathalias="perspectivesschools" place_id="cF8n.mJTWrhYf0uBEw" secret="f46eef0b1d" server="7308" title="Sean Gallagher, Pulitzer Photojournalist visits MSA" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" width_n="213" woeid="28297331" />
<photo accuracy="12" context="0" dateupload="1394558054" farm="4" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086071753" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="265103ac38" server="3040" title="" url_n="https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg" width_n="320" woeid="13978" />
<photo accuracy="12" context="0" dateupload="1394558019" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086343854" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="a6858f84d2" server="7451" title="" url_n="https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg" width_n="320" woeid="13978" />
Now I want to extract data for attributes 'lat' and 'long' in one run. And the data for the attribute 'url_n' in the other. How can I do that in Python? I have no experience with parsing xml data and don't know where to start.
Thanks a lot!
Upvotes: 1
Views: 193
Reputation: 44142
While there are multiple XML related packages in Python, incl. stdlib one, I prefer using lxml
, as
it offers all what I need (good XPath support, schema validation etc.) and I prefer to keep number
of packages I use small.
For the xml documents from Flickr, the solution could look like
flickr.py
from lxml import etree
xmllines = """
<photo accuracy="15" context="0" dateupload="1398279194" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="320" id="13986079375" isfamily="0" isfriend="0" ispublic="1" latitude="41.828482" license="0" longitude="-87.624506" owner="100231432@N02" pathalias="perspectivesschools" place_id="cF8n.mJTWrhYf0uBEw" secret="f46eef0b1d" server="7308" title="Sean Gallagher, Pulitzer Photojournalist visits MSA" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" width_n="213" woeid="28297331" />
<photo accuracy="12" context="0" dateupload="1394558054" farm="4" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086071753" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="265103ac38" server="3040" title="" url_n="https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg" width_n="320" woeid="13978" />
<photo accuracy="12" context="0" dateupload="1394558019" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086343854" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="a6858f84d2" server="7451" title="" url_n="https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg" width_n="320" woeid="13978" />
"""
for line in xmllines.strip().splitlines():
doc = etree.fromstring(line)
urls = doc.xpath("/photo/@url_n")
if urls:
url = urls[0]
print url
else:
print "---no attribute url_n was found---"
which would output:
$ python flickr.py
https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg
https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg
https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg
Upvotes: 1
Reputation: 7691
Parsing XML with regex is not a good idea. Try BeautifulSoup - it not only parses XML, but it also has functions to get the next/parent/etc element in relation to one selected and their attributes easily.
Example use:
from bs4 import BeautifulSoup
(...)
soup = BeautifulSoup(flickr_xml)
for photo in soup.find_all('photo'):
print(photo.get('url_n'))
Upvotes: 1