Reputation: 5385
I've written a bit of code in an attempt to pull photos from a website. I want it to find photos, then download them so I can tweet them:
import urllib2
from lxml.html import fromstring
import sys
import time

url = "http://www.phillyhistory.org/PhotoArchive/Search.aspx"
response = urllib2.urlopen(url)
html = response.read()
dom = fromstring(html)

sels = dom.xpath('//*[(@id = "large_media")]')

for pic in sels[:1]:
    output = open("file01.jpg","w")
    output.write(pic.read())
    output.close()

#twapi = tweepy.API(auth)
#twapi.update_with_media(imagefilename, status=xxx)
I'm new at this sort of thing, so I'm not really sure why this isn't working. No file is created, and the sels list comes back empty.
Upvotes: 0
Views: 657
Reputation: 32630
Your problem is that the image search (Search.aspx) doesn't just return an HTML page with all the content in it. Instead, it delivers a JavaScript application that then makes several subsequent requests (see AJAX) to fetch raw information about assets, and builds an HTML page dynamically that contains all those search results.
You can observe this behavior by looking at the HTTP requests your browser makes when you load the page. Use the Firebug extension for Firefox or the built-in Chrome developer tools and open the Network tab. Look for requests that happen after the initial page load, particularly POST requests.
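You can verify this from Python, too: fetch the page and check whether the element your XPath selects exists in the raw HTML at all. A quick sketch, using the same urllib2/lxml setup from your question:

import urllib2
from lxml.html import fromstring

# Fetch the raw HTML the server sends, before any JavaScript has run.
html = urllib2.urlopen('http://www.phillyhistory.org/PhotoArchive/Search.aspx').read()
dom = fromstring(html)

# The element your XPath selects is built later by JavaScript, so it is
# absent from the initial document -- this prints [], which matches
# your observation that sels ends up empty.
print dom.xpath('//*[(@id = "large_media")]')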
In this case the interesting requests are the ones to Thumbnails.ashx, Details.ashx and finally MediaStream.ashx. Once you identify those requests, look at what headers and form data your browser sends, and emulate that behavior with plain HTTP requests from Python.
The response from Thumbnails.ashx is actually JSON, so it's much easier to parse than HTML.
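Before writing a full script, it's worth replaying that request interactively and looking at the shape of the JSON. A rough sketch; whether the endpoint accepts this minimal subset of form fields is an assumption on my part, so if the server rejects it, copy the complete form data from your browser's Network tab (or use the full field set from the script below):

import json
import requests

# Replay the search request the JavaScript application makes.
# The minimal field set here is an assumption -- copy the complete
# form data from the Network tab if the server rejects it.
response = requests.post(
    'http://www.phillyhistory.org/PhotoArchive/Thumbnails.ashx',
    data={'start': 0, 'limit': 1, 'request': 'Images', 'noStore': 'false'})

result = response.json()
print result.keys()                        # e.g. [u'totalImages', u'images', ...]
print json.dumps(result, indent=2)[:500]   # peek at the structure of the payload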
In this example I use the requests module because it's much, much better and easier to use than urllib(2). If you don't have it, install it with pip install requests.
Try this:
import requests
import urllib

BASE_URL = 'http://www.phillyhistory.org/PhotoArchive/'
QUERY_URL = BASE_URL + 'Thumbnails.ashx'
DETAILS_URL = BASE_URL + 'Details.ashx'


def get_media_url(asset_id):
    # Details.ashx returns JSON metadata for one asset, including its media ID.
    response = requests.post(DETAILS_URL, data={'assetId': asset_id})
    image_details = response.json()
    media_id = image_details['assets'][0]['medialist'][0]['mediaId']
    # BASE_URL already ends with a slash.
    return '{}MediaStream.ashx?mediaId={}'.format(BASE_URL, media_id)


def save_image(asset_id):
    # MediaStream.ashx serves the actual image bytes.
    filename = '{}.jpg'.format(asset_id)
    url = get_media_url(asset_id)
    with open(filename, 'wb') as f:
        response = requests.get(url)
        f.write(response.content)
    return filename


# The same form fields the JavaScript application sends; the min/max
# values describe the map bounding box to search.
urlqs = {
    'maxx': '-8321310.550067',
    'maxy': '4912533.794965',
    'minx': '-8413034.983992',
    'miny': '4805521.955385',
    'onlyWithoutLoc': 'false',
    'sortOrderM': 'DISTANCE',
    'sortOrderP': 'DISTANCE',
    'type': 'area',
    'updateDays': '0',
    'withoutLoc': 'false',
    'withoutMedia': 'false'
}

data = {
    'start': 0,
    'limit': 12,
    'noStore': 'false',
    'request': 'Images',
    'urlqs': urllib.urlencode(urlqs)
}

response = requests.post(QUERY_URL, data=data)
result = response.json()

print '{} images found'.format(result['totalImages'])

for image in result['images']:
    asset_id = image['assetId']
    print 'Name: {}'.format(image['name'])
    print 'Asset ID: {}'.format(asset_id)
    filename = save_image(asset_id)
    print "Saved image to '{}'.\n".format(filename)
Note: I didn't check what http://www.phillyhistory.org/'s Terms of Service say about automated crawling. You need to check that for yourself and make sure that whatever you're doing doesn't violate their ToS.
Upvotes: 1