user3835980

Reputation: 203

Web scraping a page after it's loaded its data

I'm trying to collect data on book price fluctuations for a school project. I'm using Python to scrape a book buyback aggregator (in this case, bookscouter), but because the site loads its data after the initial page load, fetching the source with the urllib2 package gives me the page as it was before the data was loaded. How do I get the source after the data has loaded?

Example: http://bookscouter.com/prices.php?isbn=9788498383621&searchbutton=Sell

Upvotes: 2

Views: 4284

Answers (2)

user764357

Reputation:

The challenge is reading the data once it has been rendered by a web browser, which requires some extra tricks. If you can, first check whether the site has a pre-rendered version* or an API.

This article (linked from the Web archive) has a pretty good breakdown of what you'll need to do. In short:

  1. Pick a good Python WebKit renderer (PyQt, in the case of the article)
  2. Use a windowing widget to fetch and render the page
  3. Fetch the rendered HTML from the widget
  4. Parse this HTML as normal using a library like lxml or BeautifulSoup.
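
Step 4 of the list above can be sketched with nothing but the standard library. This is a minimal Python 3 sketch, assuming the rendered HTML has already been pulled out of the widget as a string. The sample markup, the `class="price"` attribute and the vendor names are invented for illustration and are not bookscouter's real markup.

```python
from html.parser import HTMLParser


class PriceCellParser(HTMLParser):
    """Collect the text of every <td class="price"> cell (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price_cell = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "td" and ("class", "price") in attrs:
            self._in_price_cell = True
            self.prices.append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell
        if self._in_price_cell:
            self.prices[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_price_cell = False


def extract_prices(rendered_html):
    """Return the stripped text of each price cell in the rendered HTML."""
    parser = PriceCellParser()
    parser.feed(rendered_html)
    parser.close()
    return [p.strip() for p in parser.prices]


# Invented sample standing in for HTML fetched from the rendering widget
SAMPLE_RENDERED_HTML = """
<table>
  <tr><td class="vendor">Vendor A</td><td class="price">$4.25</td></tr>
  <tr><td class="vendor">Vendor B</td><td class="price">$3.10</td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_prices(SAMPLE_RENDERED_HTML))
```

With lxml or BeautifulSoup the same extraction would likely be shorter; the standard-library parser is used here only so the sketch has no dependencies.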

* Minor rant - the idea of having to hope for a pre-rendered version of what should be a static webpage angers me.

Upvotes: 1

loretoparisi

Reputation: 16281

You cannot do this with Python's HTTP libraries alone. You need something that runs the page's JavaScript, such as the headless browser PhantomJS.

With PhantomJS it is easy to set up scraping of all the page contents, both static and dynamically loaded via JavaScript (such as the results of Ajax calls, as in your case). In fact, you can register page event handlers for your page parser like this (a node.js + PhantomJS example):

/*
 * Register page handlers as functions. `handlers` has the shape:
 *
 * {
 *     onLoadStarted : onLoadStarted,
 *     onLoadFinished: onLoadFinished,
 *     onError : onError,
 *     onResourceRequested : onResourceRequested,
 *     onResourceReceived : onResourceReceived,
 *     onNavigationRequested : onNavigationRequested,
 *     onResourceError : onResourceError
 * }
 */
var PageHelper = {
    registerHandlers : function(page, handlers) {
        // Only register the handlers that were actually supplied
        if (handlers.onLoadStarted) page.set('onLoadStarted', handlers.onLoadStarted);
        if (handlers.onLoadFinished) page.set('onLoadFinished', handlers.onLoadFinished);
        if (handlers.onResourceError) page.set('onResourceError', handlers.onResourceError);
        if (handlers.onResourceRequested) page.set('onResourceRequested', handlers.onResourceRequested);
        if (handlers.onResourceReceived) page.set('onResourceReceived', handlers.onResourceReceived);
        if (handlers.onNavigationRequested) page.set('onNavigationRequested', handlers.onNavigationRequested);
        if (handlers.onError) page.set('onError', handlers.onError);
    }
};

At this point you have full control over what happens, and when, in the page you are downloading. For example:

// Minimal placeholders for the load handlers referenced below
var onLoadStarted = function() {};
var onLoadFinished = function(status) {};

var onResourceError = function(resourceError) {
    var errorReason = resourceError.errorString;
    var errorPageUrl = resourceError.url;
};
var onResourceRequested = function(request) {
    var msg = '  request: ' + JSON.stringify(request, undefined, 4);
};
var onResourceReceived = function(response) {
    var msg = '  id: ' + response.id + ', stage: "' + response.stage + '", response: ' + JSON.stringify(response);
};
var onNavigationRequested = function(url, type, willNavigate, main) {
    var msg = '  destination_url: ' + url;
    msg += '  type (cause): ' + type;
    msg += '  will navigate: ' + willNavigate;
    msg += '  from page\'s main frame: ' + main;
};

// onResourceRequested must be registered on the page object itself.
// The first function runs inside PhantomJS, the second one in node.
page.onResourceRequested(
    function(requestData, request) {
        //request.abort()
        //request.changeUrl(url)
        //request.setHeader(key,value)
        var msg = '  request: ' + JSON.stringify(request, undefined, 4);
        //console.log( msg )
    },
    function(requestData) {
        //console.log(requestData.url)
    });

PageHelper.registerHandlers(page, {
    onLoadStarted : onLoadStarted,
    onLoadFinished: onLoadFinished,
    onError : null, // onError THIS HANDLER CRASHES PHANTOM-NODE
    onResourceRequested : null, // MUST BE ON PAGE OBJECT
    onResourceReceived : onResourceReceived,
    onNavigationRequested : onNavigationRequested,
    onResourceError : onResourceError
});

As you can see, you can define your page handlers and take control of the flow, and therefore of the resources loaded on that page. That way you can be sure all the data is ready before you take the whole page source, like this:

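var Parser = {
    parse : function(page) {

        // Page loaded: grab the rendered text. With the phantomjs-node bridge,
        // evaluate() passes its result to the second (callback) function.
        var onSuccess = function(page) {
            page.evaluate(function() {
                return document.body.innerText;
            }, function(pageContents) {
                // pageContents holds the text of the fully loaded page
            });
        };
        // Page not ready, or it failed to load
        var onError = function(page, elapsed) {
        };
        // Check inside the page that the document is complete, then dispatch
        page.evaluate(function() {
            return document.readyState === 'complete';
        }, function(ready) {
            if (ready) onSuccess(page); else onError(page);
        });

    }
}; // Parser

Here you can see the whole page contents loaded in the onSuccess callback:

page.evaluate(function() {
    return document.body.innerText;
}, function(pageContents) {
    // pageContents is the text of the loaded page
});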
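
The "make sure the data is ready before reading the page" idea can also be expressed on the Python side as a small polling helper. This is only a sketch in Python 3: `wait_until` and `FakePage` are hypothetical names, and `FakePage` merely simulates a page whose text shows up on the third check. In real use, `is_ready` would be whatever callable asks your renderer for the current page text.

```python
import time


def wait_until(is_ready, timeout=10.0, interval=0.25):
    """Poll is_ready() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = is_ready()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("data was not ready after %.1f seconds" % timeout)
        time.sleep(interval)


class FakePage:
    """Stand-in for a rendered page; its text appears on the third poll."""

    def __init__(self):
        self.polls = 0

    def text(self):
        self.polls += 1
        return "Vendor A $4.25" if self.polls >= 3 else ""


if __name__ == "__main__":
    page = FakePage()
    print(wait_until(page.text, timeout=2.0, interval=0.01))
```

Polling with a deadline keeps the scraper from hanging forever on a page whose data never arrives.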

The page object comes directly from PhantomJS, as in the following snippet:

phantom.create(function (ph) {
    ph.createPage(function (page) {
        Parser.parse(page);
    });
}, options);

Of course, this is only to give you an idea of what you can do with node.js + PhantomJS, which are very powerful when combined.

You can run PhantomJS from a Python environment by calling it like this:

import traceback

try:
    output = ''
    # Command line: the interpreter, the script, then the script's arguments
    for result in runProcess([self.runProcess,
                              self.runScript,
                              self.jobId,
                              self.protocol,
                              self.hostname,
                              self.queryString]):
        output += result
    print(output)
except Exception as e:
    print(e)
    print(traceback.format_exc())

where you use subprocess Popen to execute the binary:

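import subprocess

def runProcess(exe):
    # Merge stderr into stdout so errors from the child are captured too
    p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                         universal_newlines=True)
    # Read until EOF: readline() returns '' once the child closes its output,
    # so no trailing lines are lost when the process exits
    for line in iter(p.stdout.readline, ''):
        yield line
    p.wait()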

Of course, the process to run is node.js in this case:

self.runProcess='node'

with the args you need as params.
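
Putting the Python pieces together, here is a minimal self-contained sketch in Python 3. `run_process` and `build_command` are illustrative names, and the script name and argument order passed to `build_command` are assumptions rather than anything a real scraper defines. The demo launches the Python interpreter itself as a stand-in child process, so it runs even without node installed.

```python
import subprocess
import sys


def run_process(cmd):
    """Yield decoded output lines (stderr merged into stdout) as they arrive."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        # Always release the pipe and reap the child
        proc.stdout.close()
        proc.wait()


def build_command(script, isbn, runner="node"):
    """Build the argument list; the script and its arguments are hypothetical."""
    return [runner, script, isbn]


if __name__ == "__main__":
    # Stand-in child process instead of: build_command("scrape.js", "9788498383621")
    demo = [sys.executable, "-c", "print('rendered page text')"]
    for line in run_process(demo):
        print(line)
```

Passing the command as a list avoids shell quoting problems with query strings and ISBNs.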

Upvotes: 2
