Xodarap777

Reputation: 1376

Crawling through pages with PostBack data javascript Python Scrapy

I'm crawling through some directories built with ASP.NET via Scrapy.

The links to the pages I want to crawl are encoded like this:

javascript:__doPostBack('ctl00$MainContent$List','Page$X')

where X is an integer between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these links. I would love to add something as simple as allow=('Page$') or attrs='__doPostBack' to the SgmlLinkExtractor (SLE) rules, but my guess is that I have to be trickier in order to pull the info out of the javascript "link."

If it's easier to "unmask" each of the absolute links from the javascript code and save those to a CSV, then use that CSV to load requests into a new scraper, that's okay, too.
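
For what it's worth, a minimal sketch of that "unmasking" step might look like the following (the regex and the extract_postbacks helper are illustrative, not part of the original question; response is assumed to be a Scrapy response object):

import re

POSTBACK_RE = re.compile(r"__doPostBack\('([^']+)','([^']+)'\)")

def extract_postbacks(response):
    # scan every anchor's href for a __doPostBack(...) call and yield the
    # (target, argument) pair, e.g. ('ctl00$MainContent$List', 'Page$5')
    for href in response.xpath('//a/@href').extract():
        match = POSTBACK_RE.search(href)
        if match:
            yield match.groups()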

Upvotes: 10

Views: 5963

Answers (1)

alecxe

Reputation: 474191

This kind of pagination is not as trivial as it may seem. It was an interesting challenge to solve. There are several important notes about the solution provided below:

  • the idea here is to follow the pagination page by page, passing the current page number around in the Request.meta dictionary
  • using a regular BaseSpider since there is some logic involved in the pagination
  • it is important to provide headers pretending to be a real browser; in particular, the X-MicrosoftAjax: Delta=true header (together with the __ASYNCPOST form field) asks the server for a compact, pipe-delimited "delta" response instead of a full HTML page, which is what the regular expressions in parse_page() rely on
  • it is important to yield FormRequests with dont_filter=True since we are basically making a POST request to the same URL but with different parameters

The code:

import re

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


HEADERS = {
    'X-MicrosoftAjax': 'Delta=true',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


class ExitRealtySpider(BaseSpider):
    name = "exit_realty"

    allowed_domains = ["exitrealty.com"]
    start_urls = [URL]

    def parse(self, response):
        # submit a form (first page)
        self.data = {}
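        # collect every input on the ASP.NET form (hidden fields such as
        # __VIEWSTATE and __EVENTVALIDATION included) so that the first
        # POST replays the complete form state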
        for form_input in response.css('form#aspnetForm input'):
            name = form_input.xpath('@name').extract()[0]
            try:
                value = form_input.xpath('@value').extract()[0]
            except IndexError:
                value = ""
            self.data[name] = value

        self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
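        # __EVENTTARGET / __EVENTARGUMENT are what the javascript
        # __doPostBack() call would set in the browser; 'Page$1' requests
        # the first page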
        self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
        self.data['__EVENTARGUMENT'] = 'Page$1'

        return FormRequest(url=URL,
                           method='POST',
                           callback=self.parse_page,
                           formdata=self.data,
                           meta={'page': 1},
                           dont_filter=True,
                           headers=HEADERS)

    def parse_page(self, response):
        current_page = response.meta['page'] + 1

        # parse agents (TODO: yield items instead of printing)
        for agent in response.xpath('//a[@class="regtext"]/text()'):
            print(agent.extract())
        print("------")

        # request the next page
        data = {
            '__EVENTARGUMENT': 'Page$%d' % current_page,
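            # the AJAX "delta" response is pipe-delimited, so the fresh
            # __EVENTVALIDATION and __VIEWSTATE tokens are extracted with
            # regexes rather than read from hidden form inputs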
            '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__ASYNCPOST': 'true',
            '__EVENTTARGET': 'ctl00$MainContent$agentList',
            'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
            '': ''
        }

        return FormRequest(url=URL,
                           method='POST',
                           formdata=data,
                           callback=self.parse_page,
                           meta={'page': current_page},
                           dont_filter=True,
                           headers=HEADERS)
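
For a quick local test, the spider above can be saved to a file and run with Scrapy's runspider command (the filename here is arbitrary):

scrapy runspider exit_realty_spider.py

Note that as written there is no stopping condition: parse_page() always requests the next page. If the directory has a known page count (180 in the question), one straightforward option is to check current_page before yielding the next FormRequest.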

Upvotes: 19
