roberto swiss

Reputation: 155

SCRAPY SPIDER - Send POST Request

I am trying to scrape the table on this webpage (https://www.ftse.com/products/indices/uk). When I inspect the page in the Network tab, I see it fetches its data from an API with AJAX requests (type POST), which the browser issues after the layout is loaded. So I am trying to build a spider which sends POST requests to that API using the form data given in the request. I tested quickly with the following shell command and I get the data.

curl 'https://www.ftse.com/products/indices/home/ra_getIndexData/' --data 'indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions&Indices=ASX%2CFTSE+All-Share%2C%3AUKX%2CFTSE+100%2C%3AMCX%2CFTSE+250%2C%3AMCXNUK%2CFTSE+250+Net+Tax%2C%3ANMX%2CFTSE+350%2C%3ASMX%2CFTSE+Small+Cap%2C%3ANSX%2CFTSE+Fledgling%2C%3AAS0%2CFTSE+All-Small%2C%3AASXX%2CFTSE+All-Share+ex+Invt+Trust%2C%3AUKXXIT%2CFTSE+100+Index+ex+Invt+Trust%2C%3AMCIX%2CFTSE+250+Index+ex+Invt+Trust%2C%3ANMIX%2CFTSE+350+Index+ex+Invt+Trust%2C%3ASMXX%2CFTSE+Small+Cap+ex+Invt+Trust%2C%3AAS0X%2CFTSE+All-Small+ex+Invt+Trust%2C%3AUKXDUK%2CFTSE+100+Total+Return+Declared+Dividend%2C%3A&type='
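To see what Scrapy's `FormRequest` expects, the url-encoded body above can be decoded into a plain dict with `urllib.parse.parse_qsl`; a minimal sketch (the string is shortened here to a few parameters for readability):

```python
from urllib.parse import parse_qsl

# Shortened version of the form body sent by the curl command above;
# in real use, paste the full string.
data = 'indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions&type='

# parse_qsl decodes the percent-encoding and yields (key, value) pairs;
# keep_blank_values=True preserves the empty type= field.
params = dict(parse_qsl(data, keep_blank_values=True))

print(params['indexName'])  # GEISAC
```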

However, when I try to reproduce this in the spider using Scrapy's FormRequest class, the spider fails.

import urllib.parse

import scrapy


class FtseSpider(scrapy.Spider):
    name = 'ftse'
    #allowed_domains = ['www.ftserussell.com', 'www.ftse.com']
    start_urls = [
            'https://www.ftse.com/products/indices/uk']


    def parse(self, response):
        # URL parameters for the request
        data = 'indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions&Indices=ASX%2CFTSE+All-Share%2C%3AUKX%2CFTSE+100%2C%3AMCX%2CFTSE+250%2C%3AMCXNUK%2CFTSE+250+Net+Tax%2C%3ANMX%2CFTSE+350%2C%3ASMX%2CFTSE+Small+Cap%2C%3ANSX%2CFTSE+Fledgling%2C%3AAS0%2CFTSE+All-Small%2C%3AASXX%2CFTSE+All-Share+ex+Invt+Trust%2C%3AUKXXIT%2CFTSE+100+Index+ex+Invt+Trust%2C%3AMCIX%2CFTSE+250+Index+ex+Invt+Trust%2C%3ANMIX%2CFTSE+350+Index+ex+Invt+Trust%2C%3ASMXX%2CFTSE+Small+Cap+ex+Invt+Trust%2C%3AAS0X%2CFTSE+All-Small+ex+Invt+Trust%2C%3AUKXDUK%2CFTSE+100+Total+Return+Declared+Dividend%2C%3A&type='
        # convert the URL parameters into a dict
        params_raw_ = urllib.parse.parse_qs(data)
        params_dict_ = {k: v[0] for k, v in params_raw_.items()}
        # return the response
        yield [scrapy.FormRequest('https://www.ftse.com/products/indices/home/ra_getIndexData/',
                    method='POST',
                    body=params_dict_)]

Upvotes: 0

Views: 682

Answers (1)

ruhaib

Reputation: 649

The request body must be a string (or bytes), not a dict, so instead of passing the parsed dict you can pass the original url-encoded `data` string directly as the body. Also, use `yield from` when yielding from an iterable, or simply yield a single Request instead of a list.

yield from [scrapy.FormRequest('https://www.ftse.com/products/indices/home/ra_getIndexData/',
                method='POST', body=data)]

Upvotes: 1
