Sitanshu.K

Reputation: 15

How to make a POST request in Scrapy that requires a request payload

I am trying to parse data from this website.
In the Network tab of the browser's developer tools I found the URL https://busfor.pl/api/v1/searches, which receives a POST request and returns the JSON I am interested in.
This POST request carries a request payload containing a dictionary.
I assumed it was ordinary form data of the kind passed to FormRequest in Scrapy, but the request returns a 403 error.

I have already tried the following.

url = "https://busfor.pl/api/v1/searches"
formdata = {"from_id" : d_id
                ,"to_id" : a_id
                ,"on" : '2019-10-10'
                ,"passengers" : 1
                ,"details" : []
}
yield scrapy.FormRequest(url, callback=self.parse, formdata=formdata)

This returns a 403 error.
I also tried the following, based on a Stack Overflow post.

url = "https://busfor.pl/api/v1/searches"
payload = [{"from_id" : d_id
                ,"to_id" : a_id
                ,"on" : '2019-10-10'
                ,"passengers" : 1
                ,"details" : []
}]
yield scrapy.Request(url, self.parse, method = "POST", body = json.dumps(payload))

But this returns the same error.
Can someone help me figure out how to get the required data using Scrapy?

Upvotes: 0

Views: 3034

Answers (1)

The way to send a POST request with JSON data is the latter, but you are passing the wrong JSON to the site: it expects a dictionary, not a list of dictionaries. So instead of:

payload = [{"from_id" : d_id
                ,"to_id" : a_id
                ,"on" : '2019-10-10'
                ,"passengers" : 1
                ,"details" : []
}]

You should use:

payload = {"from_id": d_id,
           "to_id": a_id,
           "on": '2019-10-10',
           "passengers": 1,
           "details": []}
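
To make the difference between the attempts concrete, here is a minimal sketch using only the standard library. It compares a form-encoded body (roughly what a form submission sends), the JSON object the API appears to expect, and the list-wrapped JSON from the question. The IDs are the ones from the spider's start URL and are used only for illustration; the "details" key is left out of the form body because an empty list has no obvious form encoding.

```python
import json
from urllib.parse import urlencode

# Illustrative values only; the IDs come from the search page URL.
payload = {"from_id": "62113", "to_id": "3559", "on": "2019-10-10",
           "passengers": 1, "details": []}

# Form-encoded body (application/x-www-form-urlencoded), key=value pairs.
form_body = urlencode({k: v for k, v in payload.items() if k != "details"})

# JSON body: a single JSON object, which is what a dict serializes to.
json_body = json.dumps(payload)

# The list-wrapped payload from the question serializes to a JSON array.
wrong_body = json.dumps([payload])

print(form_body)
print(json_body)
print(wrong_body)
```

The first body is a query-string style string, the second starts with "{" and the third starts with "[", so only the second matches the shape of the request payload shown in the browser.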

Another thing you missed is the set of headers sent with the POST request. Sites sometimes use IDs and hashes to control access to their API; in this case I found two values that appear to be needed, X-CSRF-Token and X-NewRelic-ID. Luckily for us, both values are available on the search page.
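
As a rough illustration of pulling those two values out of a page, here is a sketch that runs the same xpid pattern against a made-up HTML snippet. The markup and the EXAMPLE_ values are hypothetical placeholders, not the real page content, and the meta-tag regex is only a stand-in for the XPath selector you would normally use in Scrapy.

```python
import re

# Hypothetical markup for illustration; the real search page may differ.
sample_html = '''
<html><head>
<meta name="csrf-token" content="EXAMPLE_TOKEN">
<script>window.NREUM={info:{xpid:"EXAMPLE_XPID"}}</script>
</head></html>
'''

# CSRF token stored in a <meta> tag (assumed layout).
csrf_token = re.search(r'<meta name="csrf-token" content="(.*?)"', sample_html).group(1)

# New Relic ID embedded in inline JavaScript as xpid:"...".
newrelic_id = re.search(r'xpid:"(.*?)"', sample_html).group(1)

# Headers to send along with the JSON POST request.
headers = {
    "X-CSRF-Token": csrf_token,
    "X-NewRelic-ID": newrelic_id,
    "Content-Type": "application/json; charset=UTF-8",
}
print(headers)
```

Because the token is tied to the page that served it, the values should be read from a fresh response on each run rather than hard-coded.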

Here is a working spider; the search result is available in the self.parse_search method.

import json
import scrapy

class BusForSpider(scrapy.Spider):
    name = 'busfor'
    start_urls = ['https://busfor.pl/autobusy/Sopot/Gda%C5%84sk?from_id=62113&on=2019-10-09&passengers=1&search=true&to_id=3559']
    search_url = 'https://busfor.pl/api/v1/searches'

    def parse(self, response):
        # The API expects a single JSON object, not a list.
        payload = {"from_id": '62113',
                   "to_id": '3559',
                   "on": '2019-10-10',
                   "passengers": 1,
                   "details": []}
        # Both header values are embedded in the search page.
        csrf_token = response.xpath('//meta[@name="csrf-token"]/@content').get()
        newrelic_id = response.xpath('//script/text()').re_first(r'xpid:"(.*?)"')
        headers = {
            'X-CSRF-Token': csrf_token,
            'X-NewRelic-ID': newrelic_id,
            'Content-Type': 'application/json; charset=UTF-8',
        }
        yield scrapy.Request(
            self.search_url,
            callback=self.parse_search,
            method="POST",
            body=json.dumps(payload),
            headers=headers,
        )

    def parse_search(self, response):
        # The search results arrive as JSON.
        data = json.loads(response.text)

Upvotes: 4
