Reputation: 155
I am trying to scrape the table on this webpage (https://www.ftse.com/products/indices/uk). When I inspect the page in the Network tab, I see the page fetches its data from an API with AJAX requests (type POST), which the browser sends after the layout is loaded. So I am trying to build a spider which sends POST requests to that endpoint using the form data given in the request. I tested quickly with the following shell command and I get the data.
curl 'https://www.ftse.com/products/indices/home/ra_getIndexData/' --data 'indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions&Indices=ASX%2CFTSE+All-Share%2C%3AUKX%2CFTSE+100%2C%3AMCX%2CFTSE+250%2C%3AMCXNUK%2CFTSE+250+Net+Tax%2C%3ANMX%2CFTSE+350%2C%3ASMX%2CFTSE+Small+Cap%2C%3ANSX%2CFTSE+Fledgling%2C%3AAS0%2CFTSE+All-Small%2C%3AASXX%2CFTSE+All-Share+ex+Invt+Trust%2C%3AUKXXIT%2CFTSE+100+Index+ex+Invt+Trust%2C%3AMCIX%2CFTSE+250+Index+ex+Invt+Trust%2C%3ANMIX%2CFTSE+350+Index+ex+Invt+Trust%2C%3ASMXX%2CFTSE+Small+Cap+ex+Invt+Trust%2C%3AAS0X%2CFTSE+All-Small+ex+Invt+Trust%2C%3AUKXDUK%2CFTSE+100+Total+Return+Declared+Dividend%2C%3A&type='
However, when I try to code it in the spider using the FormRequest class, the spider fails.
import urllib.parse

import scrapy

class FtseSpider(scrapy.Spider):
    name = 'ftse'
    # allowed_domains = ['www.ftserussell.com', 'www.ftse.com']
    start_urls = ['https://www.ftse.com/products/indices/uk']

    def parse(self, response):
        # URL parameters for the request
        data = 'indexName=GEISAC&currency=GBP&rtn=CAPITAL&ctry=Regions&Indices=ASX%2CFTSE+All-Share%2C%3AUKX%2CFTSE+100%2C%3AMCX%2CFTSE+250%2C%3AMCXNUK%2CFTSE+250+Net+Tax%2C%3ANMX%2CFTSE+350%2C%3ASMX%2CFTSE+Small+Cap%2C%3ANSX%2CFTSE+Fledgling%2C%3AAS0%2CFTSE+All-Small%2C%3AASXX%2CFTSE+All-Share+ex+Invt+Trust%2C%3AUKXXIT%2CFTSE+100+Index+ex+Invt+Trust%2C%3AMCIX%2CFTSE+250+Index+ex+Invt+Trust%2C%3ANMIX%2CFTSE+350+Index+ex+Invt+Trust%2C%3ASMXX%2CFTSE+Small+Cap+ex+Invt+Trust%2C%3AAS0X%2CFTSE+All-Small+ex+Invt+Trust%2C%3AUKXDUK%2CFTSE+100+Total+Return+Declared+Dividend%2C%3A&type='
        # convert the URL parameters into a dict
        params_raw_ = urllib.parse.parse_qs(data)
        prams_dict_ = {k: v[0] for k, v in params_raw_.items()}
        # return the response
        yield [scrapy.FormRequest('https://www.ftse.com/products/indices/home/ra_getIndexData/',
                                  method='POST',
                                  body=prams_dict_)]
Upvotes: 0
Views: 682
Reputation: 649
Since the parsed data has nested values, it cannot be passed as the body of the request in Scrapy; instead, pass the raw string in the body of the request, which is equal to the initial representation of `data`. Also, use yield from when yielding an iterable, or yield a single Request object instead.
yield from [scrapy.FormRequest('https://www.ftse.com/products/indices/home/ra_getIndexData/',
                               method='POST',
                               body=data)]
Upvotes: 1