vferraz

Reputation: 469

Scrapy: how to scrape a second HTML page requested through an AJAX call

I am new to Scrapy and HTML, and I'm trying to create a simple spider to scrape the https://www.mobiel.nl website.

I have managed to access the mobile phones pages, e.g. https://www.mobiel.nl/smartphone/apple/iphone-6-32gb

I am trying to take information on the plans, such as operator names (taken from the image names), plan names and rates, which are stored in the following containers:

<div class="pc-result js-offer" data-offer-id="71-1928-3683-19.0">

I have tried dozens of different ways of writing the selectors, such as:

 scrapy shell https://www.mobiel.nl/smartphone#
 fetch('https://www.mobiel.nl/smartphone/apple/iphone-6-32gb') 

In [37]: response.xpath('//*[@id="js-compare-results"]/text()')
Out[37]: []

In [38]: response.xpath('//*[@id="js-compare-results"]/*')
Out[38]: []

In [39]: response.xpath('//*[@id="js-compare-results"]')
Out[39]: []

In [40]: response.xpath('//*[@id="js-compare-results"]/div/div[2]/div[2]/div/div[1]/div/div[1]/div[1]/span[1]')
Out[40]: []

In [41]: response.xpath('//*[@id="js-compare-results"]/div/div[2]/div[2]/div/div[1]/div/div[1]/div[1]/span[1]').extract()
Out[41]: []

I couldn't find a way to get any information except the device name, which I get with: response.xpath('//*[@class="phone-info__phone"]/text()').extract_first()

In the end I would like to have something like

[device name, operator (e.g. t-mobile), plan (e.g. 1GB), period (e.g. 1 year), rate (e.g. 15€)]

Does anyone know how to correctly extract (if possible) such information from this page?

Thank you in advance.

**Edit 1: spider sourcecode**

    # -*- coding: utf-8 -*-
    from scrapy import Spider
    from scrapy.http import Request
    from scrapy_splash import SplashRequest


    class TmnlPricecrawlerSpider(Spider):
        name = 'tmnl_pricecrawler'
        allowed_domains = ['www.mobiel.nl']
        start_urls = ['https://www.mobiel.nl/smartphone#']

        def parse(self, response):
            # Process smartphone pages. On this website all phones are listed
            # on the same page, so no pagination handling is needed.
            mobielnl_items = response.xpath('//*[@class="phone-list-item__link"]/@href').extract()
            for item in mobielnl_items:
                item_url = response.urljoin(item)
                yield Request(item_url, callback=self.parse_mobielnl)
                # Earlier attempt, going through Splash directly:
                # yield SplashRequest(url=item_url, callback=self.parse_mobielnl)

        def parse_mobielnl(self, response):
            # Re-request the phone page through Splash so that its JavaScript runs.
            yield SplashRequest(url=response.url, callback=self.parse_aaa)

        def parse_aaa(self, response):
            # Placeholder: the plan data should be extracted here.
            pass

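I tried to fetch the inner URLs using scrapy_splash, but still without success.

Edit 2: I have realized that the price-comparator div is empty in the Scrapy response (first snippet), whereas in the browser it contains an iframe (second snippet):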

In [87]: response.xpath('//*[@id="price-comparator"]').extract_first()
Out[87]: '<div id="price-comparator" class="page-width page-width--spacing" data-style="mobielnl" data-token="EnsjtkLMsBkkYyLQVEZwqA" data-phone="803"></div>'

<div id="price-comparator" class="page-width page-width--spacing" data-style="mobielnl" data-token="EnsjtkLMsBkkYyLQVEZwqA" data-phone="803"><iframe src="https://pcnltelecom.tdsapi.com/portal/iframe/full_compare/?api_token=EnsjtkLMsBkkYyLQVEZwqA&amp;api_domain=https%3A%2F%2Fwww.mobiel.nl&amp;dom_id=price-comparator&amp;iframe_options[style]=mobielnl&amp;iframe_options[click_outs_in_parent]=true&amp;iframe_options[show_sponsored_positions]=false&amp;iframe_options[filter][phones][]=803&amp;iframe_options[type_options][phone_offers][show]=false&amp;iframe_options[type_options][propositions][show]=true&amp;iframe_options[type_options][sim_only][show]=false" width="100%" scrolling="no" frameborder="0" class="pc-iframe" id="iFrameResizer0" style="overflow: hidden; min-height: 500px; height: 1240.1px;"></iframe></div>


The data-token and data-phone attributes supply the values used in the URL from which the data I need is requested. Would the right approach be to extract these values and substitute them into that URL, or is there a more suitable way of doing this?

Upvotes: 0

Views: 206

Answers (1)

gangabass

Reputation: 10666

If you check the above URL with Chrome DevTools, you'll find that this information is requested through a separate AJAX call to this URL.

That's why your XPath expressions don't work.
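
As a rough illustration of the idea, the sketch below rebuilds the iframe URL shown in the question from the `data-token` and `data-phone` attributes. The query parameters are copied from that iframe `src`. The helper name `build_compare_url` is made up for this example, and it is an assumption that the endpoint answers a plain request without extra cookies or headers.

```python
from urllib.parse import urlencode

# Endpoint taken from the iframe src in the question.
IFRAME_BASE = "https://pcnltelecom.tdsapi.com/portal/iframe/full_compare/"


def build_compare_url(token, phone_id, style="mobielnl"):
    """Rebuild the iframe URL from the data-token / data-phone values.

    Hypothetical helper: the parameter list mirrors the iframe src seen in
    the browser, but the site may change it at any time.
    """
    params = [
        ("api_token", token),
        ("api_domain", "https://www.mobiel.nl"),
        ("dom_id", "price-comparator"),
        ("iframe_options[style]", style),
        ("iframe_options[click_outs_in_parent]", "true"),
        ("iframe_options[show_sponsored_positions]", "false"),
        ("iframe_options[filter][phones][]", phone_id),
        ("iframe_options[type_options][phone_offers][show]", "false"),
        ("iframe_options[type_options][propositions][show]", "true"),
        ("iframe_options[type_options][sim_only][show]", "false"),
    ]
    # safe="[]" keeps the square brackets literal, as in the original src;
    # everything else (e.g. the ':' and '/' in api_domain) is percent-encoded.
    return IFRAME_BASE + "?" + urlencode(params, safe="[]")


if __name__ == "__main__":
    # Values taken from the <div id="price-comparator"> in the question.
    print(build_compare_url("EnsjtkLMsBkkYyLQVEZwqA", "803"))
```

In a spider, the two values could be read with `response.xpath('//*[@id="price-comparator"]/@data-token').extract_first()` (and likewise `@data-phone`), and the resulting URL passed to a normal `Request` whose callback parses the offers. If that page in turn loads the offers through a further request, the same DevTools check applies to it.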

Upvotes: 1
