IndiaSke

Reputation: 358

Scrapy - Splash fetch dynamic data

I am trying to fetch a dynamically loaded phone number from this page (among others): https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html

The phone number appears after a click on the div element with the class page-action click-tel. I am trying to reach this data with scrapy_splash, using a Lua script to execute the click.

After pulling the Splash image on my Ubuntu machine:

sudo docker run -d -p 8050:8050 scrapinghub/splash

Here is my code so far (I am using a proxy service):

import scrapy
from bs4 import BeautifulSoup

from ..items import company_item  # adjust to wherever your item class lives

class company(scrapy.Spider):
    name = "company"
    custom_settings = {
        "FEEDS" : {
            '/home/ubuntu/scraping/europages/data/company.json': {
                'format': 'jsonlines',
                'encoding': 'utf8'
            }
        },
        "DOWNLOADER_MIDDLEWARES" : { 
            'scrapy_splash.SplashCookiesMiddleware': 723, 
            'scrapy_splash.SplashMiddleware': 725, 
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
        },
        "SPLASH_URL" : 'http://127.0.0.1:8050/',
        "SPIDER_MIDDLEWARES" : { 
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, 
        },
        "DUPEFILTER_CLASS" : 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE" : 'scrapy_splash.SplashAwareFSCacheStorage'

    }
    allowed_domains = ['www.europages.fr']

    def __init__(self, company_url):
        self.company_url = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html" ##forced
        self.item = company_item()
        self.script = """
            function main(splash)
                splash.private_mode_enabled = false
                assert(splash:go(splash.args.url))
                assert(splash:wait(0.5))
                local element = splash:select('.page-action.click-tel') 
                local bounds = element:bounds()
                element:mouse_click{x=bounds.width/2, y=bounds.height/2}
                splash:wait(4)
                return splash:html()
            end
        """
            
    def start_requests(self):
        yield scrapy.Request(
            url = self.company_url,
            callback = self.parse,
            dont_filter = True,
            meta = {
                    'splash': {
                        'endpoint': 'execute',
                        'url': self.company_url,
                        'args': {
                            'lua_source': self.script,
                            'proxy': 'http://usernamepassword@proxyhost:port',
                            'html':1,
                            'iframes':1

                        }
                    }   
            }
        )
    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        print(soup.find('div', {'class': 'page-action click-tel'}))

The problem is that it has no effect: I still get nothing, as if the button had never been clicked.

Shouldn't return splash:html() include the result of element:mouse_click{x=bounds.width/2, y=bounds.height/2} in response.body, since element:mouse_click() waits for the changes to appear?

Am I missing something here?

Upvotes: 0

Views: 341

Answers (1)

msenior_

Reputation: 2110

Most of the time, when sites load data dynamically, they do so via background XHR requests to the server. A close look at the network tab when you click the 'telephone' button shows that the browser sends an XHR request to https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330. You can emulate the same request in your spider and avoid Scrapy Splash altogether. See the sample implementation below, using one URL:

import scrapy
from urllib.parse import urlparse

class Company(scrapy.Spider):
    name = 'company'
    allowed_domains = ['www.europages.fr']
    start_urls = ['https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html']

    def parse(self, response):
        # obtain the id and uuid to make xhr request
        uuid = urlparse(response.url).path.split('/')[-1].removesuffix('.html')
        id = response.xpath("//div[@itemprop='telephone']/a/@onclick").re_first(r"event,'(\d+)',")
        yield scrapy.Request(f"https://www.europages.fr/InfosTelecomJson.json?uidsid={uuid}&id={id}", callback=self.parse_address)

    def parse_address(self, response):
        yield response.json()

I get the response:

{'digits': '+49 220 69 53 30'}
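As an aside, the uidsid extraction from the page URL can be checked in isolation before wiring it into the spider. A minimal sketch (the helper name is mine; str.removesuffix needs Python 3.9+, which avoids the character-set pitfall of rstrip('.html')):

```python
from urllib.parse import urlparse

def extract_uidsid(url: str) -> str:
    """Return the last path segment of the URL without its '.html' suffix."""
    last_segment = urlparse(url).path.split('/')[-1]
    # removesuffix drops the literal suffix; rstrip would strip any of the
    # characters '.', 'h', 't', 'm', 'l' from the end, which can eat the id
    return last_segment.removesuffix('.html')

print(extract_uidsid("https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html"))
# DEU241700-00101
```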

Upvotes: 1
