jambormike

Reputation: 43

scrapy-splash unable to render a particular JavaScript website

I am trying to use the Scrapy framework to scrape the https://www.sreality.cz/en/search/for-sale/apartments website.

Part of the site is rendered with JavaScript, so I am trying to use a Splash Docker container to get rendered HTML that I can easily parse.

I pulled the scrapinghub/splash Docker image and started a container on port 8050 from the terminal.

% docker pull scrapinghub/splash

% docker run -p 8050:8050 scrapinghub/splash

In the settings.py file in my Scrapy project directory, I added these lines as instructed at https://github.com/scrapy-plugins/scrapy-splash.

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I created a new spider in my project directory.

import scrapy
from scrapy_splash import SplashRequest

class FlatSpider(scrapy.Spider):
    name = "flat"
    def start_requests(self):
        # sreality url
        url = 'https://www.sreality.cz/en/search/for-sale/apartments'

        # beer test url
        # url = 'https://www.beerwulf.com/en-gb/c/mixedbeercases'

        yield SplashRequest(url=url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):

        # sreality variable
        foo = response.css('span.name.ng-binding::text').get()

        # beer test variable
        # foo = response.css('h4.product-name::text').get()

        print(foo)

If I run this spider with % scrapy crawl flat in the terminal, it prints None even though it should print text (which I can see in the Chrome inspector). Otherwise everything seems to work: if I uncomment the two 'beer test' lines, Splash renders HTML that I can parse, and the code prints the text in the terminal.

Also, when I open the Splash UI at http://localhost:8050 and try to render https://www.sreality.cz/en/search/for-sale/apartments, the page does not seem to render correctly. However, it works for other websites.
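To take Scrapy out of the picture, the same check can be made against Splash's render.html HTTP endpoint. The following is only a minimal sketch: the function names and the 2-second wait are illustrative assumptions, and it presumes the container from the docker run command above is still listening on port 8050.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same address as SPLASH_URL in settings.py.
SPLASH_URL = "http://localhost:8050"


def build_render_url(target_url, wait=2.0, splash_url=SPLASH_URL):
    """Return the Splash render.html URL for target_url (hypothetical helper)."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target_url, wait=2.0, timeout=60):
    """Ask the running Splash container for the rendered HTML of target_url."""
    with urlopen(build_render_url(target_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_rendered_html('https://www.sreality.cz/en/search/for-sale/apartments') and searching the returned string for ng-binding would show whether the listing markup is present in what Splash returns, independently of the spider and its CSS selector.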

For some reason this scraping approach does not work for the particular website I am interested in. I am trying to figure out why, and how to get a response from this site that I can query with response.css and easily parse.

I am running this on macOS 13.0.1 on Apple silicon, in case it matters.

Upvotes: 0

Views: 255

Answers (1)

Radwan

Reputation: 74

I tried Splash before, but its community is no longer active. A better plugin for scraping interactive websites is scrapy-playwright.
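As a rough, untested sketch of what the switch might look like in settings.py, assuming scrapy-playwright is installed (pip install scrapy-playwright, followed by playwright install chromium) and that the Splash-specific settings from the question are removed. The PLAYWRIGHT_HANDLER name is only a local shorthand, not a scrapy-playwright setting.

```python
# settings.py (sketch): route HTTP(S) downloads through scrapy-playwright.
# Assumes scrapy-playwright and a Playwright browser are already installed.

# Local shorthand for the handler path (not a setting scrapy-playwright reads).
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"

DOWNLOAD_HANDLERS = {
    "http": PLAYWRIGHT_HANDLER,
    "https": PLAYWRIGHT_HANDLER,
}

# scrapy-playwright requires Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, a plain scrapy.Request replaces SplashRequest, for example:
#   yield scrapy.Request(url, callback=self.parse, meta={"playwright": True})
```

Only requests that carry "playwright": True in their meta are rendered in a browser, so the parse callback and its response.css selector can stay as they are. Whether the selector then matches still depends on the page having finished rendering when the response is returned.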

Upvotes: 4
