Reputation: 43
I am trying to use the Scrapy framework to scrape the https://www.sreality.cz/en/search/for-sale/apartments website.
Part of the site is rendered with JavaScript, so I am trying to use a Splash Docker container to give me rendered HTML that I can parse easily.
I downloaded the scrapinghub/splash Docker image and started its container on port 8050 from the terminal.
% docker pull scrapinghub/splash
% docker run -p 8050:8050 scrapinghub/splash
In the settings.py file in my Scrapy project directory I added these lines, as instructed at https://github.com/scrapy-plugins/scrapy-splash.
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
I created a new spider in my project directory.
import scrapy
from scrapy_splash import SplashRequest


class FlatSpider(scrapy.Spider):
    name = "flat"

    def start_requests(self):
        # sreality url
        url = 'https://www.sreality.cz/en/search/for-sale/apartments'
        # beer test url
        # url = 'https://www.beerwulf.com/en-gb/c/mixedbeercases'
        yield SplashRequest(url=url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        # sreality variable
        foo = response.css('span.name.ng-binding::text').get()
        # beer test variable
        # foo = response.css('h4.product-name::text').get()
        print(foo)
If I run this spider with % scrapy crawl flat in the terminal, it prints None even though it should return text (which I can see in the Chrome inspector).
Otherwise everything seems to work. If I uncomment the two 'beer test' lines, Splash renders HTML that I can parse and the code prints the text in the terminal.
Also, when I open the Splash UI at http://localhost:8050 and try to render https://www.sreality.cz/en/search/for-sale/apartments, it does not seem to work correctly, although it works for other sites.
For some reason this scraping setup does not work for the one site I am interested in. I am trying to figure out why, and how to get a response from this site that I can easily parse with response.css.
I run this on macOS 13.0.1 on Apple silicon, if it matters.
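One way to narrow this down might be to ask Splash for the rendered HTML directly, outside Scrapy, and check whether the listing markup is there at all. The sketch below assumes Splash's render.html HTTP endpoint with its url and wait arguments and the same localhost:8050 container as above; the helper names splash_render_url and fetch_rendered_html are hypothetical and only the standard library is used.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same address as SPLASH_URL in settings.py.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(target_url, wait=0.5, base=SPLASH_URL):
    """Build a URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(target_url, wait=0.5, timeout=60):
    """Ask the running Splash container to render the page and return its HTML."""
    with urlopen(splash_render_url(target_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_rendered_html('https://www.sreality.cz/en/search/for-sale/apartments', wait=2) and searching the result for 'ng-binding' would show whether a longer wait helps or whether Splash never renders the listings, in which case the CSS selector in the spider is not the problem.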
Upvotes: 0
Views: 255
Reputation: 74
I tried Splash before, but its community is no longer active. A better plugin for scraping interactive websites is scrapy-playwright.
Upvotes: 4