Reputation: 11
I am trying to scrape hotel reviews from booking.com with the Python framework Scrapy.
My problem is that Scrapy can't find the data I want (e.g. the negative feedback). I think this is because the content is loaded by JavaScript embedded in the site.
I therefore tried changing my User-Agent in the settings.py file, but nothing changed. Then I tried to emulate a browser request, but I'm not sure I did it correctly.
Here is the link to the hotel I want to scrape the reviews of: https://www.booking.com/hotel/de/best-western-plus-marina-star-lindau.de.html
This is my Spider:
import scrapy


class FeedbacktestSpider(scrapy.Spider):
    name = 'feedbacktest'
    # allowed_domains takes bare domain names, without a path or trailing slash
    allowed_domains = ['booking.com']
    start_urls = ['https://www.booking.com/hotel/de/best-western-plus-marina-star-lindau.de.html']

    def start_requests(self):
        # Overrides the default requests built from start_urls so custom headers can be sent
        urls = ['https://www.booking.com/hotel/de/best-western-plus-marina-star-lindau.de.html']
        headers = {
            'Host': 'www.booking.com',
            'Device-Memory': '8',
            'DPR': '1',
            'Viewport-Width': '1920',
            'RTT': '50',
            'Downlink': '10',
            'ECT': '4g',
            'Upgrade-Insecure-Requests': '1',
            'DNT': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-User': '?1',
            'Sec-Fetch-Dest': 'document',
            'Referer': 'https://www.booking.com/',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
            'Cookie': '__utma=12798129.959027148.1615055069.1615055069.1615055069.1; __utmc=12798129; __utmz=12798129.1615055069.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmt=1; __utmb=12798129.1.10.1615055069'
        }
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        # Text of the featured (positive) review snippets
        pos = response.xpath("//div[@class='althotelsDiv2 use_sprites_no_back featured_reviewer']/p/span/text()").getall()
        yield {
            'pos': pos
        }
For the User-Agent in settings.py, I tried both my own browser's User-Agent and Google's User-Agent.
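For reference, this is roughly how a project-wide User-Agent is set, assuming Scrapy's standard USER_AGENT setting. The value shown is just the browser string from the spider's headers above; any other string would be set the same way.

```python
# settings.py (fragment)
# Sent with every request unless a request sets its own User-Agent header.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45"
)
```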
Thank you very much for your help.
Upvotes: 1
Views: 752
Reputation: 11
Okay, I solved the problem:
I opened the page I want to scrape in my browser's network tool and looked for the request that actually fetches the desired data.
Then I scraped that request's URL instead of the original link, and in my Scrapy settings.py I set ROBOTSTXT_OBEY = False so that my requests are not blocked by the site's robots.txt rules.
Upvotes: 0