Reputation: 1
I'm new to Scrapy and I have a question about the URLs that are scraped.
I'm trying to scrape a site where every page you go to redirects to the homepage; you can only access other pages by clicking a banner. I tried using
meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]}
to avoid the redirect, but the "Scraped from" URL was still wrong. So I thought the problem was the cookies, and to test it I hard-coded the cookies to match the ones the browser has when it enters the site. Now it isn't redirecting, and I don't even need to put 'dont_redirect' in the meta, but when I look at the debug output it is still scraping the homepage.
For now the code looks like this:
import scrapy


class MatchOpeningSpider(scrapy.Spider):
    name = 'bet_365_match_opening'
    start_urls = [
        'https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/'
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, cookies={
                'pstk': '04761A56B7A54D9BB3948A093FB9F440000003',
                'rmbs': 3,
                'aps03': 'lng=22&tzi=34&oty=2&ct=28&cg=1&cst=0&hd=N&cf=N',
                'session': 'processform=0&fms=1'
            })

    def parse(self, response):
        games = response.css('div.sl-CouponParticipantWithBookCloses_Name').extract()
        yield {'games': games}
In the debug output you can see that the Crawled URL is right, but the "Scraped from" URL is the homepage:
2019-04-21 12:02:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/> (referer: None)
2019-04-21 12:02:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bet365.com/>
What am I doing wrong? Thanks for helping!
Upvotes: 0
Views: 365
Reputation: 190
Your start URL contains a fragment identifier (the hash sign, #) in the middle. Everything after it is handled only by the browser and is never sent to the server.
This means the data you need might not be in the HTTP response for the start URL. It may instead come from other Ajax calls made after the main document request and be rendered on the client side.
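To illustrate the point about the fragment, here is a minimal sketch using only Python's standard library (not Scrapy itself). It splits the start URL from the question into the part that is actually requested and the fragment that stays on the client, which is consistent with the "Scraped from" line in the log:

```python
from urllib.parse import urldefrag

url = "https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/"

# urldefrag separates the URL sent to the server from the fragment,
# which only the browser (and the page's JavaScript) ever sees.
requested, fragment = urldefrag(url)

print(requested)  # https://www.bet365.com/
print(fragment)   # /AC/B1/C1/D13/E38078994/F2/
```

So as far as the server is concerned, the request is for the homepage; the path after # is presumably interpreted by the site's JavaScript.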
My suggestions:
Use the browser's dev tools, the Scrapy shell, or even curl to first make sure that the content you need exists in the HTTP response for the start URL. Otherwise, you are scraping the wrong URL.
Make the HTTP headers and cookies exactly the same as in a real browser. Scrapy handles 3xx redirects and cookie changes for you, but you need to find and reproduce the actual request path in your spider.
If the data is rendered on the client side and you don't want to deal with this, try a Selenium-based spider, which uses a browser with a JS engine to get around these problems.
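The first suggestion can be sketched roughly as follows, using only the standard library. The helper names `html_has_class` and `fetch` and the User-Agent value are made up for illustration; the idea is just to check whether the CSS class from your selector appears on any tag in the raw HTML, before any JavaScript has run:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated classes.
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def html_has_class(html, class_name):
    """Return True if some tag in the raw HTML has class_name."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


def fetch(url):
    """Download the raw HTML (hypothetical helper; the header is an assumption)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `html_has_class(fetch("https://www.bet365.com/"), "sl-CouponParticipantWithBookCloses_Name")` would then tell you whether the element is in the raw response at all. If it returns False, the content is most likely rendered on the client side, and no combination of cookies or redirect settings in the spider will make the selector match.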
Upvotes: 1