Hans

Reputation: 31

Scrapy crawler response url vs request url

I am quite new to the world of programming and Python. I am currently exploring the Scrapy framework. I built a spider that pulls information from ads across several pages, but when I navigate to the next page I run into an issue that I cannot resolve.

The response URL differs from the request URL, which changes my search query. Below is a Scrapy shell session in which this difference is visible. Can someone explain why this is happening and how I can prevent it?

(scrapyvirtualenv) PS C:\Users\X\Desktop\Python1\scrapy\webcrawler> scrapy shell "https://www.marktplaats.nl/l/huis-en-inrichting/kasten-dressoirs/p/2/#q:jaren+60" 
The JSON file does not exist
The CSV file does not exist
2020-04-19 12:02:12 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: WebCrawler)
2020-04-19 12:02:12 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.9, Platform Windows-10-10.0.18362-SP0
2020-04-19 12:02:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-04-19 12:02:12 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'WebCrawler',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'WebCrawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['WebCrawler.spiders']}
2020-04-19 12:02:12 [scrapy.extensions.telnet] INFO: Telnet Password: something
2020-04-19 12:02:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2020-04-19 12:02:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-19 12:02:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-04-19 12:02:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-04-19 12:02:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-19 12:02:12 [scrapy.core.engine] INFO: Spider opened
2020-04-19 12:02:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marktplaats.nl/robots.txt> (referer: None)
2020-04-19 12:02:12 [protego] DEBUG: Rule at line 2 without any user agent to enforce it on.
2020-04-19 12:02:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marktplaats.nl/l/huis-en-inrichting/kasten-dressoirs/p/2/#q:jaren+60> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000255C48D9100>
[s]   item       {}
[s]   request    <GET https://www.marktplaats.nl/l/huis-en-inrichting/kasten-dressoirs/p/2/#q:jaren+60>
[s]   response   <200 https://www.marktplaats.nl/l/huis-en-inrichting/kasten-dressoirs/p/2/>
[s]   settings   <scrapy.settings.Settings object at 0x00000255C48D6E20>
[s]   spider     <MarktplaatsSpider 'marktplaats' at 0x255c4c29f40>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

I have tried adjusting the yield statement by passing meta={'dont_redirect': True}, but that does not have the desired result.

yield scrapy.Request(next_page_url, meta={'dont_redirect': True}, callback=self.parse)
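
To make the difference concrete: the part that goes missing is exactly the piece after the '#'. A minimal check with just the standard library, using the URLs from the shell session above:

from urllib.parse import urlsplit

request_url = 'https://www.marktplaats.nl/l/huis-en-inrichting/kasten-dressoirs/p/2/#q:jaren+60'
response_url = 'https://www.marktplaats.nl/l/huis-en-inrichting/kasten-dressoirs/p/2/'

# The piece that disappears is the URL fragment (everything after '#').
print(urlsplit(request_url).fragment)  # q:jaren+60

# Dropping the fragment from the request URL yields the response URL exactly.
print(urlsplit(request_url)._replace(fragment='').geturl() == response_url)  # True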

Upvotes: 1

Views: 411

Answers (2)

Hans

Reputation: 31

Thank you very much guys!

While doing some reading on your suggestions I found this video: https://www.youtube.com/watch?v=EelmnSzykyI which ultimately led me to this post: https://ianlondon.github.io/blog/web-scraping-discovering-hidden-apis/, which helped me find a solution to my problem.
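
In case it helps someone else, this is roughly the shape of the solution: skip the HTML pages entirely and request the JSON endpoint the site's own JavaScript calls. A minimal sketch with a hypothetical endpoint and response keys (you have to discover the real URL, parameters, and JSON structure yourself in the browser's Network tab, as the second link explains):

import json

import scrapy


class MarktplaatsApiSpider(scrapy.Spider):
    name = 'marktplaats_api'
    # Hypothetical endpoint: find the real URL and its query parameters
    # by watching the Network tab in your browser's developer tools.
    start_urls = ['https://www.marktplaats.nl/example-api/search?query=jaren+60&page=2']

    def parse(self, response):
        # The hidden API returns JSON, not HTML, so no selectors are needed.
        data = json.loads(response.text)
        for listing in data.get('listings', []):  # 'listings' is a hypothetical key
            yield {'title': listing.get('title')}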

Upvotes: 2

wishmaster

Reputation: 1487

The # in your URL marks a fragment. It is only used client-side by your browser (Chrome, etc.), for example to jump straight to an element on the page or to drive JavaScript; it is never sent to the server, which is why it is missing from the response URL. You can let Scrapy handle it for you or not.

Good documentation on this is the Scrapy documentation on scraping pages that use AJAX.

I recommend removing it and requesting the page normally, as in the sketch below.
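
A minimal sketch of that recommendation (the spider name and the pagination link are placeholders, not the asker's actual code):

from urllib.parse import urldefrag

import scrapy


class MarktplaatsSpider(scrapy.Spider):
    name = 'marktplaats'
    start_urls = ['https://www.marktplaats.nl/l/huis-en-inrichting/kasten-dressoirs/p/2/']

    def parse(self, response):
        # ... extract ad data here ...
        next_page_url = response.urljoin('../3/#q:jaren+60')  # placeholder link
        # The fragment ('#...') is never sent to the server, so strip it and
        # request the clean URL; request and response URLs will then match.
        clean_url, _fragment = urldefrag(next_page_url)
        yield scrapy.Request(clean_url, callback=self.parse)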

Upvotes: 0
