Reputation: 2917
I am trying to crawl a website with Scrapy where the pagination parameters come after the "#" character in the URL. Scrapy seems to ignore everything after that character, so it only ever sees the first page.
e.g.:
If you enter a question mark manually, the site will load page 1
The stats from scrapy tell me it fetched the first page:
DEBUG: Crawled (200) <GET http://www.rolex.de/de/watches/datejust/m126334-0014.html> (referer: http://www.rolex.de/de/watches/find-rolex.html)
My crawler looks like this:
start_urls = [
    'http://www.rolex.de/de/watches/find-rolex.html#g=1',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=2',
    'http://www.rolex.de/de/watches/find-rolex.html#g=0&p=3',
]
rules = (
    # Watch detail pages are passed to parse_item
    Rule(
        LinkExtractor(allow=[r'.*/de/watches/.*/m\d{3,}.*.\.html']),
        callback='parse_item'
    ),
    # Listing pages are followed to find more links
    Rule(
        LinkExtractor(allow=[r'.*/de/watches/find-rolex(/.*)?\.html#g=1(&p=\d*)?$']),
        follow=True
    ),
)
How can I make scrapy ignore the # inside the url and visit the given URL?
Upvotes: 0
Views: 121
Reputation: 3857
Scrapy performs HTTP requests. The part of a URL after '#' (the fragment) is not part of an HTTP request and is never sent to the server; it is only used client-side, in this case by JavaScript.
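To illustrate why every start URL ends up as the same request, here is a minimal sketch using only the standard library's urllib.parse.urldefrag. It is not what Scrapy runs internally; it just shows what remains of the question's URLs once the fragment is dropped.

```python
from urllib.parse import urldefrag

# The three start URLs from the question
start_urls = [
    "http://www.rolex.de/de/watches/find-rolex.html#g=1",
    "http://www.rolex.de/de/watches/find-rolex.html#g=0&p=2",
    "http://www.rolex.de/de/watches/find-rolex.html#g=0&p=3",
]

# urldefrag splits a URL into (url_without_fragment, fragment).
# Only the part without the fragment is sent to the server.
requested = {urldefrag(url).url for url in start_urls}

print(requested)
```

All three entries collapse into one URL, so from the server's point of view they are the same page.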
As suggested in the comments, the site loads the data using AJAX.
Moreover, it does not use pagination in AJAX: the site downloads the whole list of watches as JSON in a single request, and then the pagination is done using JavaScript.
So, you can just use the Network tab of the developer tools of your web browser to see the request that obtains the JSON data, and perform a similar request instead of requesting the HTML page.
Note, however, that you cannot use LinkExtractor for JSON data. You should simply parse the response with Python's json module and iterate over the URLs there.
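A rough sketch of that parsing step, using only the standard library. The JSON layout here is an assumption: the keys "watches" and "url" and the sample payload are hypothetical, and the real structure has to be read from the response shown in the Network tab. The helper name extract_watch_urls is also made up for illustration.

```python
import json
from urllib.parse import urljoin


def extract_watch_urls(json_text, base_url):
    """Yield absolute watch URLs from the JSON body.

    Assumes a hypothetical layout: {"watches": [{"url": "..."}, ...]}.
    Adjust the keys to match the real response.
    """
    data = json.loads(json_text)
    for entry in data["watches"]:
        # Resolve relative paths against the page the data belongs to
        yield urljoin(base_url, entry["url"])


# Hypothetical sample payload, standing in for the real AJAX response
sample = '{"watches": [{"url": "/de/watches/datejust/m126334-0014.html"}]}'
urls = list(
    extract_watch_urls(sample, "http://www.rolex.de/de/watches/find-rolex.html")
)
print(urls)
```

Inside a Scrapy callback, the same loop would take response.text as input and yield a scrapy.Request for each URL with callback=self.parse_item, instead of collecting them in a list.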
Upvotes: 1