Reputation: 357
I'm trying to get some data from an AJAX-based site. I first have to fetch the public page and grab some IDs from it so that I can then simulate the AJAX request to the server.
The problem is the second request requires a timestamp parameter. It looks something like:
https://sub.domain.com/id/?z=9999999999
where the z parameter is the UNIX time in seconds. After some testing, it turns out that the request is only valid for a few seconds. If the timestamp is outside that window, the server returns a 404.
Scrapy uses a generator to iterate over the requests. If I create a few dozen requests and also have the 'DOWNLOAD_DELAY' setting configured to wait a couple of seconds between requests, a fair amount of time passes between a request being created and it actually being executed. By then the timestamp has expired and I get an error page.
My question is: is there a way to add the parameter right before the actual request is executed? Alternatively, is it possible to execute a request on the spot instead of yielding it to the generator?
Upvotes: 0
Views: 621
Reputation: 33168
Is there a way to add the parameter right before the actual request is executed?
Yes, the process_request method of a custom DownloaderMiddleware is designed for that purpose.
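A minimal sketch of what such a middleware might look like. The names with_fresh_timestamp, FreshTimestampMiddleware and the fresh_timestamp meta key are made up for illustration, and the in-place update relies on Request._set_url, which is a private Scrapy method and may change between versions, so treat this as a starting point rather than the definitive way to do it.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_fresh_timestamp(url, param="z", now=None):
    """Return url with the query parameter `param` set to the current UNIX time.

    `now` exists only to make the function easy to test; it defaults to time.time().
    """
    parts = urlsplit(url)
    # Keep every other query parameter, drop any stale copy of `param`.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    stamp = int(time.time() if now is None else now)
    query.append((param, str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))


class FreshTimestampMiddleware:
    """Hypothetical downloader middleware: refreshes `z` on flagged requests."""

    def process_request(self, request, spider):
        # Only touch requests the spider explicitly opted in (assumed meta key).
        if request.meta.get("fresh_timestamp"):
            # Assumption: _set_url is private Scrapy API; it mutates the request
            # in place so the request does not need to be rescheduled.
            request._set_url(with_fresh_timestamp(request.url))
        # Returning None lets Scrapy continue processing this request.
        return None
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py (for example under a path like myproject.middlewares.FreshTimestampMiddleware, which is a placeholder), and the spider would yield the AJAX requests with meta={"fresh_timestamp": True}. One caveat worth testing: as far as I can tell, downloader middlewares run before the request waits out DOWNLOAD_DELAY in the download slot, so with many concurrent requests to the same domain there may still be a gap of a few seconds. Lowering CONCURRENT_REQUESTS_PER_DOMAIN may help narrow it.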
is it possible to execute a request on the spot instead of yielding it to the generator?
Yes and no. Yes, because it's Python and you can do what you'd like. No, because if you just use urllib or requests or whatever, you completely circumvent all the Scrapy benefits, including your aforementioned DOWNLOAD_DELAY setting. You may be able to use the priority= kwarg to tell Scrapy about the time sensitivity of the Request, but ultimately whether it is able to get that Request scheduled and executed within the time window of the timestamp depends on how many of those there are in the queue.
I would try your first approach, rewriting the timestamp rather than trying to expedite an existing request, especially if those timestamps are not meaningful to the server (that is, all it's doing is checking now() - query.z for validity, and not select * from whatever where z = query.z).
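To make the distinction concrete, here is a guess at what a pure freshness check on the server could look like. The function name is_fresh and the 5-second window are assumptions (the question only says "a few seconds"). If the server does something like this, any recent timestamp works and rewriting z at send time is safe. If it looks z up in a database instead, a rewritten value would not match anything.

```python
def is_fresh(z, now, window_seconds=5):
    """Assumed server-side rule: accept z only if it is within the window of now."""
    return abs(now - z) <= window_seconds


# A request sent 3 seconds after its timestamp was generated would pass,
# one that sat in the queue for 30 seconds would be rejected (the 404 case).
print(is_fresh(1700000000, 1700000003))
print(is_fresh(1700000000, 1700000030))
```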
Upvotes: 1