Reputation: 110113
When scraping a site, which is preferable: curl, or Python's requests library?
I originally planned to use requests and explicitly specify a user agent. However, when I use it I often get an "HTTP 429 Too Many Requests" error, whereas curl seems to avoid that.
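For reference, what I'm doing with requests is roughly this (the URL and user-agent string here are just placeholders):

```python
import requests

# Placeholder user-agent string; the real one is set explicitly per site.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; MetadataUpdater/1.0)"}

def fetch_title_page(url):
    """Fetch one title's metadata page; raises for HTTP errors such as 429."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text
```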
I need to update the metadata for 10,000 titles, and I need a way to pull down the information for each title in parallel.
What are the pros and cons of using each for pulling down information?
Upvotes: 4
Views: 3011
Reputation: 11791
I'd go for the in-language version over an external program any day, because it's less hassle.
Only if that turned out to be unworkable would I fall back to an external program like curl. Always consider that people's time is far more valuable than machine time; any "performance gains" in an application like this will probably be swamped by network delays anyway.
Upvotes: 0
Reputation: 10507
Using requests lets you do everything programmatically in Python, which should result in a cleaner product.
If you use curl, you end up shelling out (e.g. via os.system or subprocess), which adds process-spawn overhead and leaves you decoding raw output and handling errors yourself.
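As a rough illustration (the URL here is a placeholder), compare an in-process requests call with shelling out to curl via subprocess:

```python
import subprocess

import requests

url = "https://example.com/title/123"  # placeholder URL

# In-process: the status code and body come back as ready-to-use Python objects.
resp = requests.get(url, timeout=10)
print(resp.status_code, len(resp.text))

# Shelling out: each request spawns a new curl process, and the body comes back
# as raw bytes that still have to be decoded and checked separately.
proc = subprocess.run(["curl", "-sS", url], capture_output=True, check=True)
print(len(proc.stdout))
```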
Upvotes: 2
Reputation: 28747
Since you want to parallelize the requests, you should use requests together with grequests (if you're using gevent) or erequests (if you're using eventlet). You may have to throttle how quickly you hit the website, though, since it may be rate limiting you and refusing requests because you're making too many in too short a period of time.
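A minimal sketch of the grequests approach, assuming gevent and grequests are installed; the URLs, user-agent string, and pool size are placeholders you'd tune against the site's rate limits:

```python
import grequests  # performs gevent monkey-patching on import, so import it early

# Placeholder URLs and user-agent string.
urls = ["https://example.com/title/%d" % i for i in range(10000)]
headers = {"User-Agent": "Mozilla/5.0 (compatible; MetadataUpdater/1.0)"}

# Build the unsent requests, then send them with a small pool size so the site
# isn't hammered hard enough to start returning HTTP 429 responses.
pending = (grequests.get(url, headers=headers, timeout=10) for url in urls)
responses = grequests.map(pending, size=10)

for url, response in zip(urls, responses):
    if response is not None and response.ok:
        print(url, len(response.text))
```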
Upvotes: 3