Reputation: 43
Hi Stack Overflow community,
I would like to create a script that uses multithreading to issue a high number of parallel requests for HTTP status codes against a large list of URLs (more than 30k vhosts).
The requests can be executed from the same server where the websites are hosted.
I was using multithreaded curl requests, but I'm not really satisfied with the results I got: a complete check of 30k hosts takes more than an hour.
I am wondering if anyone has any tips, or if there is a more performant way to do it?
Upvotes: 1
Views: 786
Reputation: 43
After testing some of the available solutions, the simplest and fastest way was using webchk.
webchk is a command-line tool developed in Python 3 for checking the HTTP status codes and response headers of URLs.
The speed was impressive and the output was clean: it parsed 30k vhosts in about 2 minutes.
https://webchk.readthedocs.io/en/latest/index.html
https://pypi.org/project/webchk/
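For reference, installation and a basic run look roughly like this (a minimal sketch: the file name vhosts.txt is a placeholder, and the -f option for reading URLs from a file is taken from the linked docs, so check webchk -h for the exact flags in your version):

```
# install the tool (requires Python 3)
pip install webchk

# check the HTTP status code of every URL listed in vhosts.txt (one URL per line)
webchk -f vhosts.txt
```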
Upvotes: 2
Reputation: 4275
If you're looking for parallelism and multi-threaded approaches to making HTTP requests with Python, you might start with the aiohttp library, or use the popular requests package. Parallel execution can also be accomplished with multiprocessing from the standard library.
Here's a discussion of rate limiting with the aiohttp client: aiohttp: rate limiting parallel requests
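To make that concrete, here is a minimal aiohttp sketch that checks status codes concurrently while a semaphore caps the number of in-flight requests. It is only an illustration of the pattern: the file name vhosts.txt, the HEAD method, the 10 s timeout, and the concurrency limit of 100 are all assumptions, not anything from the question.

```python
import asyncio
import aiohttp

CONCURRENCY = 100  # assumed limit on in-flight requests; tune for your server

async def check(sem, session, url):
    # the semaphore is the rate limiter: at most CONCURRENCY requests run at once
    async with sem:
        try:
            async with session.head(url, allow_redirects=True) as resp:
                return url, resp.status
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            return url, f"error: {exc}"

async def main(path):
    with open(path) as fh:
        urls = [line.strip() for line in fh if line.strip()]
    sem = asyncio.Semaphore(CONCURRENCY)
    timeout = aiohttp.ClientTimeout(total=10)  # assumed per-request budget
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(check(sem, session, u) for u in urls))
    for url, status in results:
        print(url, status)

if __name__ == "__main__":
    asyncio.run(main("vhosts.txt"))  # hypothetical input file, one URL per line
```

Since everything runs on one event loop, 30k URLs don't need 30k threads; raising or lowering CONCURRENCY is the main knob for trading speed against load on the server.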
Here's a discussion about making multiprocessing work with requests: https://stackoverflow.com/a/27547938/10553976
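And a comparable sketch using requests with a worker pool. Again, the input file, pool size, and HEAD method are placeholders; for I/O-bound work a thread pool via multiprocessing.dummy behaves the same way and is cheaper than spawning processes.

```python
import requests
from multiprocessing import Pool  # or: from multiprocessing.dummy import Pool (thread pool)

def check(url):
    # HEAD keeps transfers small; switch to requests.get if a server rejects HEAD
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
        return url, resp.status_code
    except requests.RequestException as exc:
        return url, f"error: {exc}"

if __name__ == "__main__":
    with open("vhosts.txt") as fh:  # hypothetical input file, one URL per line
        urls = [line.strip() for line in fh if line.strip()]
    with Pool(50) as pool:  # worker count is an assumption; tune it for your host
        for url, status in pool.imap_unordered(check, urls):
            print(url, status)
```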
Making it performant is a matter of your implementation. Be sure to profile your attempts and compare to your current implementation.
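For the comparison itself, wall-clock timing of one full run of each candidate over the same URL list is usually enough to start with; a hypothetical helper using time.perf_counter from the standard library:

```python
import time

def timed(label, func, *args):
    # hypothetical helper: wall-clock time for one full run of a candidate checker
    start = time.perf_counter()
    result = func(*args)
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return result
```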
Upvotes: 0