Anton

Reputation: 1

How can I send a very large number of requests to DIFFERENT sites and get responses?

How can I send a lot of requests to DIFFERENT sites? I have a database of sites (about 1 million) and need to check whether each one is alive. If I just do it through grequests (Python) in chunks (100 requests across 10 threads takes ~128 seconds), it will take about 12.5 days, which is too long for me, and I am sure this can be done much faster.

Can you tell me what I can use in this case? I'm just collecting information about the main page of the sites.

Here is my code; what can you recommend to improve it? I tried putting every request into its own thread, but it feels like something is blocking. I will use proxies so that my IP isn't blocked for making so many requests.

Any help is appreciated!

import re
import time

# grequests monkey-patches the standard library via gevent,
# so import it before other network-related modules
import grequests
from colorama import Fore, Style
from concurrent.futures import ThreadPoolExecutor, as_completed

def start_parting(urls: list, chunk_num, chunks):
    if len(urls) > 0:
        chunk_num += 1  # enumerate() is 0-based; report chunks as 1-based
        print(f'Chunk [{Fore.CYAN}{chunk_num}/{chunks}{Style.RESET_ALL}] started! Length: {len(urls)}')
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"}
        # Build all requests for this chunk, then send them concurrently
        rs = [grequests.get(url.split(' ')[0].strip(), headers=headers, timeout=10) for url in urls]
        responses = grequests.map(rs)
        for response in responses:
            if response is None:  # request failed (timeout, DNS error, ...)
                continue
            if response.status_code == 200:
                check_pattern = r'(pat1|pat2)'
                match = re.search(check_pattern, response.text, re.IGNORECASE)
                if match:
                    site = match.group(1)
                    print(f'Site {site}')
        print(f'Chunk [{Fore.LIGHTCYAN_EX}{chunk_num}/{chunks}{Style.RESET_ALL}] ended!')

def test_sites_for_file(file, num_threads=10, chunk_size=100):
    print('Start check!')
    urls = file.readlines()
    t = time.time()  # start timing before the work is submitted, not after
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        parts = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
        finals = [executor.submit(start_parting, part, part_num, len(parts)) for part_num, part in enumerate(parts)]
        for final in as_completed(finals):
            pass  # just wait for every chunk to finish
        print(f'Elapsed: {time.time() - t:.1f}s')

Upvotes: -1

Views: 88

Answers (1)

Kodiologist

Reputation: 3495

You can try switching to HTTP HEAD (via grequests.head) instead of GET. That, in principle, should be faster, because the bodies of these pages, which you're going to ignore anyway, won't have to be transmitted. Other than that, I don't think there's much you can do to speed this up purely in software, if you've already parallelized it. Making a huge amount of HTTP requests takes time.
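In the asker's code this would mean replacing `grequests.get(...)` with `grequests.head(...)`. As a dependency-free illustration of the same idea, here is a HEAD-based liveness check sketched with only the standard library (the `is_alive` helper and the example URL are illustrative, not from the original code):

```python
import urllib.request

def is_alive(url, timeout=10):
    """Return True if the site answers a HEAD request with a 2xx/3xx status.

    A HEAD request asks the server for headers only, so the page body
    is never transmitted, which is all a liveness check needs.
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False

# The request object itself confirms which HTTP method will be sent:
req = urllib.request.Request("https://example.com", method="HEAD")
print(req.get_method())  # HEAD
```

Note that some servers misbehave on HEAD (returning 405 or closing the connection), so a fallback to GET for those cases may still be needed.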

Upvotes: 1
