robots.txt

Reputation: 137

Unable to pass proxies and links to the threadpool to get results

I've written a script in Python that uses proxies to scrape the links of different posts while traversing different pages of a website. The script is supposed to take a random proxy from a list, send a request to the website, and finally parse the items. If any proxy doesn't work, it should be kicked out of the list.

I thought the way I passed the list of proxies and the list of URLs within ThreadPool(10).starmap(make_requests, zip(proxyVault, lead_url)) was correct, but it doesn't produce any results; rather, the script gets stuck.

How can I pass the proxies and the links to the ThreadPool in order for the script to produce results?

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from multiprocessing.pool import ThreadPool
from itertools import cycle
import random

base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'
lead_url = ["https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=15".format(page) for page in range(1,6)]

proxyVault = ['104.248.159.145:8888', '113.53.83.252:54356', '206.189.236.200:80', '218.48.229.173:808', '119.15.90.38:60622', '186.250.176.156:42575']

def make_requests(proxyVault,lead_url):
    while True:
        random.shuffle(proxyVault)
        global pitem   
        pitem = cycle(proxyVault)
        proxy = {'https':'http://{}'.format(next(pitem))}
        try:
            res = requests.get(lead_url,proxies=proxy)
            soup = BeautifulSoup(res.text,"lxml")
            [get_title(proxy,urljoin(base_url,item.get("href"))) for item in soup.select(".summary .question-hyperlink")]
        except Exception:
            try: 
                proxyVault.pop(0)
                make_requests(proxyVault,lead_url)
            except Exception:pass

def get_title(proxy,itemlink):
    res = requests.get(itemlink,proxies=proxy)
    soup = BeautifulSoup(res.text,"lxml")
    print(soup.select_one("h1[itemprop='name'] a").text)

if __name__ == '__main__':
    ThreadPool(10).starmap(make_requests, zip(proxyVault,lead_url))

Btw, the proxies used above are just placeholders.

Upvotes: 2

Views: 317

Answers (2)

SimonF

Reputation: 1885

The problem with your code was that it created a lot of endless loops in the threads. Also, the way you handled the proxies seemed a bit strange to me, so I changed it. I also think you misunderstood how data is sent to the threads: each thread gets one element of the iterable, not the whole thing, so I changed some names to reflect that.
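To illustrate that last point, here is a tiny standalone sketch (the names are just for demonstration, it is not part of the fix below): with zip + starmap, each worker call receives exactly one (proxy, url) pair, and zip stops at the shorter list.

from multiprocessing.pool import ThreadPool

# Demo only: each call to work() receives ONE pair from the zipped lists,
# not the whole lists.
def work(proxy, url):
    return f'{proxy} -> {url}'

proxies = ['p1', 'p2', 'p3']
urls = ['u1', 'u2']

if __name__ == '__main__':
    with ThreadPool(2) as pool:
        # zip stops at the shorter list, so only two (proxy, url) pairs are produced
        print(pool.starmap(work, zip(proxies, urls)))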

The way it works now is that each thread gets its own URL from lead_url and chooses a random proxy from proxyVault. It fetches the page, parses it, and calls get_title on each of the parsed links.

If the request fails because of the proxy, that proxy is removed from the list so it's not used again, and make_requests is called again, which will randomly choose a new proxy from those still available. I did not change the actual parsing, because I can't judge whether it's what you want.

Runnable code:

https://repl.it/@zlim00/unable-to-pass-proxies-and-links-to-the-threadpool-to-get-re

from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
from random import choice
import requests
from urllib.parse import urljoin

base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'
lead_url = [f'https://stackoverflow.com/questions/tagged/web-scraping?sort='
            f'newest&page={page}&pagesize=15' for page in range(1, 6)]

proxyVault = ['36.67.57.45:53367', '5.202.150.233:42895',
              '85.187.184.129:8080', '109.195.23.223:45947']

def make_requests(url):
    proxy_url = choice(proxyVault)
    proxy = {'https': f'http://{proxy_url}'}
    try:
        res = requests.get(url, proxies=proxy)
        soup = BeautifulSoup(res.text, "lxml")
        [get_title(proxy, urljoin(base_url, item.get("href")))
         for item in soup.select(".summary .question-hyperlink")]
    except requests.exceptions.ProxyError:
        # Check so that the bad proxy was not removed by another thread
        if proxy_url in proxyVault:
            proxyVault.remove(proxy_url)
            print(f'Removed bad proxy: {proxy_url}')
        return make_requests(url)

def get_title(proxy, itemlink):
    res = requests.get(itemlink, proxies=proxy)
    soup = BeautifulSoup(res.text, "lxml")
    print(soup.select_one("h1[itemprop='name'] a").text)

if __name__ == '__main__':
    ThreadPool(10).map(make_requests, lead_url)

Upvotes: 2

aafirvida

Reputation: 541

Maybe you can use another approach to get proxies, like this:

import random

import requests
from bs4 import BeautifulSoup

def get_proxy():
    # Pick a random proxy (IP:port) from the free-proxy-list.net anonymous proxy table
    url = 'https://free-proxy-list.net/anonymous-proxy.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('table', attrs={'id': 'proxylisttable'})
    table_body = table.find('tbody')
    proxies = table_body.find_all('tr')
    proxy_row = random.choice(proxies).find_all('td')
    return proxy_row[0].text + ':' + proxy_row[1].text
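
And a hypothetical way to plug it into the requests calls above (fetch and the retry count are just illustrative names, not part of the original script):

# Hypothetical helper: try a few random proxies from the free list before giving up
def fetch(url, retries=3):
    for _ in range(retries):
        proxy_url = get_proxy()
        try:
            return requests.get(url, proxies={'https': f'http://{proxy_url}'}, timeout=10)
        except requests.exceptions.RequestException:
            continue  # this proxy failed, try another one
    return None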

Upvotes: -1
