Reputation: 3895
I want to set proxies for my crawler. I'm using the requests module and Beautiful Soup. I have found a list of API links that provide free proxies with 4 types of protocols.
Proxies for 3 of the 4 protocols (HTTP, SOCKS4, SOCKS5) work; the ones that don't are the proxies with the HTTPS protocol. This is my code:
from bs4 import BeautifulSoup
import requests
import random
import json
# LIST OF FREE PROXY APIS, THESE PROXIES ARE LAST TIME TESTED 50 MINUTES AGO, PROTOCOLS: HTTP, HTTPS, SOCKS4 AND SOCKS5
list_of_proxy_content = ["https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=CH&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=FR&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=DE&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=1500&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=AT&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=IT&protocols=http%2Chttps%2Csocks4%2Csocks5"]
# EXTRACTING JSON DATA FROM THIS LIST OF PROXIES
full_proxy_list = []
for proxy_url in list_of_proxy_content:
    proxy_json = requests.get(proxy_url).text
    proxy_json = json.loads(proxy_json)
    proxy_json = proxy_json["data"]
    full_proxy_list.extend(proxy_json)
# CREATING PROXY DICT
final_proxy_list = []
for proxy in full_proxy_list:
    #print(proxy) # JSON VALUE FOR ALL DATA THAT GOES INTO PROXY
    protocol = proxy['protocols'][0]
    ip_ = proxy['ip']
    port = proxy['port']
    proxy = {protocol : protocol + '://' + ip_ + ':' + port}
    final_proxy_list.append(proxy)
# TRYING PROXY ON 3 DIFFERENT WEBSITES
for proxy in final_proxy_list:
    print(proxy)
    try:
        r0 = requests.get("https://edition.cnn.com/", proxies=proxy, timeout=15)
        if r0.status_code == 200:
            print("GOOD PROXY")
        else:
            print("BAD PROXY")
    except:
        print("proxy error")
    try:
        r1 = requests.get("https://www.buelach.ch/", proxies=proxy, timeout=15)
        if r1.status_code == 200:
            print("GOOD PROXY")
        else:
            print("BAD PROXY")
    except:
        print("proxy error")
    try:
        r2 = requests.get("https://www.blog.police.be.ch/", proxies=proxy, timeout=15)
        if r2.status_code == 200:
            print("GOOD PROXY")
        else:
            print("BAD PROXY")
    except:
        print("proxy error")
    print()
My question is: why do the HTTPS proxies not work? What am I doing wrong?
My proxies look like this:
{'socks4': 'socks4://185.168.173.35:5678'}
{'http': 'http://62.171.177.80:3128'}
{'https': 'http://159.89.28.169:3128'}
I have seen that sometimes people pass proxies like this:
proxies = {"http": "http://10.10.1.10:3128",
"https": "http://10.10.1.10:1080"}
But this dict has 2 protocol keys while the links both use only http. Why? Can I pass only one? Can I pass 10 different IP addresses in this dict?
Upvotes: 4
Views: 1911
Reputation: 432
You must provide your certificate as shown below; it works for me. I don't know whether free proxy services provide a certificate, but you can get one from SSL services or proxy providers. My proxy provider (Zyte) also provides a CA certificate.
verify='C:/Python39/zyte-proxy-ca.crt'
An example:
import requests
from bs4 import BeautifulSoup
response = requests.get(
    "https://www.whatismyip.com/",
    proxies={
        "http": "http://proxy:port/",
        "https": "http://proxy:port/",
    },
    verify='C:/Python39/zyte-proxy-ca.crt'
)
print("Scrape Process Has Been Successful...")
soup = BeautifulSoup(response.text, 'lxml')
print(soup.title)
Upvotes: 0
Reputation: 437
There are several things wrong with your code. I will tackle the low hanging fruit first.
First, your SOCKS proxies aren't working either. Here's why. The correct way to write the proxy dictionary can be found in the requests documentation.
# your way
proxy = {'socks4': 'socks4://ip:port'}
# the correct way
proxy = {'https': 'socks4://ip:port'} # note the s in https
# or another correct way
proxy = {'http': 'socks4://ip:port'} # Note the http with no s
# best correct way if your urls are mixed http:// https://
proxies = {
    'http': 'socks4://ip:port',
    'https': 'socks4://ip:port',
}
The http and https in those entries aren't the protocol of the proxy server; they refer to the scheme of your url. For example: https://www.example.com vs http://www.example.com. A request to an https:// url goes through the https entry, whereas a request to an http:// url goes through the http entry. If you only supply one entry, {'http': 'socks4://ip:port'}, and a request is made for an https:// url, that request will not get proxied, and your own ip will be exposed. Since there is no such thing as socks4://www.example.com when browsing, the requests you were making were not proxied.
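To make the key selection concrete, here is a minimal sketch (example.com and the ip:port placeholders are illustrative; socks4 proxies in requests also need the PySocks extra, installed with pip install requests[socks]):
import requests

# Only an 'http' key, as in the question; ip:port is a placeholder
proxies = {"http": "socks4://ip:port"}

# Scheme is http://, so this request is routed through the proxy
requests.get("http://example.com", proxies=proxies, timeout=15)

# Scheme is https:// and there is no 'https' key, so this request
# connects directly and exposes your own ip
requests.get("https://example.com", proxies=proxies, timeout=15)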
When doing any work through proxies and VPNs, I don't like testing the code and sending requests to the servers I will be running the final code on. I like using ipinfo.io. Their json response includes info on the connecting ip. This way, I can ensure the connection is going through the proxy and not sending false positives.
Note: It's not uncommon for the connecting IP to differ from the proxy ip due to load balancers. Just make sure the connecting ip isn't your own. You can check your own ip via visiting the url in the code below using your browser.
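A quick way to do that check, as a rough sketch (ipinfo.io/json generally answers without a token, subject to rate limits; the proxy dict is whichever one you are testing):
import requests

# Your real ip, fetched without a proxy
my_ip = requests.get("https://ipinfo.io/json", timeout=15).json()["ip"]

# The ip the server sees when going through the proxy under test
proxy = {"http": "socks4://ip:port", "https": "socks4://ip:port"}
proxied_ip = requests.get("https://ipinfo.io/json", proxies=proxy, timeout=15).json()["ip"]

if proxied_ip == my_ip:
    print("NOT proxied - your own ip is exposed")
else:
    print(f"Proxied, connecting ip is {proxied_ip}")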
Because you were using {'socks4': 'socks4://ip:port'} instead of the correct {'https': 'socks4://ip:port'}, you were still getting 200 status codes and your code was returning a false positive. It was returning the 200 because you did, in fact, connect, but with your own ip and not through a proxy.
Since you didn't provide specifics on what was actually happening, I added a bit of quick and dirty error handling to your code to find out what was going on. Some of the errors are related to server-side config, as most https proxies will require some sort of authentication, like a certificate or login (despite them being "free" and "public").
My imperfect but working code is below, tested on Python 3.8.12. Some info on the proxy connection errors is below it.
HINT: Check your urls. country=CH in the first one should probably say country=CN, and country=AT should probably say country=AR. My code reflects that.
from bs4 import BeautifulSoup
import requests
import json
import time
# LIST OF FREE PROXY APIS, THESE PROXIES ARE LAST TIME TESTED 50 MINUTES AGO
# PROTOCOLS: HTTP, HTTPS, SOCKS4 AND SOCKS5
list_of_proxy_content = [
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=CN&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=FR&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=DE&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=1500&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=AR&protocols=http%2Chttps%2Csocks4%2Csocks5",
"https://proxylist.geonode.com/api/proxy-list?limit=150&page=1&sort_by=lastChecked&sort_type=desc&filterLastChecked=50&country=IT&protocols=http%2Chttps%2Csocks4%2Csocks5",
]
# EXTRACTING JSON DATA FROM THIS LIST OF PROXIES
full_proxy_list = []
for proxy_url in list_of_proxy_content:
    proxy_json = requests.get(proxy_url).text
    proxy_json = json.loads(proxy_json)
    proxy_json = proxy_json["data"]
    full_proxy_list.extend(proxy_json)
if not full_proxy_list:
    print("No proxies to check. Exiting...")
    exit()
else:
    print(f"Found {len(full_proxy_list)} proxy servers. Checking...\n")
# CREATING PROXY DICT
final_proxy_list = []
for proxy in full_proxy_list:
    # print(proxy) # JSON VALUE FOR ALL DATA THAT GOES INTO PROXY
    protocol = proxy["protocols"][0]
    ip_ = proxy["ip"]
    port = proxy["port"]
    proxy = {
        "https": protocol + "://" + ip_ + ":" + port,
        "http": protocol + "://" + ip_ + ":" + port,
    }
    final_proxy_list.append(proxy)
# TESTING EACH PROXY AGAINST ipinfo.io
for proxy in final_proxy_list:
    print(proxy)
    try:
        # Use ipinfo.io to test proxy ip
        url = "https://ipinfo.io/json?token=67e01402d14101"
        r0 = requests.get(url, proxies=proxy, timeout=15)
        if r0.status_code == 200:
            # The 3-line block below only works on ipinfo.io
            output = r0.json()
            real_ip = output["ip"]
            print(f"GOOD PROXY [IP = {real_ip}] {proxy}\n")
            # Do something with the response
            html_page = r0.text
            soup = BeautifulSoup(r0.text, "html.parser")
            print(soup, "\n")
            r0.close()  # close the connection so it can be reused
            # Break out of the proxy loop so we do not send multiple successful
            # requests to the same url. Info needed was already obtained.
            # Comment out to check all possible proxies during testing.
            break
        else:
            # If the response code is something other than 200,
            # it means the proxy worked, but the website did not.
            print(f"BAD URL: [status code: {r0.status_code}]\n{r0.headers}\n")
            r0.close()
        time.sleep(5)  # Don't overload the server
    except Exception as error:
        print(f"BAD PROXY: Reason: {str(error)}\n")
Most of the errors you see will be timeout errors, which should be self-explanatory.
The other errors are server-side errors caused by configurations that prevent you from connecting.
Short list without getting too technical:
Remote end closed connection without response means the server end just flat out refused to answer your request despite connecting to it.
407 Proxy Authentication Required is one of those errors I mentioned above. This either wants you to provide a user/pass or a certificate.
[Errno 111] Connection refused is one of those errors I mentioned above.
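If you want the error handling to distinguish these cases instead of printing the raw message, a rough sketch using the exception classes that requests exposes (the proxy dict is assumed to come from the loop above):
import requests

try:
    r = requests.get("https://ipinfo.io/json", proxies=proxy, timeout=15)
except requests.exceptions.ConnectTimeout:
    print("Proxy did not answer within the timeout")
except requests.exceptions.ProxyError as error:
    # e.g. "Tunnel connection failed: 407" or "[Errno 111] Connection refused"
    print(f"Proxy refused the connection or wants credentials: {error}")
except requests.exceptions.SSLError as error:
    print(f"TLS problem, possibly the urllib3 issue mentioned below: {error}")
except requests.exceptions.RequestException as error:
    print(f"Other request failure: {error}")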
IMPORTANT: If you see any of the following errors after running the above code, downgrade your urllib3 library: check_hostname requires server_hostname, EOF occurred in violation of protocol, or SSL: WRONG_VERSION_NUMBER. There's a proxy bug, as well as a few others, in some of the most recent versions. You can downgrade by using the command pip install -U urllib3==1.25.11 or python3 -m pip install -U urllib3==1.25.11.
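To see which urllib3 version you currently have before deciding whether to downgrade, a quick check:
import urllib3

# Versions newer than 1.25.11 are the ones the note above is about
print(urllib3.__version__)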
Upvotes: 1
Reputation: 354
What you are looking for is a class that holds the proxies and rotates them periodically while your crawler runs (time-dependent or instruction-dependent), to mask your identity.
proxies = {"http": "http://10.10.1.10:3128",
"https": "http://10.10.1.10:1080"}
These types of addresses reference a domain that internally refreshes the underlying ip addresses and redirects you to one of the ip addresses underneath the domain.
Most of these are paid services.
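If you would rather rotate on the client side yourself instead of paying for a rotating endpoint, a minimal sketch could look like this (the class name is made up for illustration, and the sample addresses are the ones from the question, not guaranteed to be alive; the socks4 entry needs the PySocks extra, pip install requests[socks]):
import itertools
import requests

class ProxyRotator:
    """Cycles through a list of proxy dicts, switching to the next one on every request."""

    def __init__(self, proxy_dicts):
        self._pool = itertools.cycle(proxy_dicts)

    def get(self, url, **kwargs):
        proxy = next(self._pool)  # rotate to the next proxy
        return requests.get(url, proxies=proxy, timeout=15, **kwargs)

rotator = ProxyRotator([
    {"http": "http://62.171.177.80:3128", "https": "http://62.171.177.80:3128"},
    {"http": "socks4://185.168.173.35:5678", "https": "socks4://185.168.173.35:5678"},
])
response = rotator.get("https://ipinfo.io/json")
print(response.json().get("ip"))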
Upvotes: 0
Reputation:
I did some research on the topic and now I'm confused about why you want a proxy for HTTPS.
While it is understandable to want a proxy for HTTP (HTTP is unencrypted), HTTPS is secure.
Could it be possible your proxy is not connecting because you don't need one?
I am not a proxy expert, so I apologize if I'm putting out something completely stupid.
I don't want to leave you completely empty-handed though. If you are looking for complete privacy, I would suggest a VPN. Both Windscribe and RiseUpVPN are free and encrypt all your data on your computer. (The desktop version, not the browser extension.)
While this is not a fully automated process, it is still very effective.
Upvotes: 0