Reputation: 93

How to get info/data from blocked web sites with BeautifulSoup?

I want to write a script with Python 3.7, but first I have to scrape the data for it. I have no problem connecting to and getting data from un-banned sites, but if the site is banned it doesn't work.

If I use a VPN service, I can open these "banned" sites in the Chrome browser.

I tried setting a proxy in PyCharm, but I failed; I just got errors all the time. What's the simplest free way to solve this problem?

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

req = Request('https://www.SOMEBANNEDSITE.com/', headers={'User-Agent': 'Mozilla/5.0'})  # that web site is blocked in my country

webpage = urlopen(req).read() # code stops running at this line because it can't connect to the site. 

page_soup = soup(webpage, "html.parser")

Upvotes: 0

Answers (2)

Bitto

Reputation: 8225

There are multiple ways to scrape blocked sites. A solid way is to use a proxy service as already mentioned.

A proxy server, also known as a "proxy", is a computer that acts as a gateway between your computer and the internet. When you are using a proxy, your requests are forwarded through it, so your IP is not directly exposed to the site that you are scraping.

You can't simply take any IP (say xxx.xx.xx.xxx) and port (say yy), do

import requests

proxies = { 'http': "http://xxx.xx.xx.xxx:yy", 
            'https': "https://xxx.xx.xx.xxx:yy"}

r = requests.get('http://www.somebannedsite.com', proxies=proxies)

and expect to get a response.

The machine at that address has to be configured as a proxy that accepts your request and sends you back a response.

So, where can you get a proxy?

a. You could buy proxies from many providers.

b. Use a list of free proxies from the internet.

You don't need to buy proxies unless you are doing massive-scale scraping. For now I will focus on the free proxies available on the internet. Just do a Google search for "free proxy provider" and you will find a list of sites offering free proxies. Go to any one of them and pick any IP and its corresponding port.

import requests

# replace the IP and port below with the IP and port you got from any of the free sites
# (both entries use an http:// proxy URL, since these free proxies are usually plain HTTP proxies)

proxies = { 'http': "http://182.52.51.155:39236",
            'https': "http://182.52.51.155:39236"}

r = requests.get('http://www.somebannedsite.com', proxies=proxies)
print(r.text)

If possible, you should use a proxy with the 'Elite' anonymity level (most sites that list free proxies specify the anonymity level). If you are interested, you can also do a Google search to find the difference between 'elite', 'anonymous' and 'transparent' proxies.

Note:

Most of these free proxies are not that reliable, so if you get an error with one IP and port combination, try a different one.
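
Trying one proxy after another by hand gets tedious, so the retry can be scripted. The following is only a sketch: the function name, the candidate addresses and the injectable `get` parameter are made up for illustration, and it assumes the candidates are plain HTTP proxies given as "ip:port" strings.

```python
def fetch_via_first_working_proxy(url, candidates, timeout=5, get=None):
    """Try each "ip:port" candidate in order.

    Returns (address, response) for the first proxy that answers,
    or None if every candidate fails.
    """
    if get is None:
        # Imported lazily so a stub can be passed in for offline testing.
        import requests
        get = requests.get

    for address in candidates:
        # Assumption: a plain HTTP proxy, so the same http:// proxy URL
        # is used for both http and https targets.
        proxy_url = f"http://{address}"
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            response = get(
                url,
                proxies=proxies,
                timeout=timeout,  # do not hang forever on a dead proxy
                headers={"User-Agent": "Mozilla/5.0"},
            )
            response.raise_for_status()
        except OSError:
            # requests' exceptions (proxy errors, timeouts, HTTP errors)
            # all derive from OSError, so this moves on to the next proxy.
            continue
        return address, response
    return None


# Hypothetical usage (addresses are placeholders, not known-good proxies):
# result = fetch_via_first_working_proxy(
#     "http://www.somebannedsite.com",
#     ["182.52.51.155:39236", "203.0.113.7:8080"],
# )
# if result is not None:
#     address, response = result
#     print(address, len(response.text))
```

The timeout matters here: without it, a dead proxy can block the script for a long time before the next candidate is tried.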

Upvotes: 2

0xInfection

Reputation: 2919

Your best option is to use a proxy via the requests library, since it lets you route requests through a proxy with very little setup.

Here is a small example:

import requests
from bs4 import BeautifulSoup as soup
# use your own working proxy here
# replace host with your proxy IP and port with its port number
# (an http:// proxy URL is used for both entries, assuming a plain HTTP proxy)
proxies = { 'http': "http://host:port",
            'https': "http://host:port"}

text = requests.get('http://www.somebannedsite.com', proxies=proxies, headers={'User-Agent': 'Mozilla/5.0'}).text
page_soup = soup(text, "html.parser") # use whatever parser you prefer, maybe lxml?
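
If you would rather keep the urllib.request code from the question, the standard library can route through a proxy as well, using ProxyHandler. This is a rough sketch rather than a tested recipe for any particular proxy: the helper names are made up, and it again assumes a plain HTTP proxy at host:port.

```python
from urllib.request import ProxyHandler, Request, build_opener


def make_proxy_opener(host, port):
    """Build an opener that sends http and https requests via one HTTP proxy."""
    proxy_url = f"http://{host}:{port}"
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)


def fetch_html(opener, url, timeout=10):
    """Fetch the raw page bytes through the given opener."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(req, timeout=timeout) as resp:
        return resp.read()


# Hypothetical usage (host and port are placeholders):
# opener = make_proxy_opener("182.52.51.155", 39236)
# webpage = fetch_html(opener, "https://www.SOMEBANNEDSITE.com/")
# page_soup = soup(webpage, "html.parser")
```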

If you want to use SOCKS5, you have to install the extra dependencies with pip install requests[socks] and then replace the proxies part with:

# user is your authentication username
# pass is your auth password
# host and port are similar as above
proxies = { 'http': 'socks5://user:pass@host:port', 
            'https': 'socks5://user:pass@host:port' }
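
Two details are easy to get wrong with SOCKS proxy URLs: special characters in the username or password, and where DNS lookups happen. With the socks5 scheme requests resolves hostnames locally, while socks5h lets the proxy resolve them, which can matter if the block is DNS-based. A small helper can take care of both; the function name and its parameters are hypothetical, and percent-encoding the credentials is an assumption about what your proxy credentials need.

```python
from urllib.parse import quote


def socks5_proxies(host, port, user=None, password=None, remote_dns=True):
    """Build a requests-style proxies dict for a SOCKS5 proxy.

    remote_dns=True uses the socks5h scheme (hostnames resolved by the proxy);
    remote_dns=False uses socks5 (hostnames resolved on your machine).
    """
    scheme = "socks5h" if remote_dns else "socks5"
    auth = ""
    if user is not None:
        # Percent-encode so characters like '@' or ':' do not break the URL.
        auth = quote(user, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    url = f"{scheme}://{auth}{host}:{port}"
    return {"http": url, "https": url}


# Hypothetical usage:
# proxies = socks5_proxies("host", 1080, user="user", password="pass")
# text = requests.get("http://www.somebannedsite.com", proxies=proxies).text
```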

If you don't have proxies at hand, you can fetch some proxies.

Upvotes: 1
