Montoya

Reputation: 3049

Python - urllib3 receives 403 'Forbidden' while crawling websites

I am using Python 3 and urllib3 to crawl and download websites. I crawled a list of 4000 different domains, and about 5 of them returned HTTP error code 403 'Forbidden'.

In my browser these websites exist and respond correctly. They are probably detecting me as a crawler and blocking me from getting the data.

This is my code:

from urllib3 import PoolManager, util, Retry
import certifi
from urllib3.exceptions import MaxRetryError


class FailedToDownload(Exception):
    pass


manager = PoolManager(cert_reqs='CERT_REQUIRED',
                      ca_certs=certifi.where(),
                      num_pools=15,
                      maxsize=6,
                      timeout=40.0,
                      retries=Retry(connect=2, read=2, redirect=10))
url_to_download = "https://www.uvision.co.il/"
headers = util.make_headers(accept_encoding='gzip, deflate',
                            keep_alive=True,
                            user_agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0")
headers['Accept-Language'] = "en-US,en;q=0.5"
headers['Connection'] = 'keep-alive'
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
try:
    response = manager.request('GET',
                               url_to_download,
                               preload_content=False,
                               headers=headers)
except MaxRetryError as ex:
    raise FailedToDownload()

Example of websites that have rejected me: https://www.uvision.co.il/ and http://www.medyummesut.net/.

Another website that doesn't work and throws MaxRetryError is:

http://www.nytimes.com/2015/10/28/world/asia/south-china-sea-uss-lassen-spratly-islands.html?hp&action=click&pgtype=Homepage&module=first-column-region&region=top-news&WT.nav=top-news&_r=1

I've also tried using the exact same headers that Firefox uses, and it didn't work either. Am I doing something wrong here?

Upvotes: 1

Views: 1095

Answers (1)

James K

Reputation: 3742

You specify keep_alive=True, which adds the header connection: keep-alive.

You then also add a header Connection: keep-alive (note the difference in case). Since dictionary keys are case-sensitive, the request ends up with two keep-alive headers, and this seems to be what is causing the problem. To fix it, just remove the redundant line:

headers['Connection'] = 'keep-alive' 
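A minimal sketch illustrating the diagnosis above: make_headers(keep_alive=True) already inserts a lowercase connection key, so setting Connection by hand leaves two entries in the dict, and both get sent as separate headers.

```python
from urllib3 import util

# make_headers(keep_alive=True) already sets a lowercase
# "connection" header in the returned dict.
headers = util.make_headers(keep_alive=True)
print(headers)  # {'connection': 'keep-alive'}

# Adding "Connection" by hand creates a second, duplicate entry,
# because dict keys are case-sensitive even though HTTP header
# names are not.
headers['Connection'] = 'keep-alive'
print(len(headers))  # 2
```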

Upvotes: 1
