Shashank Mistry

Reputation: 78

Python - Web scraping with the requests module: URL returns 403 Forbidden

I'm trying to get the data from https://www.ecfr.gov/cgi-bin/ECFR?page=browse using the requests module in Python.

Somehow I'm getting HTTP 403 Forbidden.

import requests

url = "https://www.ecfr.gov/cgi-bin/ECFR?page=browse"

# Browser-like headers copied from developer tools.
# "Host" is left out so that requests derives it from the URL (the copied
# value was "httpbin.org"); the httpbin-specific "X-Amzn-Trace-Id" is dropped too.
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
}

r = requests.get(url, headers=header)
print(r.status_code)  # 403 in my case

I have also sent the request with a User-Agent and all the other header parameters that I see in the browser's developer tools.

I have tried free proxies, rotating User-Agent headers, cookies, and everything else I could get my hands on. But somehow the website can still tell that the request is not coming from a real browser.

In the HTML response, I can see that the website is asking me to complete a captcha.

Is there any way I can skip that?

Upvotes: 0

Views: 458

Answers (1)

Igor Savinkin

Reputation: 6277

Inspecting the HTTP requests, I found a Cloudflare server trace in the response.

Cloudflare, with its ScrapeShield feature, is well known for its scrape protection and configurable security levels.
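To make this kind of check repeatable, here is a minimal sketch of a heuristic that flags a response as a likely Cloudflare block. The function name looks_like_cloudflare_block is hypothetical, and the rule it applies (a Cloudflare "Server" or "CF-RAY" header combined with a 403/503 status or the word "captcha" in the body) is my assumption about what such a response usually looks like, not an official detection method.

```python
from typing import Mapping


def looks_like_cloudflare_block(
    status_code: int, headers: Mapping[str, str], body: str = ""
) -> bool:
    """Heuristic guess: does this response look like a Cloudflare block or challenge?"""
    # Header names are case-insensitive, so normalise them before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Cloudflare-fronted sites usually answer with "Server: cloudflare"
    # and include a "CF-RAY" request identifier.
    served_by_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare" or "cf-ray" in lowered
    )

    # Assumption: blocks and challenges tend to arrive as 403 or 503,
    # or mention a captcha somewhere in the HTML.
    blocked_status = status_code in (403, 503)
    mentions_captcha = "captcha" in body.lower()

    return served_by_cloudflare and (blocked_status or mentions_captcha)
```

With the requests response from the question, this could be called as looks_like_cloudflare_block(r.status_code, r.headers, r.text). A True result only suggests that changing headers or proxies alone is unlikely to help.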

Is there any way I can skip that?

There are two ways out:

  1. Plug in a captcha-solving service. That is not easy if you are working in pure Python.

  2. Use browser automation, so that ScrapeShield thinks a real user is browsing the website. This takes considerably more resources and time (including development time). See a scrape speed comparison table of headless Chromium automation vs. bare HTTP requests.
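The second option could be sketched roughly as follows, assuming Selenium 4 and a local Chrome installation (neither is named above; they are simply one common way to automate a browser from Python). The helper names chrome_arguments and fetch_with_browser and the chosen window size are hypothetical. A headless browser may still be challenged, so treat this as a starting point rather than a guaranteed bypass.

```python
from typing import List


def chrome_arguments(headless: bool = True) -> List[str]:
    """Chrome command-line switches for the session (illustrative choices)."""
    args = ["--window-size=1280,800"]
    if headless:
        # "--headless=new" selects Chrome's newer headless mode.
        args.append("--headless=new")
    return args


def fetch_with_browser(url: str, headless: bool = True) -> str:
    """Load `url` in a real Chrome instance and return the rendered HTML."""
    # Imported lazily so that chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if the page load fails.
        driver.quit()
```

Calling fetch_with_browser("https://www.ecfr.gov/cgi-bin/ECFR?page=browse", headless=False) opens a visible window, which makes it easier to see whether a captcha page is still being served.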

Upvotes: 1
