Reputation: 78
I'm trying to get the data from https://www.ecfr.gov/cgi-bin/ECFR?page=browse using the requests module in Python, but I keep getting HTTP 403 Forbidden.
import requests

url = "https://www.ecfr.gov/cgi-bin/ECFR?page=browse"

# Browser-like headers copied from developer tools.
# "Host": "httpbin.org" and "X-Amzn-Trace-Id" are left out: they belong to an
# httpbin.org echo, and requests derives the correct Host from the URL.
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
}
r = requests.get(url, headers=header)
I have also sent the request with the User-Agent and all the other header values that I see in the browser's developer tools.
I have tried free proxies, rotating User-Agent headers, cookies, and everything else I could get my hands on, but the website still seems to detect that the request is not coming from a real browser.
In the HTML response, I can see that the website is asking me to complete a CAPTCHA.
Is there any way I can skip that?
Upvotes: 0
Views: 458
Reputation: 6277
Inspecting the HTTP requests, I found Cloudflare in the server response trace.
Cloudflare's ScrapeShield is well known for its scrape protection and its configurable security levels.
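To check this from code rather than by eye, here is a minimal sketch of a heuristic that flags a response as a probable Cloudflare challenge. The function name `looks_like_cloudflare_challenge` is hypothetical, and the rule itself (a `Server: cloudflare` or `CF-RAY` header, plus a 403 or 503 status, plus the word "captcha" in the body) is my assumption, not an official detection method.

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristic (an assumption, not an official check): True when the
    response appears to be a Cloudflare CAPTCHA/challenge page."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}

    # Cloudflare-fronted responses normally carry one or both of these.
    served_by_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare" or "cf-ray" in lowered
    )

    # 403 is the status from the question; 503 is included as an assumption.
    blocked = status_code in (403, 503)
    mentions_captcha = "captcha" in body.lower()

    return served_by_cloudflare and blocked and mentions_captcha
```

With requests, it could be called as `looks_like_cloudflare_challenge(r.status_code, r.headers, r.text)`.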
Is there any way I can skip that?
There are 2 ways out:
1. Plug in a CAPTCHA-solving service. That is not easy if you are working in plain Python alone.
2. Use browser automation so that ScrapeShield thinks a real user is browsing the website. This takes considerably more resources and time (including development time). See a scrape speed comparison table of headless Chromium automation vs. bare HTTP requests.
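As a rough illustration of the browser automation route, here is a minimal sketch using Selenium with Chrome. It assumes the third-party `selenium` package and a local Chrome install. The helper names `chrome_arguments` and `fetch_rendered_html`, the window size, and the timeout are all hypothetical choices of mine. There is no guarantee that this gets past the challenge, since a headless browser can itself be detected.

```python
def chrome_arguments(headless=True):
    """Build the Chrome command-line flags (illustrative choices only)."""
    args = ["--window-size=1280,900"]
    if headless:
        # "--headless=new" selects Chrome's newer headless mode.
        args.append("--headless=new")
    return args


def fetch_rendered_html(url, headless=True, timeout_seconds=30):
    """Load `url` in a real Chrome instance and return the page HTML."""
    # Imported lazily so the module loads even without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(timeout_seconds)
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading fails.
        driver.quit()
```

For example, `html = fetch_rendered_html("https://www.ecfr.gov/cgi-bin/ECFR?page=browse", headless=False)` would open a visible browser window, which may be challenged less often than a headless one.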
Upvotes: 1