Nickolas

Reputation: 353

Can't bypass cloudflare with python cloudscraper

I ran into a Cloudflare issue when I tried to parse a website.

I have this code:

import cloudscraper

url = "https://author.today"
scraper = cloudscraper.create_scraper()
print(scraper.post(url).status_code)

Running it raises:

cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.

I searched for a workaround but couldn't find a solution. If you visit the website in a browser, you see:

Checking your browser before accessing author.today.

Is there any way to bypass Cloudflare in my case?

Upvotes: 19

Views: 42249

Answers (6)

fab23

Reputation: 77

I suggest the following workflow to "try" to avoid Cloudflare's WAF/bot mitigation:

  • don't rotate user agents, proxies, or odd tunnels while browsing
  • don't use fixed IP addresses; consumer connections such as xDSL, home links, and 4G/LTE are better
  • try to appear as a mobile device rather than a desktop or tablet
  • try to reproduce pointer movements: record your own mouse moves and replay them 1:1 while scraping (yes, you need JS enabled and a headless browser that can pass as a "common" one)
  • don't cycle through different Cloudflare-protected sites, otherwise the scraping IP will be greylisted within a minute (in other words, keep your own list of targets to avoid; touch those sites and you will land on the CF blacklist without fail)
  • try to reproduce real-life navigation in every respect, including errors, waits, and more
  • after each scrape, check the IP you used against popular blacklists, otherwise serious errors will soon appear (CrowdSec is a good starting point)
  • the usual scrape pretends to be Googlebot, and a single regex WAF rule on Cloudflare blocks 99.99% of those attempts. Avoid posing as Google and try to be LESS evil instead (e.g., ask webmasters for APIs or a data export, if any).

Source: I have used Cloudflare (Enterprise) with hundreds of domains and thousands of records since the beginning of the company.

That way you will get closer to your goal (and you will help them improve overall security).
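
The point about real-life navigation, including waits, could be approximated with randomized pauses between requests. This is a minimal sketch using only the standard library; the function name human_delays and every timing value are assumptions chosen for illustration, not thresholds that Cloudflare documents:

```python
import random


def human_delays(n, base=2.0, jitter=1.5, long_pause_chance=0.1,
                 long_pause=(15.0, 45.0), rng=None):
    """Return n pause lengths in seconds that avoid a fixed request rhythm."""
    rng = rng or random.Random()
    delays = []
    for _ in range(n):
        if rng.random() < long_pause_chance:
            # Occasionally linger, as a person reading the page would.
            delays.append(rng.uniform(*long_pause))
        else:
            # Normal gap: base time plus a random jitter in [0, jitter].
            delays.append(base + rng.uniform(0.0, jitter))
    return delays
```

A scraper would then call time.sleep(pause) for each pause in human_delays(...) between page loads. Pacing alone is unlikely to pass a JavaScript challenge; it only covers the timing part of the list.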

Upvotes: 2

Zorome

Reputation: 148

Install httpx with HTTP/2 support:

pip3 install "httpx[http2]"

Define an HTTP/2 client:

import httpx

client = httpx.Client(http2=True)

Make the request:

response = client.get("https://author.today")

Cheers!
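
Whether a response like that actually got past the challenge can be checked heuristically. The sketch below is an assumption-laden helper, not part of httpx or cloudscraper: the name looks_like_cloudflare_challenge is hypothetical, and the markers it looks for (the cf-mitigated header, the "Checking your browser" and "Just a moment" page text, and the 403/503 status codes) are signals commonly seen on challenge pages and may change:

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristically decide whether a response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    if lowered.get("cf-mitigated") == "challenge":
        return True
    # Fallback: Cloudflare server header + blocking status + challenge text.
    from_cloudflare = "cloudflare" in lowered.get("server", "")
    markers = ("Checking your browser", "Just a moment")
    has_marker = any(marker in body for marker in markers)
    return from_cloudflare and status_code in (403, 503) and has_marker
```

With the client above, it would be called as looks_like_cloudflare_challenge(response.status_code, response.headers, response.text).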

Upvotes: 6

Pierluigi Vinciguerra

Reputation: 104

I'd try to create a Playwright scraper that mimics a real user. This works for me most of the time; you just need to find the right settings, which can vary from website to website. Otherwise, if the website has a native app, try to figure out how the app behaves and then mimic it.
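
A rough sketch of what such a Playwright scraper might look like, using the synchronous Playwright API. The helper names build_context_options and fetch_html, the viewport, locale, timezone, and the wait time are all assumptions for illustration; the right settings depend on the site, and nothing here is guaranteed to pass a challenge:

```python
def build_context_options():
    """Hypothetical browser-context settings meant to resemble a desktop user."""
    return {
        "viewport": {"width": 1366, "height": 768},
        "locale": "en-US",
        "timezone_id": "Europe/London",
    }


def fetch_html(url, wait_ms=8000):
    """Open url in a visible Chromium window and return the rendered HTML."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(**build_context_options())
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded")
        # Give any JavaScript challenge time to finish before reading the page.
        page.wait_for_timeout(wait_ms)
        html = page.content()
        browser.close()
    return html
```

Calling fetch_html("https://author.today") requires pip install playwright followed by playwright install chromium.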

Upvotes: 0

user2284144

Reputation: 93

I used this line:

scraper = cloudscraper.create_scraper(browser={'browser': 'chrome', 'platform': 'windows', 'mobile': False})

and after that used the httpx package:

with httpx.Client() as s:
    # remaining code

With that I was able to get past the error cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.

Upvotes: -4

Hello

Reputation: 7

import cfscrape
from fake_useragent import UserAgent

ua = UserAgent()

s = cfscrape.create_scraper()

# The HTTP header is named "User-Agent"; ua.random picks a random browser string.
k = s.post("https://author.today", headers={"User-Agent": ua.random})
print(k)

Upvotes: -1

dcts

Reputation: 1639

Although it does not seem to work for this site, adding some parameters when initializing the scraper sometimes helps:

import cloudscraper

url = "https://author.today"
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)
print(scraper.post(url).status_code)
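
To compare several browser settings in one run instead of editing the script each time, something like the following might help. It is only a sketch: try_profiles and demo are hypothetical names, the two profiles are the ones mentioned in this thread, and it uses GET rather than POST on the assumption that the goal is simply to fetch the page:

```python
def try_profiles(url, profiles, make_scraper, challenge_errors=(Exception,)):
    """Request url once per browser profile and record what happened."""
    results = []
    for profile in profiles:
        scraper = make_scraper(browser=profile)
        try:
            status = scraper.get(url).status_code
            results.append((profile["platform"], status, None))
        except challenge_errors as exc:
            # Keep going so every profile gets one attempt.
            results.append((profile["platform"], None, type(exc).__name__))
    return results


PROFILES = [
    {"browser": "chrome", "platform": "windows", "mobile": False},
    {"browser": "chrome", "platform": "android", "desktop": False},
]


def demo():
    # Not called automatically: needs cloudscraper installed and network access.
    import cloudscraper
    return try_profiles(
        "https://author.today",
        PROFILES,
        cloudscraper.create_scraper,
        (cloudscraper.exceptions.CloudflareChallengeError,),
    )
```

Each result is a (platform, status code, error name) tuple, so a profile that still hits the version 2 challenge shows up with the exception name instead of a status code.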

Upvotes: 1
