Reputation: 940
I know this might be a frequently asked question, but I believe this is a different one.
Cloudflare prevents programmatically sent requests by responding status code 503 and saying "Please turn JavaScript on and reload the page.". Both python requests
module and curl
command raise this error. However, browsing on the same host with Chrome browser is fine, even if under "Incognito" mode.
I have made these attempts but failed to bypass it:
cloudscraper
module. Like thisuser-agent
, cookie
from opened browser page. Like thismechanize
module. Like thisrequests_html
to run JS scripts on the page. Like thisAccording to my inspections, I found that, in a newly opened Chrome Incognito Window, when visiting https://onlinelibrary.wiley.com/doi/full/10.1111/jvs.13069
, the following requests happens:
cookieSet=1
query param, i.e. https://onlinelibrary.wiley.com/doi/full/10.1111/jvs.13069?cookieSet=1
. The response also contains set-cookie
headers. The response has no body.set-cookie
header and has no body.However, in a curl request without redirection enabled (i.e. without -L
arg), I got status code 503 and a HTML response body which says Please turn JavaScript on and reload the page.
.
curl -i -v 'https://onlinelibrary.wiley.com/doi/abs/10.1111/jvs.13069' \
--header 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
--header 'accept-encoding: gzip, deflate, br' \
--header 'accept-language: en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,ja;q=0.6' \
--header 'cache-control: no-cache' \
--header 'cookie: MAID=k4Rf/MejFqG1LjKdveWUPQ==; I2KBRCK=1; rskxRunCookie=0; rCookie=i182bjtmkm3tmujm7wb4xl6fx8wuv; osano_consentmanager_uuid=35ffb0d0-e7e0-487a-a6a5-b35cad9e589f; osano_consentmanager=EtuJH5neWpR-w0VyI9rGqVBE85dQA-2D4f3nUxLGsObfRMLPNtojj-WolqO0wrIvAr3wxlwRPXQuL0CCFvMIDZxXehUBFEicwFqrV4kgDwBshiqbbfh1M3w3V6WWcesS8ZLdPX4iGQ3yTPaxmzpOEBJbeSeY5dByRkR7P2XyOEDAWPT8by2QQjsCt3y3ttreU_M3eV_MJCDCgknIWOyiKdL_FBYJz-ddg8MFAb1N8YBTRQbQAg8r-bSO1vlCqPyWlgzGn-A5xgIDWlCDIpej0Xg2rjA=; JSESSIONID=aaaFppotKtA-t7ze73Rjy; SERVER=WZ6myaEXBLGhNb4JIHwyZ2nYKk2egCfX; MACHINE_LAST_SEEN=2022-08-05T00%3A52%3A30.362-07%3A00; __cf_bm=d9mhQ_ZtETjf41X0VuxDl6GkIZbQtNLJnNIOtDoIPuA-1659685954-0-AXLwPXO1kJb2/IQc+zIesAsL71FoLTgRJqS5M5fxizuFMTw92mMT/yRv5cIq6ZMiRcZE1DchGsO2ZZMdv+/P4JSdUDMAcepY/oXIKFQgauELPNrwiwG/7XYXFRy91+qreazjYASX6Fq0Ir90MNfJ8EcWc10KJyGvSN7QtledQ6Lu9B5S1tqHoxlddPAMOtdL6Q==; lastRskxRun=1659686676640' \
--header 'pragma: no-cache' \
--header 'sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"' \
--header 'sec-ch-ua-mobile: ?0' \
--header 'sec-ch-ua-platform: "macOS"' \
--header 'sec-fetch-dest: document' \
--header 'sec-fetch-mode: navigate' \
--header 'sec-fetch-site: none' \
--header 'sec-fetch-user: ?1' \
--header 'upgrade-insecure-requests: 1' \
--header 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
* Trying 162.159.129.87...
* TCP_NODELAY set
* Connected to onlinelibrary.wiley.com (162.159.129.87) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /Users/cosmo/anaconda3/ssl/cacert.pem
CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=US; ST=California; L=San Francisco; O=Cloudflare, Inc.; CN=sni.cloudflaressl.com
* start date: Apr 17 00:00:00 2022 GMT
* expire date: Apr 17 23:59:59 2023 GMT
* subjectAltName: host "onlinelibrary.wiley.com" matched cert's "onlinelibrary.wiley.com"
* issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
* SSL certificate verify ok.
> GET /doi/abs/10.1111/jvs.13069 HTTP/1.1
> Host: onlinelibrary.wiley.com
> accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
> accept-encoding: gzip, deflate, br
> accept-language: en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,ja;q=0.6
> cache-control: no-cache
> cookie: MAID=k4Rf/MejFqG1LjKdveWUPQ==; I2KBRCK=1; rskxRunCookie=0; rCookie=i182bjtmkm3tmujm7wb4xl6fx8wuv; osano_consentmanager_uuid=35ffb0d0-e7e0-487a-a6a5-b35cad9e589f; osano_consentmanager=EtuJH5neWpR-w0VyI9rGqVBE85dQA-2D4f3nUxLGsObfRMLPNtojj-WolqO0wrIvAr3wxlwRPXQuL0CCFvMIDZxXehUBFEicwFqrV4kgDwBshiqbbfh1M3w3V6WWcesS8ZLdPX4iGQ3yTPaxmzpOEBJbeSeY5dByRkR7P2XyOEDAWPT8by2QQjsCt3y3ttreU_M3eV_MJCDCgknIWOyiKdL_FBYJz-ddg8MFAb1N8YBTRQbQAg8r-bSO1vlCqPyWlgzGn-A5xgIDWlCDIpej0Xg2rjA=; JSESSIONID=aaaFppotKtA-t7ze73Rjy; SERVER=WZ6myaEXBLGhNb4JIHwyZ2nYKk2egCfX; MACHINE_LAST_SEEN=2022-08-05T00%3A52%3A30.362-07%3A00; __cf_bm=d9mhQ_ZtETjf41X0VuxDl6GkIZbQtNLJnNIOtDoIPuA-1659685954-0-AXLwPXO1kJb2/IQc+zIesAsL71FoLTgRJqS5M5fxizuFMTw92mMT/yRv5cIq6ZMiRcZE1DchGsO2ZZMdv+/P4JSdUDMAcepY/oXIKFQgauELPNrwiwG/7XYXFRy91+qreazjYASX6Fq0Ir90MNfJ8EcWc10KJyGvSN7QtledQ6Lu9B5S1tqHoxlddPAMOtdL6Q==; lastRskxRun=1659686676640
> pragma: no-cache
> sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
> sec-ch-ua-mobile: ?0
> sec-ch-ua-platform: "macOS"
> sec-fetch-dest: document
> sec-fetch-mode: navigate
> sec-fetch-site: none
> sec-fetch-user: ?1
> upgrade-insecure-requests: 1
> user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
>
< HTTP/1.1 503 Service Temporarily Unavailable
HTTP/1.1 503 Service Temporarily Unavailable
< Date: Fri, 05 Aug 2022 08:56:14 GMT
Date: Fri, 05 Aug 2022 08:56:14 GMT
< Content-Type: text/html; charset=UTF-8
Content-Type: text/html; charset=UTF-8
< Transfer-Encoding: chunked
Transfer-Encoding: chunked
< Connection: close
Connection: close
< X-Frame-Options: SAMEORIGIN
X-Frame-Options: SAMEORIGIN
< Permissions-Policy: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
Permissions-Policy: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
< Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Expires: Thu, 01 Jan 1970 00:00:01 GMT
Expires: Thu, 01 Jan 1970 00:00:01 GMT
< Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< Set-Cookie: __cf_bm=Z8oUUTMhz8K.._yzicdZVzO49fmFKCtgS2CDTlnFvpU-1659689774-0-ARUAfH3m6VNwz09gKVsRECZkXJf5BdqNsW+oIPcy1oKzvppiMWxz7HGFkEwMuGHGzrHRDy5nV+VVj74AxTN8ThozSiHa/8sYH0IwMMe62woC; path=/; expires=Fri, 05-Aug-22 09:26:14 GMT; domain=.onlinelibrary.wiley.com; HttpOnly; Secure; SameSite=None
Set-Cookie: __cf_bm=Z8oUUTMhz8K.._yzicdZVzO49fmFKCtgS2CDTlnFvpU-1659689774-0-ARUAfH3m6VNwz09gKVsRECZkXJf5BdqNsW+oIPcy1oKzvppiMWxz7HGFkEwMuGHGzrHRDy5nV+VVj74AxTN8ThozSiHa/8sYH0IwMMe62woC; path=/; expires=Fri, 05-Aug-22 09:26:14 GMT; domain=.onlinelibrary.wiley.com; HttpOnly; Secure; SameSite=None
< Vary: Accept-Encoding
Vary: Accept-Encoding
< Strict-Transport-Security: max-age=15552000
Strict-Transport-Security: max-age=15552000
< Server: cloudflare
Server: cloudflare
< CF-RAY: 735e5184085e52cb-LAX
CF-RAY: 735e5184085e52cb-LAX
<
<!DOCTYPE HTML>
<html lang="en-US">
...... (HTML codes saying "Please turn JavaScript on and reload the page")
The HTML looks like this as rendered by Postman:
And yes, Postman can neither visit the url.
According to these observations, I believe that the site behaves differently when receiving a first request from browser and curl
. But I don't know how does Cloudflare tells between a human being (using a browser) and a bot (using curl
). As I have described before, the two kinds of clients have no difference in:
Upvotes: 2
Views: 3519
Reputation: 11
I ran into the same problem and solved it by adding headers to one of the solutions, using cloudscraper, already mentioned in the initial post.
import cloudscraper
url = 'https://onlinelibrary.wiley.com/doi/full/10.1111/jvs.13069'
headers = {
'Wiley-TDM-Client-Token': 'your-Wiley-TDM-Client-Token',
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
scraper = cloudscraper.create_scraper()
r = scraper.get(url, headers = headers)
print(r.status_code)
This code snipped returns a status_code 200
It seems like Cloudflare looks for the presence of the 'Wiley-TDM-Client-Token' header, and not for the actual token as removing it from the headers results in a 'Detected a Cloudflare version 2 challenge' error. The code posted here already works without adding your own TDM key, but I expect that for paid content a TDM with the right licenses is needed.
Upvotes: 1