Reputation: 841
I am trying to scrape a certain website, let's call it "https://some-website.com". For the past few months I was able to do it without problems, but a few days ago I noticed the scraper no longer works: all requests return a 403 Forbidden status.
For the last 3 months I was using the code below to scrape the data.
import requests
from fake_useragent import UserAgent

# UserAgent().random picks a random real-browser User-Agent string on each call
res = requests.get(<url>, headers={'User-Agent': UserAgent().random})
This always returned a nice 200 OK with the page I needed, until a few days ago, when I started getting a 403 Forbidden error. Somewhere in the response text I can spot the sentence "Enable JavaScript and cookies to continue".
As you can see in the code, I already rotate the User-Agent header randomly, which is the usual recommendation for fixing this kind of problem.
Naturally I suspected they blacklisted my IP (maybe in combination with certain user agents) and no longer allow me to scrape. However, I implemented a solution that uses a proxy, and I still get a 403.
import requests
from fake_useragent import UserAgent

proxies = {
    "https": "http://some_legit_proxy",
    "http": "http://some_legit_proxy",
}
res = requests.get(<url>, headers={'User-Agent': UserAgent().random}, proxies=proxies)
The proxy is a residential proxy.
What baffles me most is that if I remove the random User-Agent and use the default requests User-Agent, the scrape suddenly works.
import requests
res = requests.get(<url>) # 'User-Agent': 'python-requests/2.28.1'
# 200 OK
This tells me the website hasn't suddenly started requiring JavaScript, since the scrape does work; it seems they are somehow blocking me specifically.
I have a few ideas for working around this, but since I don't understand what is actually happening, I can't be sure any of them will be scalable in the future.
Please help me understand what is happening here.
Upvotes: 4
Views: 3039
Reputation: 239
You'll probably need rotating proxies, or something that hides or bypasses the device fingerprinting that may be exposing your script as a bot (see the sketch after the reference below).
Reference:
How Web Unlocker is enabling better fingerprinting, auto-unlocking, and CAPTCHA-solving
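For illustration, here is a minimal sketch of what per-request proxy rotation could look like with requests; the proxy endpoints below are placeholders, not real ones:

import itertools
import requests

# Hypothetical pool of residential proxy endpoints -- replace with your own
PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
])

def fetch(url):
    proxy = next(PROXY_POOL)  # advance to the next proxy on every call
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=10)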
Upvotes: 5
Reputation: 1135
As @h0r53 mentioned, I think Cloudflare detects whether the request is actually made by a browser running JavaScript.
You could try using this answer
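The linked answer isn't reproduced here, but one commonly suggested way to handle Cloudflare's JavaScript challenge from plain Python (an assumption on my part, not necessarily what that answer proposes) is the cloudscraper package:

import cloudscraper

# cloudscraper wraps a requests session and attempts to pass
# Cloudflare's "Enable JavaScript and cookies" challenge
scraper = cloudscraper.create_scraper()
res = scraper.get("https://some-website.com")
print(res.status_code)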
Upvotes: 0
Reputation: 3219
The site in question is hosted by Cloudflare. Cloudflare does things like TLS fingerprinting on the edge, which can determine that the User-Agent you've provided doesn't match the TLS fingerprint of Python's requests module. This is a common technique used by cloud providers as a means of bot deterrence. I'd recommend first trying to scrape without spoofing the user agent, and if you still have trouble, consider a modern browser automation platform such as Puppeteer.
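Puppeteer is a Node.js tool; since the question is in Python, here is a rough equivalent sketch using Playwright (assuming pip install playwright followed by playwright install chromium):

from playwright.sync_api import sync_playwright

# A real browser presents a consistent TLS fingerprint and User-Agent,
# and runs the JavaScript challenge natively
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://some-website.com")
    html = page.content()  # fully rendered HTML
    browser.close()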
Good luck, friend. :)
Upvotes: 1