Reputation: 45
I am trying to scrape the car details from this site using Selenium: https://www.autoscout24.ch/de/autos/alle-marken?vehtyp=10
Approximately every 30 pages I have to verify that I am not a robot, even though I have included in my code:
driver.implicitly_wait(20)
Is there any way to overcome this?
Upvotes: 0
Views: 15150
Reputation: 239
You could try using proxies or a headless browser driven by something like Puppeteer, but if you really want to foolproof your script against CAPTCHAs, it might be worth going with a scraping tool that has fingerprint emulation and block-bypassing technology. I've been using this one because it can be easily integrated into existing Puppeteer/Selenium/Playwright scripts and comes with built-in unblocker technology. It automatically handles things like browser fingerprint emulation, header information, and cookie management, which solves a major chunk of the problem by keeping websites from flagging you as a potential bot and generating the CAPTCHA in the first place. It can also solve most CAPTCHAs, such as reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, etc. That pretty much eliminates the need to add third-party libraries or additional proxies to your code.
Upvotes: 2
Reputation: 334
Two options come to mind for solving your issue; which one you choose depends on what you need.
You can make your script wait when the CAPTCHA is detected and play a sound when it is shown, so you can solve the CAPTCHA manually. Once it has been dealt with, the script can continue doing its thing. See: How to handle Captcha in Selenium
You can use a CAPTCHA-solving service. You would need to pay a little, but you would not need to do anything manually. Check the references mentioned in this answer
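The first option (pause, alert, solve by hand, continue) could be sketched roughly like this. It is only an illustration under a few assumptions: the selector `iframe[src*='recaptcha']` is a guess at how the CAPTCHA shows up on the page, the helper names `captcha_present` and `wait_for_manual_solve` are made up for this example, and it assumes the CAPTCHA element disappears from the page once it has been solved. The helpers only call `driver.find_elements(by, value)`, so they should work with a regular Selenium 4 WebDriver.

```python
import time

# Assumed selector: reCAPTCHA v2 is usually rendered inside an iframe whose
# src contains "recaptcha". Adjust this for the site you are scraping.
CAPTCHA_SELECTOR = "iframe[src*='recaptcha']"


def captcha_present(driver):
    """Return True if at least one element matches CAPTCHA_SELECTOR."""
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    return len(driver.find_elements("css selector", CAPTCHA_SELECTOR)) > 0


def wait_for_manual_solve(driver, poll_seconds=2.0, timeout=300.0, sleep=time.sleep):
    """Pause until the CAPTCHA is gone; return False if the timeout elapses."""
    if not captcha_present(driver):
        return True  # nothing to solve, carry on scraping

    # "\a" is the terminal bell; swap in a real sound if your terminal mutes it.
    print("\aCAPTCHA detected, please solve it in the browser window.")

    waited = 0.0
    while waited < timeout:
        sleep(poll_seconds)
        waited += poll_seconds
        if not captcha_present(driver):
            return True  # solved by hand, the script can continue
    return False  # still there after the timeout, let the caller decide
```

In the scraping loop you would call `wait_for_manual_solve(driver)` after each `driver.get(...)` and stop if it returns `False`. Note that with `driver.implicitly_wait(20)` set, every check that finds no CAPTCHA can block for up to 20 seconds, so you may want a shorter implicit wait while polling.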
Upvotes: 4
Reputation: 193058
The "I'm not a robot" checkbox, commonly known as reCAPTCHA v2, is one of the security measures used to implement challenge-response authentication. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) mainly helps protect applications and systems from spam and password decryption by asking the user to complete a simple test that proves it is a human, and not a computer, trying to access a password-protected account. In short, CAPTCHA is implemented to help prevent unauthorized account entry.
So neither of the wait mechanisms, implicit wait or explicit wait, would be of any help in avoiding the CAPTCHA.
An ideal approach would be to disable the CAPTCHA for the AUT (Application Under Test) within the testing/staging environment and enable it only in the production environment.
You can find a couple of relevant detailed discussions in:
Upvotes: 1
Reputation: 4264
CAPTCHA is meant for those reasons. There is no correlation between using waits in a Selenium script and the CAPTCHA being removed. The purpose of a CAPTCHA is to detect bots/automated systems and keep them from crawling the web page.
Unless you can disable it, I don't think automating it is the right approach. You may find some tutorials on the web on how to overcome it, but they are very patchy and do not cover all use cases.
Upvotes: 3