simplepineapple

Reputation: 469

Selenium headless: How to bypass Cloudflare detection using Selenium

Hoping an expert can help me with a Selenium/Cloudflare mystery. I can get a website to load in normal (non-headless) Selenium, but no matter what I try, I can't get it to load in headless.

I have followed the suggestions from Stack Overflow posts such as "Is there a version of Selenium WebDriver that is not detectable?". I've also compared all the properties of the window and window.navigator objects and fixed all the diffs between headless and non-headless, but somehow headless is still being detected. At this point I am extremely curious how Cloudflare could possibly tell the difference. Thank you for your time!
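For anyone repeating the property comparison, here is a minimal sketch of the diffing step. It assumes the window/navigator properties of each mode have already been dumped into plain dictionaries (for example from the return value of browser.execute_script); the function name diff_properties and the sample keys are hypothetical and only for illustration.

```python
def diff_properties(headed, headless):
    """Return {key: (headed_value, headless_value)} for every key that differs."""
    missing = object()  # sentinel, so an absent key differs from a key set to None
    diffs = {}
    for key in sorted(set(headed) | set(headless)):
        a = headed.get(key, missing)
        b = headless.get(key, missing)
        if a != b:
            diffs[key] = (
                "<missing>" if a is missing else a,
                "<missing>" if b is missing else b,
            )
    return diffs
```

Every key that shows up in the result is a candidate for patching via Page.addScriptToEvaluateOnNewDocument. An empty result only means the dumped properties match, not that the two modes are indistinguishable.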

List of the things I have tried:

Replicating the experiment

In order to get the website to load in normal (non-headless) Selenium, you have to follow a _blank link from another website (so that the target website opens in another tab). To replicate the experiment, first create an html file with the content <a href="https://poocoin.app" target="_blank">link</a>, and then paste the path to this html file in the following code.

The version below (non-headless) runs fine and loads the website, but if you set options.headless = True, it will get stuck on Cloudflare.

from selenium import webdriver
import time

# Replace this with the path to your html file
FULL_PATH_TO_HTML_FILE = 'file:///Users/simplepineapple/html/url_page.html'

def visit_website(browser):
    browser.get(FULL_PATH_TO_HTML_FILE)
    time.sleep(3)

    links = browser.find_elements_by_xpath("//a[@href]")
    links[0].click()
    time.sleep(10)

    # Switch webdriver focus to new tab so that we can extract html
    tab_names = browser.window_handles
    if len(tab_names) > 1:
        browser.switch_to.window(tab_names[1])

    time.sleep(1)
    html = browser.page_source
    print(html)
    print()
    print()

    if 'Charts' in html:
        print('Success')
    else:
        print('Fail')

    time.sleep(10)


options = webdriver.ChromeOptions()
# If options.headless = True, the website will not load
options.headless = False
options.add_argument("--window-size=1920,1080")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36')

browser = webdriver.Chrome(options = options)

browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    "source": '''
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    });
    Object.defineProperty(navigator, 'plugins', {
            get: function() { return {"0":{"0":{}},"1":{"0":{}},"2":{"0":{},"1":{}}}; }
    });
    Object.defineProperty(navigator, 'languages', {
        get: () => ["en-US", "en"]
    });
    Object.defineProperty(navigator, 'mimeTypes', {
        get: function() { return {"0":{},"1":{},"2":{},"3":{}}; }
    });

    window.screenY=23;
    window.screenTop=23;
    window.outerWidth=1337;
    window.outerHeight=825;
    window.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    window.navigator.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    ['height', 'width'].forEach(property => {
        const imageDescriptor = Object.getOwnPropertyDescriptor(HTMLImageElement.prototype, property);

        // redefine the property with a patched descriptor
        Object.defineProperty(HTMLImageElement.prototype, property, {
            ...imageDescriptor,
            get: function() {
                // return an arbitrary non-zero dimension if the image failed to load
                if (this.complete && this.naturalHeight == 0) {
                    return 20;
                }
                return imageDescriptor.get.apply(this);
            },
        });
    });

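    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
        // 37445 = UNMASKED_VENDOR_WEBGL, 37446 = UNMASKED_RENDERER_WEBGL
        if (parameter === 37445) {
            return 'Intel Open Source Technology Center';
        }
        if (parameter === 37446) {
            return 'Mesa DRI Intel(R) Ivybridge Mobile ';
        }

        // Call the original with the correct receiver to avoid "Illegal invocation"
        return getParameter.call(this, parameter);
    };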

    const elementDescriptor = Object.getOwnPropertyDescriptor(HTMLElement.prototype, 'offsetHeight');

    Object.defineProperty(HTMLDivElement.prototype, 'offsetHeight', {
        ...elementDescriptor,
        get: function() {
            if (this.id === 'modernizr') {
                return 1;
            }
            return elementDescriptor.get.apply(this);
        },
    });
    '''
})

visit_website(browser)

browser.quit()

Upvotes: 46

Views: 104568

Answers (10)

qfcy

Reputation: 145

Since Object.defineProperty(navigator, 'webdriver', {get: () => undefined}); hides the window.navigator.webdriver attribute, this can be resolved simply by re-applying it every second from a daemon thread:

import json
import time
from warnings import warn

def daemon(driver, cookie_path):
    # cookie_to_json is a helper from the answerer's own code and is not shown here
    while True:
        try:
            time.sleep(1)
            try:
                # Stop once every browser window has been closed
                if not driver.window_handles:
                    break
            except Exception:
                break
            if cookie_path is not None:
                cookies = driver.get_cookies()  # Automatically save cookies at regular intervals
                with open(cookie_path, "w", encoding="utf-8") as file:
                    json.dump(cookie_to_json(cookies), file)
            driver.execute_script(
                "try{Object.defineProperty(navigator, 'webdriver', {get: () => undefined});}catch(e){}"
            )
        except Exception as err:
            # The thread will exit only when the main thread exits
            warn("Failed (%s): %s" % (type(err).__name__, str(err)))
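The daemon above calls cookie_to_json without showing it. The following is only a guess at what such a helper might look like, not the answerer's actual code: it assumes the cookies are the list of dictionaries returned by driver.get_cookies(), and the names cookies_to_json, save_cookies and load_cookies are hypothetical.

```python
import json

# Fields of a Selenium cookie dictionary that are plain JSON values
ALLOWED_FIELDS = ("name", "value", "domain", "path", "expiry",
                  "secure", "httpOnly", "sameSite")


def cookies_to_json(cookies):
    """Keep only the known, JSON-serializable fields of each cookie."""
    return [{k: c[k] for k in ALLOWED_FIELDS if k in c} for c in cookies]


def save_cookies(cookies, path):
    """Write the filtered cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(cookies_to_json(cookies), file)


def load_cookies(path):
    """Read cookies back as a list of dictionaries."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)
```

Each dictionary returned by load_cookies could then be passed to driver.add_cookie() after navigating to the matching domain.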

Upvotes: 0

Dineth Oshitha

Reputation: 11

The site can be accessed using the UC mode of SeleniumBase. Just replace the URL.

import time
import traceback

from seleniumbase import Driver

url = "https://poocoin.app"  # replace with the URL you want to open

try:
    driver = Driver(uc=True)
    driver.uc_open_with_reconnect(url, 4)
    driver.uc_gui_click_captcha()
    time.sleep(10)
    driver.uc_gui_click_captcha()
except Exception:
    traceback.print_exc()

time.sleep(10)

Upvotes: 1

Mahmoud Magdy

Reputation: 941

So, based on @undetected Selenium's answer, to get the best stealth, use both undetected-chromedriver and selenium-stealth:

    import undetected_chromedriver as uc
    from selenium import webdriver
    from selenium_stealth import stealth
    
    options = webdriver.ChromeOptions() 
    options.headless = True
    options.add_argument("start-maximized")
    
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = uc.Chrome(options=options, executable_path=r"C:\path\to\chromedriver.exe")
    
    stealth(driver,
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
            )
    
    driver.get("https://logical.com/")

Upvotes: 0

Muhammad Mobeen

Reputation: 143

I have mixed both the libraries undetected-chromedriver and selenium-stealth, which has solved my problem. It is no longer detectable by Cloudflare Challenge.

Following is a function that I am using to generate a driver for me:

import undetected_chromedriver as uc
from selenium_stealth import stealth

def gen_driver(self):
    try:
        user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.140 Safari/537.36"
        chrome_options = uc.ChromeOptions()
        chrome_options.add_argument('--headless=new')
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_argument("user-agent={}".format(user_agent))
        driver = uc.Chrome(options=chrome_options)
        stealth(driver,
                languages=["en-US", "en"],
                vendor="Google Inc.",
                platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine",
                fix_hairline=True
        )
        return driver
    except Exception as e:
        print("Error in Driver: ",e)

In the selenium-stealth documentation it was recommended to add the following options too:

chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

However, these options do not work with undetected-chromedriver so I removed them. Everything else is the same.

Not sure why this works but my guess is that selenium-stealth adds some render information that bypasses Cloudflare.

Upvotes: 4

puchu

Reputation: 3662

You need to read the source code of the latest Chromium. It removes a large amount of functionality in headless mode. What are the Cloudflare developers doing? They find the places where headless mode is used and look for differences in behaviour between headless and non-headless objects. There are many workarounds in Chromium today that make detecting its internal headless mode an easy task.

Meanwhile, I can't understand why people use Chromium's internal headless mode. You can just run a regular browser under headless Wayland or headless X11 and forget about this problem. That lets you concentrate on more important things.
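One way to get a headless X11 session is to run the unmodified, non-headless Selenium script under Xvfb. The sketch below only builds the command line; it assumes a Linux machine with the xvfb-run wrapper installed, and the function name build_xvfb_command and the script name in the usage note are hypothetical.

```python
import sys


def build_xvfb_command(script, width=1920, height=1080, depth=24,
                       python=sys.executable):
    """Build an argv list that runs `script` inside a virtual X11 display."""
    # -screen 0 WxHxD sets the size and colour depth of the virtual screen
    screen = f"-screen 0 {width}x{height}x{depth}"
    return [
        "xvfb-run",
        "--auto-servernum",            # pick a free display number
        f"--server-args={screen}",     # arguments forwarded to the Xvfb server
        python,
        script,
    ]
```

The resulting list can be handed to subprocess.run(build_xvfb_command("scrape.py"), check=True). The script itself keeps options.headless = False, so Chrome renders normally into the virtual display.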

Upvotes: 3

Vano Varderesyan

Reputation: 1

pip install undetected-chromedriver

You can use this module

Upvotes: -2

Nikita

Reputation: 37

The only thing I can suggest in addition is to improve your plugins and mime types for navigator, since detection scripts can sometimes check the type of a property, e.g. that navigator.plugins is a PluginArray:

Object.defineProperty(navigator, 'plugins', {
    get: () => {
        var ChromiumPDFPlugin = {};
        var plugin = {
            ChromiumPDFPlugin,
            description: 'Portable Document Format',
            filename: 'internal-pdf-viewer',
            length: 1,
            name: 'Chromium PDF Plugin',

        };
        plugin.__proto__ = Plugin.prototype;

        var plugins = {
            0: plugin,
            length: 1
        };
        plugins.__proto__ = PluginArray.prototype;
        return plugins;
    },
});

Object.defineProperty(navigator, 'mimeTypes', {
    get: () => {
        var mimeType = {
            type: 'application/pdf',
            suffixes: 'pdf',
            description: 'Portable Document Format',
            enabledPlugin: Plugin

        };
        mimeType.__proto__ = MimeType.prototype;

        var mimeTypes = {
            0: mimeType,
            length: 1
        };
        mimeTypes.__proto__ = MimeTypeArray.prototype;
        return mimeTypes;
    },
});

A good website for checking what is going wrong in headless mode is https://bot.sannysoft.com/

You can run in headless mode and take a page snapshot to check whether everything passed.

P.S. Also, sometimes, even if navigator.webdriver is set to undefined, navigator still contains the webdriver prop. You can simply remove it using the code below:

const newProto = navigator.__proto__;
delete newProto.webdriver;
navigator.__proto__ = newProto;

Upvotes: 0

Den Pat

Reputation: 1284

@undetected Selenium's answer works perfectly with https://github.com/diprajpatra/selenium-stealth

If you are using the latest version of Selenium, you will need to replace the executable_path parameter, as it is deprecated. Example code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
)

driver.get("https://bot.sannysoft.com/")

print(driver.find_element(By.XPATH, "/html/body").text)

driver.close()

Upvotes: 1

undetected Selenium

Reputation: 193298

Using the latest Google Chrome v96.0, if you retrieve the user agent:

  • For the regular (non-headless) browser the following is in use:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36

  • Whereas for the headless browser the following is in use:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/96.0.4664.110 Safari/537.36


In the majority of cases the presence of the additional Headless string in the user agent is detected as a bot, and access to the website is blocked.
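The difference between the two user agent strings can be checked and normalised with a few lines of Python. This is a minimal sketch with hypothetical helper names; it only addresses the user agent string, and as the question shows, a spoofed user agent alone may not be enough.

```python
HEADLESS_TOKEN = "HeadlessChrome/"


def is_headless_user_agent(user_agent):
    """True if the user agent carries the HeadlessChrome product token."""
    return HEADLESS_TOKEN in user_agent


def strip_headless_marker(user_agent):
    """Replace the HeadlessChrome token with the regular Chrome token."""
    return user_agent.replace(HEADLESS_TOKEN, "Chrome/")
```

The cleaned string could then be passed to Chrome with options.add_argument("user-agent=" + cleaned), in the same way the question sets its user agent.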


Solution

There are different approaches to evade Cloudflare detection even when using Chrome in headless mode, and some of the efficient approaches are as follows:

  • An efficient solution would be to use the undetected-chromedriver to initialize the Chrome Browsing Context. undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

    • Code Block:

      import undetected_chromedriver as uc
      from selenium import webdriver
      
      options = webdriver.ChromeOptions() 
      options.headless = True
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = uc.Chrome(options=options)
      driver.get('https://bet365.com')
      


  • The most efficient solution would be to use Selenium Stealth to initialize the Chrome Browsing Context. selenium-stealth is a python package to prevent detection. This programme tries to make python selenium more stealthy.

    • Code Block:

      from selenium import webdriver
      from selenium_stealth import stealth
      
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_argument("--headless")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r"C:\path\to\chromedriver.exe")
      
      stealth(driver,
              languages=["en-US", "en"],
              vendor="Google Inc.",
              platform="Win32",
              webgl_vendor="Intel Inc.",
              renderer="Intel Iris OpenGL Engine",
              fix_hairline=True,
              )
      
      driver.get("https://bot.sannysoft.com/")
      


Upvotes: 42

Franz Gastring

Reputation: 1130

Cloudflare's IUAM (I'm Under Attack Mode) protection is used primarily to mitigate DDoS attacks, and as a consequence it also protects sites from automated bot exploitation. So no matter what you use on the client side, the Cloudflare server is fingerprinting you. After that, it sends the client the cf_clearance cookie, which allows you to connect for the next 15 minutes.
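Whether a session currently holds a usable clearance can be checked from the cookie list. The sketch below assumes the cookies come from driver.get_cookies(), where each cookie is a dictionary with a name and an optional expiry in seconds since the epoch; the function name has_valid_clearance is hypothetical.

```python
import time


def has_valid_clearance(cookies, now=None):
    """True if a cf_clearance cookie is present and has not expired yet."""
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie.get("name") != "cf_clearance":
            continue
        expiry = cookie.get("expiry")
        # A cookie without an expiry is a session cookie: treat it as still valid
        return expiry is None or expiry > now
    return False
```

A scraper could call this before each request and go through the challenge again only when it returns False.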


Upvotes: -1
