Glen

Reputation: 39

Headless Chrome returning empty HTML when using a Proxy

I am looking to use a headless browser to scrape some websites and need to use a proxy server.

I'm a bit lost and am looking for help.

When I disable the proxy it works perfectly every time.

When I disable headless mode, I get an empty browser window, but if I press Enter in the URL bar (which already contains "https://www.whatsmyip.org") the page loads through the proxy server, showing a different IP.

I get the same result with other websites as well; it's not just whatsmyip.org.

I am running CentOS 7, Python 3.6 and Selenium 3.14.0.

I have also tried it on a Windows machine running Anaconda and have the same results.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy, ProxyType

my_proxy = "x.x.x.x:xxxx" #I have a real proxy address here
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': my_proxy,
    'ftpProxy': my_proxy,
    'sslProxy': my_proxy,
    'noProxy': ''
})

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--allow-insecure-localhost')
chrome_options.add_argument('--allow-running-insecure-content')
chrome_options.add_argument("--ignore-ssl-errors");
chrome_options.add_argument("--ignore-certificate-errors");
chrome_options.add_argument("--ssl-protocol=any");        
chrome_options.add_argument('--window-size=800x600')
chrome_options.add_argument('--disable-application-cache')

capabilities = dict(DesiredCapabilities.CHROME)
proxy.add_to_capabilities(capabilities)
capabilities['acceptSslCerts'] = True
capabilities['acceptInsecureCerts'] = True

browser = webdriver.Chrome(executable_path=r'/home/glen/chromedriver', chrome_options=chrome_options, desired_capabilities=capabilities)

browser.get('https://www.whatsmyip.org/')

print(browser.page_source)     

browser.close()

When I run the code I get the following returned:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

Not the website.

Upvotes: 2

Views: 2628

Answers (2)

Aaron Digulla

Reputation: 328750

There are two problems here:

  1. You need to wait for the browser to load the website.
  2. browser.page_source doesn't return what you want.

The first problem is solved by waiting for an element to appear in the DOM. Usually, you will want to scrape something, so you know how to identify the element. Add code to wait until that element exists.
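For example, a minimal sketch using an explicit wait; the id "ip" is a hypothetical locator, so substitute whatever identifies the element you actually want to scrape:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 30 seconds for the target element to be present in the DOM.
wait = WebDriverWait(browser, 30)
element = wait.until(EC.presence_of_element_located((By.ID, "ip")))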

The second problem is that page_source doesn't return the current DOM but the initial HTML that the browser loaded. If JavaScript has modified the page since then, you won't see those changes this way.

The solution is to locate the html element and ask for the outerHTML property:

from selenium.webdriver.common.by import By

# Serialize the live DOM instead of relying on page_source
htmlElement = driver.find_element(By.TAG_NAME, "html")
dom = htmlElement.get_attribute("outerHTML")
print(dom)

For details, see the examples at: https://www.seleniumhq.org/docs/03_webdriver.jsp#introducing-the-selenium-webdriver-api-by-example

Upvotes: 3

lenord

Reputation: 1251

For anyone who still hasn't solved the problem, check this out (Python):

options.add_argument("--disable-blink-features=AutomationControlled")

Some sites can detect automation software and deliberately prevent the content from loading properly.
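For context, a minimal sketch of where the flag fits (this uses the Selenium 4 style options= keyword; with Selenium 3.x pass chrome_options= instead):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
# Keeps Chrome from advertising itself as automated (e.g. navigator.webdriver)
options.add_argument("--disable-blink-features=AutomationControlled")

browser = webdriver.Chrome(options=options)
browser.get("https://www.whatsmyip.org/")
print(browser.page_source)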

Source: ChromeDriver with Selenium displays a blank page

Upvotes: 0
