Reputation: 39
I am looking to use a headless browser to scrape some websites and need to use a proxy server.
I'm a bit lost and am looking for help.
When I disable the proxy it works perfectly every time.
When I disable headless mode, I get an empty browser window, but if I click into the URL bar (which already contains "https://www.whatsmyip.org") and press Enter, the page loads through the proxy and shows a different IP.
I get the same result for other websites as well; it's not just whatsmyip.org.
I am running CentOS 7, Python 3.6, and Selenium 3.14.0.
I have also tried it on a Windows machine running Anaconda and have the same results.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy, ProxyType
my_proxy = "x.x.x.x:xxxx" #I have a real proxy address here
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': my_proxy,
    'ftpProxy': my_proxy,
    'sslProxy': my_proxy,
    'noProxy': ''
})
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--allow-insecure-localhost')
chrome_options.add_argument('--allow-running-insecure-content')
chrome_options.add_argument("--ignore-ssl-errors");
chrome_options.add_argument("--ignore-certificate-errors");
chrome_options.add_argument("--ssl-protocol=any");
chrome_options.add_argument('--window-size=800x600')
chrome_options.add_argument('--disable-application-cache')
capabilities = dict(DesiredCapabilities.CHROME)
proxy.add_to_capabilities(capabilities)
capabilities['acceptSslCerts'] = True
capabilities['acceptInsecureCerts'] = True
browser = webdriver.Chrome(executable_path=r'/home/glen/chromedriver', chrome_options=chrome_options, desired_capabilities=capabilities)
browser.get('https://www.whatsmyip.org/')
print(browser.page_source)
browser.close()
When I run the code, the following is returned:
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>
Not the website.
Upvotes: 2
Views: 2628
Reputation: 328750
There are two problems here: the page has not finished loading when you read it, and browser.page_source doesn't return what you want.
The first problem is solved by waiting for an element to appear in the DOM. Usually you want to scrape something specific, so you already know how to identify that element; add code that waits until it exists, as in the sketch below.
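For example, a minimal explicit wait might look like this (a sketch; the By.TAG_NAME, 'table' locator is only a placeholder for whatever identifies the element you want to scrape):
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# Wait up to 10 seconds for the element to appear in the DOM;
# the locator below is a placeholder, swap in your own.
wait = WebDriverWait(browser, 10)
element = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'table')))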
The second problem is that page_source doesn't return the current DOM but the initial HTML that the browser loaded. If JavaScript has modified the page since then, you won't see those changes this way.
The solution is to locate the html element and ask for its outerHTML property:
from selenium.webdriver.common.by import By
# "driver" is your WebDriver instance ("browser" in the question's code)
html_element = driver.find_element(By.TAG_NAME, "html")
dom = html_element.get_attribute("outerHTML")
print(dom)
For details, see the examples at: https://www.seleniumhq.org/docs/03_webdriver.jsp#introducing-the-selenium-webdriver-api-by-example
Upvotes: 3
Reputation: 1251
For anyone who still hasn't solved the problem, check this out (Python):
options.add_argument("--disable-blink-features=AutomationControlled")
Some sites detect automation software and deliberately prevent the content from loading properly.
Source: ChromeDriver with Selenium displays a blank page
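Applied to the question's setup, a minimal sketch might look like this (assumes chromedriver is on your PATH; the question's proxy settings and other options are omitted for brevity):
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
# Hide the automation fingerprint that some sites check for
chrome_options.add_argument('--disable-blink-features=AutomationControlled')
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get('https://www.whatsmyip.org/')
print(browser.page_source)
browser.quit()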
Upvotes: 0