Reputation: 888
So I have a problem that I have been noticing with selenium when I run it headless where some pages don't totally load/render some elements. I don't exactly know what's happening not to load 100%; maybe JS
not running?
My code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from decouple import config
from time import sleep
DEBUG = config('DEBUG')
class DiscordME(object):
def __init__(self):
self.LINUX = config('LINUX', cast=bool)
self.DRIVER_VERSION = config('DRIVER_VERSION')
self.HEADLESS = True
options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-extensions')
options.add_argument('--disable-dev-shm-usage')
if self.HEADLESS:
options.add_argument('--headless')
options.add_argument('--window-size=1920,1200')
if self.LINUX:
self.browser = webdriver.Chrome(executable_path=f'./drivers/chromedriver-{self.DRIVER_VERSION}', options=options)
else:
self.browser = webdriver.Chrome(executable_path=f'.\drivers\chromedriver-{self.DRIVER_VERSION}.exe', options=options)
def get_website(self):
self.browser.get('https://discord.me/login')
WebDriverWait(self.browser, 10).until(
EC.url_changes('https://discord.me/login')
)
print(self.browser.current_url)
print(self.browser.page_source)
#print(self.browser.find_element_by_xpath('//*[@id="app-mount"]/div[2]/div/div[2]/div/div/form/div/div/div[1]/div[3]/div[1]/div/div[2]/input'))
DiscordME().get_website()
In this script, it doesn't load the login inputs when it accesses the discord API login page.
As I can see in the page_source
I noticed that the page is not being mounted so that could be the problem.
Upvotes: 8
Views: 9820
Reputation: 1559
Another thing to to consider if you are having trouble loading a website with selenium is the processing power.
I was using a Micro AWS instance
with a single CPU which worked for many websites, but when I came to a more complex one it kept intermittently getting 0 elements
when conducting a search like find_elements_by_xpath('//a[@href]')
while sometimes it would work successfully and find the hyperlinks. I upgraded the instance to one with more CPUs (4, but 2 would probably have been sufficient) and that allowed me to fully load the site and scrape the elements.
I would definitely try the other two solutions posted here first (chrome options or firefox browser), but processing power could be the problem as well.
Upvotes: 2
Reputation: 1956
I Just would like to share my experience on this as solving the issue consumed much of my time trying many options and settings for Chrome webdriver.
The user-agent
setting solved the problem for some websites I scraped. but, for some other websites the only solution worked with me was to use FireFox webdriver instead of Chrome as per following :
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
fireFoxOptions = Options()
fireFoxOptions.add_argument("--headless")
fireFoxOptions.add_argument("--window-size=1920,1080")
fireFoxOptions.add_argument('--start-maximized')
fireFoxOptions.add_argument('--disable-gpu')
fireFoxOptions.add_argument('--no-sandbox')
driver = webdriver.Firefox(options=fireFoxOptions,
executable_path=r'C:\[your path to firefox webdriver exe file]\geckodriver.exe')
driver.get('https://discord.me/login')
Use the link here to download latest geckodriver for FireFox, and make sure FireFox browser is already installed in you machine.
Upvotes: 1
Reputation: 19949
from selenium import webdriver
from time import sleep
options = webdriver.ChromeOptions()
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless")
options.add_argument("--disable-gpu")
options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
browser = webdriver.Chrome(options=options)
some websites uses user-agent to detect whether the browser is in headless mode or not as headless browser uses a different user-agent than normal browser. So explicitly set user agent.
Upvotes: 24