Reputation: 999
I'm looking to find a `ul` tag from a page using:
from selenium import webdriver
from bs4 import BeautifulSoup as bs

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(<chromedriver path>, chrome_options=options)
driver.get("https://www.atptour.com/en/rankings/singles")
html = driver.page_source
soup = bs(html, "html.parser")
dropdown = soup.find("ul", class_="dropdown")
dropdown ends up being None because not all of the HTML is scraped. However, if I remove the headless option, all the HTML is scraped and I get a result for dropdown.
Why does this happen and is there some way of running in headless and still scraping all the HTML?
Thanks in advance.
Upvotes: 1
Views: 618
Reputation: 9969
Headless Chrome sends a different user agent, which triggered the site's bot detection; setting the user agent as below fixed the error. Adding a window size also makes finding elements a little more reliable.
options.add_argument("--window-size=1920,1080")
agent="Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1866.237 Safari/537.36"
options.add_argument(f'user-agent={agent}')
Upvotes: 1
Reputation: 8676
If you write that html (returned by headless mode) to a file and open it in a browser, you will see a CAPTCHA challenge page.
So your interaction is blocked by a CAPTCHA. Why is headless Chrome blocked while regular Chrome is not? No idea; that is just how their identification algorithms work.
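A quick way to dump the page for inspection, assuming `html` came from `driver.page_source` as in the question (a placeholder string is used here so the snippet runs standalone):

```python
# In practice this would come from: html = driver.page_source
html = "<html><body><h1>Example page</h1></body></html>"  # placeholder

# Write the HTML to a file; open page.html in a browser to see
# exactly what headless Chrome received (e.g. a CAPTCHA challenge).
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)
```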
Upvotes: 1