Jossy

Reputation: 999

Why isn't all HTML scraped when in headless mode?

I'm looking to find a ul tag from a page using:

from selenium import webdriver
from bs4 import BeautifulSoup as bs

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(<chromedriver path>, chrome_options=options)
driver.get("https://www.atptour.com/en/rankings/singles")
html = driver.page_source
soup = bs(html, "html.parser")
dropdown = soup.find("ul", class_="dropdown")

dropdown ends up being None because not all the HTML is scraped.
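For context, `find` returns `None` whenever the tag is simply absent from the HTML it was handed, so the symptom looks like this on stripped-down, hypothetical markup (the two HTML strings below are illustrative stand-ins, not the real page source):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins: what a fully rendered page vs a blocked/partial
# response might contain.
full_html = '<html><body><ul class="dropdown"><li>Singles</li></ul></body></html>'
blocked_html = "<html><body>Please verify you are human</body></html>"

full = BeautifulSoup(full_html, "html.parser").find("ul", class_="dropdown")
blocked = BeautifulSoup(blocked_html, "html.parser").find("ul", class_="dropdown")

print(full is not None)   # the ul exists in the full page, so find() returns a Tag
print(blocked is None)    # the ul is missing from the blocked page, so find() returns None
```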

However, if I remove the headless option then all the HTML is scraped and I have a result for dropdown.

Why does this happen and is there some way of running in headless and still scraping all the HTML?

Thanks in advance.

Upvotes: 1

Views: 618

Answers (2)

Arundeep Chohan

Reputation: 9969

Headless Chrome runs with a different user agent, which trips the site's bot detection; setting it explicitly with the options below fixed the error. Setting a window size also makes finding elements a little safer.

options.add_argument("--window-size=1920,1080")
agent="Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1866.237 Safari/537.36"
options.add_argument(f'user-agent={agent}')
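To illustrate why the override matters: headless Chrome's default user agent advertises itself as `HeadlessChrome`, which a server-side check can flag trivially. The UA strings and the `looks_headless` helper below are illustrative examples, not code from any real detection library:

```python
# Example user-agent strings (version numbers vary by Chrome release).
headless_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) HeadlessChrome/90.0.4430.212 Safari/537.36")
normal_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36")

# A trivial check of the kind a site might run server-side (hypothetical).
def looks_headless(ua: str) -> bool:
    return "HeadlessChrome" in ua

print(looks_headless(headless_ua))  # True  -> flagged, may be served a CAPTCHA
print(looks_headless(normal_ua))    # False -> served the full page
```

Overriding `user-agent` makes the headless browser send the second string instead of the first.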

Upvotes: 1

Alexey R.

Reputation: 8676

If you dump that html (returned in headless mode) to a file and open it in a browser, you will see:

[screenshot: the rendered page shows a CAPTCHA challenge]

So your interaction is blocked by a CAPTCHA. Why is headless Chrome blocked while regular mode is not? No idea; that is just how their bot-identification algorithms work.
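To do the same inspection yourself, write `driver.page_source` to a file and open it in a browser. A minimal sketch, using a stand-in string in place of the live `driver.page_source` and a hypothetical `page_dump.html` filename:

```python
# Stand-in for driver.page_source; in practice use the real value.
html = "<html><body>Please complete the CAPTCHA</body></html>"

# Dump the HTML so it can be opened in a browser for inspection.
with open("page_dump.html", "w", encoding="utf-8") as f:
    f.write(html)
```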

Upvotes: 1
