pippen

Reputation: 1

Playwright getting different results if in headless mode (Error 405 banned)

#!/usr/bin/env python3

import json
from pathlib import Path

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def scrape_urlhaus():
    with sync_playwright() as p:
        # headless=True is the mode that returns the 405; headless=False loads normally
        browser = p.chromium.launch(headless=True, slow_mo=1200)
        context = browser.new_context(
            viewport={'width': 1366, 'height': 768},
        )
        # Hide the navigator.webdriver flag from page scripts
        context.add_init_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
        # Load previously saved cookies from urlhaus_cookies.json
        context.add_cookies(json.loads(Path("urlhaus_cookies.json").read_text()))
        page = context.new_page()
        page.goto("https://urlhaus.abuse.ch/")
        page.screenshot(path="debug_screenshot.png")
        page.wait_for_selector('a.nav-link[href="/browse/"]')
        page.click('a.nav-link[href="/browse/"]')
        page.wait_for_selector('table.table.table-sm.table-hover.table-bordered')
        content = page.content()
        browser.close()
    return content


if __name__ == "__main__":
    soup = BeautifulSoup(scrape_urlhaus(), "html.parser")

I am writing a program to scrape the URLHaus browse page. However, the script does not work when I set headless=True: I get an Error 405 (banned) page. With headless=False the error does not appear and I get the normal page view.

Upvotes: -1

Views: 61

Answers (1)

Hobbamok

Reputation: 848

Since it apparently wasn't obvious enough: being in headless mode triggers their bot detection, which then blocks the client.

How exactly this is done, and how it could be bypassed, would require insight into their website code, which they are unlikely to share. As usual, there is an arms race between people who want to automate and people who don't want bots on their site, but for Playwright's headless=True this battle is lost, since headless mode is too easy to detect.
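
To see one of the easier signals for yourself, here is a minimal sketch that loads the page once in each mode and prints the HTTP status next to the user agent the page sees. It assumes that headless Chromium advertises a "HeadlessChrome" token in its default user agent, which is only one possible signal and may not be what this site actually checks. The helper names looks_headless and probe are made up for this illustration.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the UA string carries the 'HeadlessChrome' token."""
    return "HeadlessChrome" in user_agent


def probe(url: str, headless: bool) -> dict:
    """Load url once and report the HTTP status and the user agent seen by the page."""
    # Imported lazily so looks_headless() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        try:
            page = browser.new_page()
            response = page.goto(url)  # may be None for some navigations
            user_agent = page.evaluate("() => navigator.userAgent")
            return {
                "headless": headless,
                "status": response.status if response else None,
                "user_agent": user_agent,
                "looks_headless": looks_headless(user_agent),
            }
        finally:
            browser.close()


if __name__ == "__main__":
    # Compare both modes against the URL from the question.
    for mode in (True, False):
        print(probe("https://urlhaus.abuse.ch/", headless=mode))
```

If the two runs differ in both status and user agent, that is consistent with the headless run being flagged, but it does not prove which check the site uses or that changing the user agent alone would get past it.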

Upvotes: 0
