Reputation: 1
#!/usr/bin/env python3
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import json
from pathlib import Path
def scrape_urlhaus():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, slow_mo=1200)
        context = browser.new_context(
            viewport={'width': 1366, 'height': 768},
        )
        context.add_init_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
        context.add_cookies(json.loads(Path("urlhaus_cookies.json").read_text()))
        page = context.new_page()
        page.goto("https://urlhaus.abuse.ch/")
        page.screenshot(path="debug_screenshot.png")
        page.wait_for_selector('a.nav-link[href="/browse/"]')
        page.click('a.nav-link[href="/browse/"]')
        page.wait_for_selector('table.table.table-sm.table-hover.table-bordered')
        content = page.content()
        browser.close()
        return content
I am writing a program to scrape the URLHaus browse page. The script does not work when I set headless=True and I get the error, but the error is not there when I use headless=False, where I get the normal page view.
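One way to debug this is to save page.content() from both modes and compare what actually came back. A minimal heuristic sketch for classifying the saved HTML, assuming the block page contains typical denial markers and that the real browse page contains a table (the marker strings here are assumptions, not URLHaus's actual markup):

```python
def detect_block_page(html: str) -> bool:
    """Heuristic guess: is this HTML a bot-block page rather than the browse table?"""
    markers = ("access denied", "captcha", "are you a robot")
    lowered = html.lower()
    # Flag if a known block marker appears, or the expected table is missing entirely.
    return any(m in lowered for m in markers) or "<table" not in lowered
```

Running this on the content captured under headless=True versus headless=False should show which run is being served the block page.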
Upvotes: -1
Views: 61
Reputation: 848
Since it apparently wasn't obvious enough: being in headless mode triggers their bot detection, and the client is therefore blocked.
How exactly this is done, and how it could be bypassed, would require insight into their website code, which they are unlikely to share. As usual, there is an arms race between people who want to automate and people who don't want bots on their site, but as far as Puppeteer's headless mode is concerned, this battle is lost, since it is too easy to detect.
Upvotes: 0