Anderson Carvalho

Reputation: 31

Headless doesn't work using Playwright and BeautifulSoup 4

This code works:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
from datetime import datetime
import time

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.apple.com/br/shop/product/MV7N2BE/A/airpods-com-estojo-de-recarga")
    html = page.content()
    soup = BeautifulSoup(html,'html.parser')
    valorAppleStore = soup.select("span.as-price-installments")[-2].get_text().replace(" à vista (10% de desconto)", '')
    print(valorAppleStore)
    browser.close()

But if I change it to headless=True, the code raises an error:

Traceback (most recent call last):
  File "c:/Users/ANDERSONCARVALHODELI/Documents/py/AirpodsPW.py", line 19, in <module>
    valorAppleStore = soup.select("span.as-price-installments")[-2].get_text().replace(" à vista (10% de desconto)", '')
IndexError: list index out of range

I worked around this with:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
from datetime import datetime
import time

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.apple.com/br/shop/product/MV7N2BE/A/airpods-com-estojo-de-recarga")
    time.sleep(1)
    browser.close()
    
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.apple.com/br/shop/product/MV7N2BE/A/airpods-com-estojo-de-recarga")
    html = page.content()
    soup = BeautifulSoup(html,'html.parser')
    valorAppleStore = soup.select("span.as-price-installments")[-2].get_text().replace(" à vista (10% de desconto)", '')
    print(valorAppleStore)

But I don't think this is the best approach. How can I fix this while sticking with headless=True, without first opening the browser with headless=False?

When I print(html) before soup=..., I see:

    <!DOCTYPE html><html><head> <title>Page Not Found - Apple</title> <link rel="stylesheet" href="https://www.apple.com/wss/fonts?families=SF+Pro,v1|SF+Pro+Icons,v1"> <link rel="stylesheet" href="https://www.apple.com/v/errors/c/built/styles/main.built.css" type="text/css"> <link rel="stylesheet" href="https://www.apple.com/v/errors/c/built/styles/overview.built.css" type="text/css"> <link rel="stylesheet" href="https://store.storeimages.cdn-apple.com/4982/store.apple.com/shop/rs-external/rel/us/external.css"> <link rel="stylesheet" href="https://store.storeimages.cdn-apple.com/4982/store.apple.com/shop/rs-globalelements/dist/us/globalelements.css"> <style>.more::after{content: "";}a.pointer, a.more, a.block span.more, button.unbutton.more{padding-right: .7em; background-image: url(https://store.storeimages.cdn-apple.com/4982/store.apple.com/shop/rs-web/2/dist/assets/as-legacy/base/link/res/more.svg); background-repeat: no-repeat; background-position: 100% 50%; background-size: 5px 9px; zoom: 1;}.as-globalfooter-directory-column-section-list a{margin-bottom: .8em; display: block}.as-globalfooter-directory-column-section-list a:last-child{margin-bottom: 0;}.as-globalfooter-mini .as-globalfooter-mini-shop a{color: #06c;}.as-globalfooter .as-globalfooter-mini-legal-copyright, .as-footnotes .as-globalfooter-mini-legal-copyright, .as-globalfooter .as-globalfooter-mini-legal-link, .as-footnotes .as-globalfooter-mini-legal-link{top: -3px; position: relative; z-index: 1;}.as-globalfooter .as-globalfooter-directory+.as-globalfooter-mini, .as-footnotes .as-globalfooter-directory+.as-globalfooter-mini{padding-bottom: 26px;}.container{position: relative;}hr{display: inline-block; border: 0px; border-top: 0.1em solid #CCD2D9; width: 100%}</style></head><body class="page-overview"> <nav data-store-api="/shop/bag/status" id="ac-globalnav"> <div class="ac-gn-content"> <ul class="ac-gn-list"> <a href="/" class="ac-gn-link ac-gn-link-apple"> <p class="ac-gn-link-text">Apple</p></a> 
<a href="/us/shop/goto/store" class="ac-gn-link ac-gn-link-store"> <p class="ac-gn-link-text">Store</p></a> <a href="/mac/" class="ac-gn-link ac-gn-link-mac"> <p class="ac-gn-link-text">Mac</p></a> <a href="/ipad/" class="ac-gn-link ac-gn-link-ipad"> <p class="ac-gn-link-text">iPad</p></a> <a href="/iphone/" class="ac-gn-link ac-gn-link-iphone"> <p class="ac-gn-link-text">iPhone</p></a> <a href="/watch/" class="ac-gn-link ac-gn-link-watch"> <p class="ac-gn-link-text">Watch</p></a> <a href="/airpods/" class="ac-gn-link ac-gn-link-airpods"> <p class="ac-gn-link-text">AirPods</p></a> <a href="/tv-home/" class="ac-gn-link ac-gn-link-tvhome"> <p class="ac-gn-link-text">TV &amp; Home</p></a> 
<a href="/services/" class="ac-gn-link ac-gn-link-onlyonapple"> <p class="ac-gn-link-text">Only on Apple</p></a> <a href="/us/shop/goto/buy_accessories" class="ac-gn-link ac-gn-link-accessories"> <p class="ac-gn-link-text">Accessories</p></a> <a href="https://support.apple.com" class="ac-gn-link ac-gn-link-support"> <p class="ac-gn-link-text">Support</p></a> <li class="ac-gn-item ac-gn-item-menu ac-gn-search"> <a id="ac-gn-link-search" class="ac-gn-link ac-gn-link-search" href="/us/search" data-analytics-title="search" data-analytics-intrapage-link="" aria-label="Search apple.com" role="button" aria-haspopup="true"></a> </li><a href="/us/shop/goto/bag" class="ac-gn-link ac-gn-link-bag"> <p class="ac-gn-link-text">Shopping Bag</p></a> </ul> </div></nav> <div id="ac-gn-placeholder"> </div><main id="main" class="main" role="main" data-page-type="overview"> <h1 class="section-headline typography-headline">The page you’re looking for can’t be found.</h1> <aside id="search-wrapper" role="search" data-analytics-region="search" aria-hidden="false"> <form id="searchform-form" class="searchform" action="/us/search" method="get" data-suggestions-url="/search-services/suggestions/"><input id="searchform-input" type="text" class="form-textbox form-textbox-text form-icon-left" aria-labelledby="textbox_label" required="" aria-required="true" data-placeholder-long="Search for Products, Stores, and Help" autocorrect="off" autocapitalize="off" autocomplete="off"><span class="form-label" id="textbox_label" aria-hidden="true">Search apple.com</span> <div id="searchform-submit" class="form-icons-wrapper form-icons-wrapper-left form-icons-focusable" type="submit" aria-label="Submit"><button class="form-icons form-icons-search15"></button></div><div id="searchform-reset" class="button-reset form-icons-wrapper form-icons-focusable" type="reset" disabled="" aria-label="Clear Search"><button class="form-icons form-icons-small form-icons-clearsolid15 form-icon-reset"></button></div></form> 
</aside> <div class="cta-sitemap"> <div class="cta-sitemap"> <a href="/sitemap/" class="more" style="top: bottom">Or see our site map</a> </div></div></main> <footer class="as-globalfooter as-globalfooter-contained"> <div class="as-globalfooter-content"> <div class="as-globalfooter-breadcrumbs"> <a href="/" class="as-globalfooter-breadcrumbs-home"> <p class="as-globalfooter-breadcrumbs-home-icon"></p><p class="as-globalfooter-breadcrumbs-home-label">Apple</p></a> <div class="as-globalfooter-breadcrumbs-path"> <ol class="as-globalfooter-breadcrumbs-list"> <li class="as-globalfooter-breadcrumbs-item breadcrumbs-title"> Page Not Found</li></ol> </div></div><nav class="as-globalfooter-directory with-5-columns"> <div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Shop and Learn</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/us/shop/goto/store">Store</a> <a href="/mac/">Mac</a> <a href="/ipad/">iPad</a> <a href="/iphone/">iPhone</a> <a href="/watch/">Watch</a> <a href="/airpods/">AirPods</a> <a href="/tv-home/">TV &amp; Home</a> <a href="/ipod-touch/">iPod touch</a> <a href="/airtag/">AirTag</a> <a href="/us/shop/goto/buy_accessories">Accessories</a> <a href="/us/shop/goto/giftcards">Gift Cards</a> </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Services</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/apple-music/">Apple Music</a> <a href="/apple-tv-plus/">Apple TV+</a> <a href="/apple-fitness-plus/">Apple Fitness+</a> <a href="/apple-news/">Apple News+</a> <a href="/apple-arcade/">Apple Arcade</a> <a href="/icloud/">iCloud</a> <a href="/apple-one/">Apple One</a> <a href="/apple-card/">Apple Card</a> <a href="/apple-books/">Apple Books</a> <a href="/apple-podcasts/">Apple 
Podcasts</a> <a href="/app-store/">App Store</a> </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Account</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="https://appleid.apple.com/us/">Manage Your Apple ID</a> <a href="/us/shop/goto/account">Apple Store Account</a> <a href="https://www.icloud.com">iCloud.com</a> </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Apple Store</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/retail/">Find a Store</a> <a href="/retail/geniusbar/">Genius Bar</a> <a href="/today/">Today at Apple</a> <a href="/today/camp/">Apple Camp</a> <a href="https://itunes.apple.com/app/apple-store/id375380948">Apple Store App</a> <a href="/us/shop/goto/special_deals">Refurbished and Clearance</a> <a href="/us/shop/goto/payment_plan">Financing</a> <a href="/us/shop/goto/trade_in">Apple Trade In</a> <a href="/us/shop/goto/order/list">Order Status</a> <a href="/us/shop/goto/help">Shopping Help</a> </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">For Business</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/business/">Apple and Business</a> <a href="/retail/business/">Shop for Business</a> </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">For Education</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/education/">Apple and Education</a> <a href="/education/k12/how-to-buy/">Shop for K-12</a> <a href="/us/shop/goto/educationrouting">Shop for College</a> </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 
class="as-globalfooter-directory-column-section-title">For Healthcare</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/healthcare/">Apple in Healthcare</a> <a href="/healthcare/apple-watch/">Health on Apple Watch</a> <a href="/healthcare/health-records/">Health Records on iPhone</a> </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">For Government</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/r/store/government/">Shop for Government</a> <a href="/us/shop/goto/eppstore/veteransandmilitary">Shop for Veterans and Military</a> </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Apple Values</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/accessibility/">Accessibility</a> <a href="/education/connectED/">Education</a> <a href="/environment/">Environment</a> <a href="/diversity/">Inclusion and Diversity</a> <a href="/privacy/">Privacy</a> <a href="/racial-equity-justice-initiative/">Racial Equity 
and Justice</a> <a href="/supplier-responsibility/">Supplier Responsibility</a> </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">About Apple</h3> <ul class="as-globalfooter-directory-column-section-list"> <a href="/newsroom/">Newsroom</a> <a href="/leadership/">Apple Leadership</a> <a href="/careers/us/">Career Opportunities</a> <a href="https://investor.apple.com">Investors</a> <a href="/compliance/">Ethics &amp; Compliance</a> <a href="/apple-events/">Events</a> <a href="/contact/">Contact Apple</a> </ul> </div></div></nav> <div class="as-globalfooter-mini"> <div class="as-globalfooter-mini-shop">More ways to shop: 
<a href="/retail/">Find an Apple Store</a> or <a href="https://locate.apple.com/">other retailer</a> near you. <span>Or call 1-800-MY-APPLE.</span> </div><div class="as-globalfooter-mini-locale"> <a class="as-globalfooter-mini-locale-link" href="/choose-country-region/" title="Choose your country or region" aria-label="United States. Choose your country or region" data-analytics-title="choose your country">United States</a> </div><p class="as-globalfooter-mini-legal-copyright">Copyright © 2022 Apple Inc. All rights reserved. </p><a class="as-globalfooter-mini-legal-link" href="/legal/privacy/">Privacy Policy </a> <a class="as-globalfooter-mini-legal-link" href="/legal/internet-services/terms/site.html">Terms of Use </a> <a class="as-globalfooter-mini-legal-link" href="/us/shop/goto/help/sales_refunds">Sales 
and Refunds </a> <a class="as-globalfooter-mini-legal-link" href="/legal/">Legal </a> <a class="as-globalfooter-mini-legal-link" href="/sitemap/">Site Map </a> </div></div></footer> <script src="https://www.apple.com/v/errors/c/built/scripts/main.built.js" type="text/javascript" charset="utf-8"></script></body></html>

Upvotes: 3

Views: 4734

Answers (1)

ggorlen

Reputation: 56845

First of all, Playwright already has a full suite of selectors that work on the live page. To eliminate a dependency, speed up your scrape, write less code and avoid odd errors when the static HTML snapshot gets out of sync with the live page, I suggest skipping BeautifulSoup (this blog post of mine is oriented to Puppeteer/Node, but applies equally to Playwright/Python).
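
As a rough sketch of what querying the live page directly could look like, the snippet below uses Playwright's locator API in place of a parsed HTML snapshot. The names clean_price and scrape_price are hypothetical helpers, and taking the last matching element is an assumption about where the price sits on the page, not something confirmed for the current site.

```python
SUFFIX = "à vista (10% de desconto)"


def clean_price(raw: str) -> str:
    """Strip the discount suffix and surrounding whitespace from a price string."""
    return raw.replace(SUFFIX, "").strip()


def scrape_price(url: str, selector: str = "span.as-price-installments") -> str:
    """Read the price from the live DOM instead of a static HTML snapshot."""
    # Imported lazily so clean_price stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            # locator() queries the live page and waits for the element to exist;
            # .last is an assumption about which match holds the price.
            raw = page.locator(selector).last.text_content()
        finally:
            browser.close()
    return clean_price(raw or "")
```

Because the locator waits for the element, a missing price shows up as a timeout with a clear message rather than an IndexError on an empty list.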

On to the main problem: you did well to print the HTML to see what sort of response you're dealing with. The 404 page indicates that you've been detected as a bot when running headlessly. In other cases, detection shows up as a captcha, a Cloudflare browser check page, or another "are you a robot?" notice.
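
One way to catch this early, instead of failing later with an IndexError, is to check the response before scraping. The helper below is a minimal sketch: looks_blocked and the BLOCK_MARKERS list are hypothetical, and the marker strings are guesses that would need tuning to the block pages you actually observe.

```python
from typing import Optional

# Hypothetical markers; adjust them to the block pages you actually see.
BLOCK_MARKERS = ("page not found", "captcha", "are you a robot", "access denied")


def looks_blocked(status: Optional[int], title: str) -> bool:
    """Return True when a response resembles an error or bot-check page."""
    # Any 4xx/5xx status is treated as a failed load.
    if status is not None and status >= 400:
        return True
    # Some block pages come back with status 200, so also inspect the title.
    lowered = title.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

With Playwright, page.goto(url) returns the main response (or None), so a call could look like looks_blocked(response.status if response else None, page.title()).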

As with everything in scraping, there's no one-size-fits-all solution, but one typical approach is to set a custom user agent string:

from playwright.sync_api import sync_playwright # 1.44.0

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ua = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
    )
    url = "<Your URL>"
    page = browser.new_page(user_agent=ua)
    page.goto(url, wait_until="domcontentloaded")
    sel = "span.as-price-installments:last-child"
    text = (
        page.wait_for_selector(sel)
        .text_content()
        .replace("à vista (10% de desconto)", "")
        .strip()
    )
    print(text)  # => R$ 1.399,50
    browser.close()

If using a human user agent doesn't unblock you, you can experiment with other ways of changing the browser fingerprint, such as using an off-the-shelf library like this one (note: I have not tried this specific library). There are also various cloud services that run automated browsers with optimized fingerprints on rotating residential proxies.
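
Short of a dedicated library, a few fingerprint-related settings can be changed through Playwright's own browser context options. The sketch below is illustrative only: build_context_options and fetch_title are hypothetical helpers, the locale, timezone, viewport and header values are assumptions chosen for a Brazilian storefront, and there is no guarantee that any of them gets past a given site's detection.

```python
from typing import Any, Dict


def build_context_options(user_agent: str) -> Dict[str, Any]:
    """Keyword arguments for browser.new_context(); all values are illustrative."""
    return {
        "user_agent": user_agent,
        "locale": "pt-BR",
        "timezone_id": "America/Sao_Paulo",
        "viewport": {"width": 1366, "height": 768},
        "extra_http_headers": {"Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8"},
    }


def fetch_title(url: str, user_agent: str) -> str:
    """Open the URL in a headless context with the options above; return its title."""
    # Imported lazily so the options helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(**build_context_options(user_agent))
            page = context.new_page()
            page.goto(url, wait_until="domcontentloaded")
            return page.title()
        finally:
            browser.close()
```

Printing the returned title in headless mode is a quick way to see whether a given combination still lands on the "Page Not Found" response.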

(Note that the site has changed since this answer was posted, so the problem is no longer reproducible. The fundamental ideas still hold true, though.)

Upvotes: 3
