Reputation: 61
I am trying to get the href links from the first table of a page loaded in a headless browser, but the error isn't helping me: it doesn't say what is wrong, it just shows lots of ^ symbols underneath the code.
I had to switch to a headless browser because I was scraping empty tables, which I assume is down to how the site's HTML works. I admit I don't understand how it works.
I also want to complete the links so that they work for further use, which is what the three lines before browser.close() in the following code are for:
from playwright.sync_api import sync_playwright

# headless browser to scrape
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://fbref.com/en/comps/9/Premier-League-Stats")

#open the file up
with open("path", 'r') as f:
    file = f.read()

years = list(range(2024,2022, -1))
all_matches = []
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"
for year in years:
    standings_table = page.locator("table.stats_table").first
    link_locators = standings_table.get_by_role("link").all()
    for l in link_locators:
        l.get_attribute("href")
    print(link_locators)
    link_locators = [l for l in links if "/squads/" in l]
    team_urls = [f"https://fbref.com{l}" for l in link_locators]
    print(team_urls)
browser.close()
The stack trace I get is just:
Traceback (most recent call last):
  File "path", line 27, in <module>
    link_locators = standings_table.get_by_role("link").all()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\sync_api\_generated.py", line 15936, in all
    return mapping.from_impl_list(self._sync(self._impl_obj.all()))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "path\.venv\Lib\site-packages\playwright\_impl\_sync_base.py", line 102, in _sync
    raise Error("Event loop is closed! Is Playwright already stopped?")
playwright._impl._errors.Error: Event loop is closed! Is Playwright already stopped?

Process finished with exit code 1
My code is only 33 lines as it's the start of a loop, so I'm unsure what the last two errors in the stack refer to.
I just can't extract the href links. It might have to do with .first.
I implemented the solution from "Get href link using python playwright", but it doesn't work.
Upvotes: 1
Views: 214
Reputation: 57195
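When the context manager (with) block ends, Playwright is stopped and the page and browser are closed, so you can't use them outside the block. Your loop runs after the block has exited, which is why .all() raises the error. A minimal reproduction of the error is: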
from playwright.sync_api import sync_playwright # 1.40.0

with sync_playwright() as p:
    browser = p.chromium.launch()

browser.close()
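The same rule applies to any context manager, not just Playwright. As a rough analogy using only the standard library (the scratch file demo.txt is a placeholder made up for this illustration), a file handle stops working once its with block exits:

```python
import os
import tempfile

# Create a scratch file to read from (placeholder path, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("hello")

with open(path) as f:
    inside = f.read()  # works: the file is still open inside the block

try:
    f.read()  # fails: leaving the with block already closed the file
    outside = "no error"
except ValueError as exc:
    outside = str(exc)

print(inside)
print(outside)
```

Playwright behaves the same way: anything that needs the page or browser has to be indented inside the with block.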
Here's a rewrite suggestion:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    url = "<Your URL>"
    page.goto(url, wait_until="domcontentloaded")
    team_urls = []
    for year in range(2024, 2022, -1):
        standings_table = page.locator("table.stats_table").first
        for x in standings_table.get_by_role("link").all():
            href = x.get_attribute("href")
            # get_attribute() returns None if the attribute is missing
            if href and "/squads/" in href:
                team_urls.append(f'https://www.fbref.com{href}')
    print(team_urls)
    browser.close()
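If you want the "complete the links" step to be testable without launching a browser, it could be pulled into a small helper. This is only a sketch: build_team_urls and the sample hrefs are made up for illustration, and the base URL is the one from the question's own f-string.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"  # base taken from the question's f-string


def build_team_urls(hrefs, base=BASE_URL):
    """Keep only squad links and turn them into absolute URLs."""
    urls = []
    for href in hrefs:
        # Skip missing hrefs (None) and anything that is not a squad page.
        if href and "/squads/" in href:
            urls.append(urljoin(base, href))
    return urls


# Made-up hrefs standing in for the values returned by get_attribute("href").
sample = ["/en/squads/abc123/Example-Stats", None, "/en/players/xyz/Someone"]
team_urls = build_team_urls(sample)
print(team_urls)
```

urljoin is used here instead of string concatenation so that an href that is already absolute would be left alone rather than getting the base prepended twice.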
Blocking resources can help speed things up a bit:
# ...
def handle(route, request):
    block = "image", "script", "xhr", "fetch"
    if request.resource_type in block:
        return route.abort()
    route.continue_()

page.route("**", handle)
page.goto(url, wait_until="domcontentloaded")
# ...
But you can also do this more simply and efficiently without Playwright, since the data is available in the static HTML:
import requests # 2.25.1
from bs4 import BeautifulSoup # 4.10.0

url = "<Your URL>"
soup = BeautifulSoup(requests.get(url).text, "lxml")
team_urls = []
for year in range(2024, 2022, -1):
    standings_table = soup.select_one("table.stats_table")
    for x in standings_table.select("a"):
        href = x["href"]
        if "/squads/" in href:
            team_urls.append(f'https://www.fbref.com{href}')
print(team_urls)
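To check the "first table only" logic offline, without hitting the site, you could run the same idea against an inline snippet. The sketch below swaps BeautifulSoup for the standard library's html.parser so it has no dependencies; the FirstTableLinks class and the sample HTML are invented for illustration and are not the site's real markup.

```python
from html.parser import HTMLParser


class FirstTableLinks(HTMLParser):
    """Collect hrefs from the first <table> whose classes include stats_table."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # table nesting depth while inside the target table
        self.done = False  # set once the target table has been closed
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        if tag == "table":
            if self.depth:
                self.depth += 1  # nested table inside the target table
            elif "stats_table" in (attrs.get("class") or "").split():
                self.depth = 1   # entered the first matching table
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # ignore every later table


# Invented markup: one unrelated table, then two stats tables.
SAMPLE_HTML = """
<table class="other"><tr><td><a href="/skip">x</a></td></tr></table>
<table class="stats_table sortable"><tr>
  <td><a href="/en/squads/abc123/Example-Stats">Example</a></td>
  <td><a href="/en/players/xyz/Someone">Someone</a></td>
</tr></table>
<table class="stats_table"><tr><td><a href="/en/squads/zzz/Second">S</a></td></tr></table>
"""

parser = FirstTableLinks()
parser.feed(SAMPLE_HTML)
team_paths = [h for h in parser.hrefs if "/squads/" in h]
print(team_paths)
```

For real pages the BeautifulSoup version above is shorter and more forgiving of malformed HTML; this variant is mainly useful as a quick dependency-free check.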
Benchmark:
Playwright (with blocked resources):
real 0m4.875s
user 0m1.331s
sys 0m0.250s
Requests/BS:
real 0m0.517s
user 0m0.376s
sys 0m0.029s
Upvotes: 1