James Sheldon

Reputation: 73

Python Playwright Download only certain files from a page

I'm attempting to download files from a page that's constructed almost entirely in JS. Here's the setup of the situation and what I've managed to accomplish.

The page itself takes upward of 5 minutes to load. Once loaded, there are 45,135 links (JS buttons). I need a subset of 377 of those. Then, one at a time (or using ASYNC), click those buttons to initiate the download, rename the download, and save it to a place that will keep the download even after the code has completed.

Here's the code I have and what it manages to do:

import asyncio
from playwright.async_api import async_playwright
from pathlib import Path

path = Path().home() / 'Downloads'
timeout = 300000                   # 5 minute timeout

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://my-fun-page.com", timeout=timeout)
        await page.wait_for_selector('ul.ant-list-items', timeout=timeout) # completely load the page
        downloads = page.locator("button", has=page.locator("span", has_text="_Insurer_")) # this is the list of 377 buttons I care about
        # texts = await downloads.all_text_contents() # just making sure I got what I thought I got
        count = await downloads.count() # count = 377.

# Starting here is where I can't follow the API

        for i in range(count):
            print(f"Starting download {i}")
            await downloads.nth(i).click(timeout=0)
            page.on("download", lambda download: download.save_as(path / download.suggested_filename))
            print("\tDownload acquired...")
        await browser.close()

asyncio.run(main())

UPDATE: 2022/07/15 15:45 CST - Updated code above to reflect something that's closer to working than previously but still not doing what I'm asking.

The code above is actually iterating over the locator object elements and performing the downloads. However, the page.on("download") step isn't working. The files are not showing up in my Downloads folder after the task is completed. Thoughts on where I'm missing the mark?

Python 3.10.5, current public version of Playwright.

Upvotes: 0

Views: 1599

Answers (1)

Charchit Agarwal

Reputation: 3757

  1. First of all, download.save_as returns a coroutine, which you need to await. Since there is no such thing as an "async lambda", and coroutines can only be awaited inside async functions, you cannot use a lambda here. You need to create a separate async function and await download.save_as inside it.
  2. Secondly, you do not need to call page.on repeatedly. Once registered, the callable is called automatically for every "download" event.
  3. Thirdly, you need to call page.on before the download actually happens (or, in general, before the event fires). It's often best to move these calls to right after you create the page with .new_page().
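
The three points above could be put together roughly as follows. This is only a minimal sketch: save_download, register_download_handler, and DOWNLOAD_DIR are hypothetical names introduced here for illustration, and it assumes page is a Playwright async Page and that each download object exposes suggested_filename and an awaitable save_as, as in the question.

```python
import asyncio
from pathlib import Path

# Assumed target folder, mirroring the question's `path`
DOWNLOAD_DIR = Path.home() / "Downloads"


async def save_download(download, target_dir=DOWNLOAD_DIR):
    """Await save_as so the coroutine actually runs (point 1)."""
    target = Path(target_dir) / download.suggested_filename
    await download.save_as(target)
    return target


def register_download_handler(page, target_dir=DOWNLOAD_DIR):
    """Register one async handler for all "download" events (points 2 and 3)."""

    async def on_download(download):
        await save_download(download, target_dir)

    # Call this once, before any click that can trigger a download
    page.on("download", on_download)
```

The idea would be to call register_download_handler(page) once, right after context.new_page(), and then run the click loop. The main flow still does not wait for these handlers, so the browser should stay open until the saves have finished.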

A Better Solution

These were the obvious mistakes in your approach; fixing them should make it work. However, since you know exactly when each download is going to start (right after you click downloads.nth(i)), I would suggest using expect_download instead. It waits for the download event triggered by the click, and awaiting download.save_as then makes sure the file is completely downloaded before your main program continues (the main flow does not wait for callables registered with page.on). Your code would then look something like this:

for i in range(count):
    print(f"Starting download {i}")
    # Wait for the download event triggered by this click
    async with page.expect_download() as download_handler:
        await downloads.nth(i).click(timeout=0)

    download = await download_handler.value
    # path is a pathlib.Path, so join with "/" rather than string concatenation
    await download.save_as(path / download.suggested_filename)

    print("\tDownload acquired...")
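
Since the question also asks about renaming each file, the same loop could be wrapped in a helper that picks its own file name. This is a sketch under assumptions: download_each and the zero-padded index prefix are hypothetical choices made here, not something the question or the Playwright API prescribes, and the 5 minute timeout simply mirrors the question's setting.

```python
import asyncio
from pathlib import Path


async def download_each(page, buttons, target_dir: Path, timeout_ms: float = 300_000):
    """Click each button in turn and save its download under a new name."""
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    count = await buttons.count()
    for i in range(count):
        # Wait for the download event triggered by this particular click
        async with page.expect_download(timeout=timeout_ms) as download_info:
            await buttons.nth(i).click()
        download = await download_info.value
        # Hypothetical naming scheme: prefix the suggested name with the index
        target = target_dir / f"{i:03d}_{download.suggested_filename}"
        # save_as copies the file out of Playwright's temporary storage
        await download.save_as(target)
        saved.append(target)
    return saved
```

With the names from the question, this would be called inside main() as await download_each(page, downloads, path), before browser.close().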

Upvotes: 2
