Alissa L.

Reputation: 103

How to download all files from a webpage whose titles contain a certain string?

I need to download all the files from this page:

https://www.dmo.gov.uk/publications/?offset=0&itemsPerPage=1000000&parentFilter=1433&childFilter=1433%7C1450&startMonth=1&startYear=2008&endMonth=6&endYear=2021

that have "Auction of" on their titles. This is the source for one of the files for example:

<a href="/media/17527/pr090621b.pdf" aria-label="Auction of £2,500 million  of 0 5/8% Treasury Gilt 2035, published 09 June 2021">Auction of £2,500 million  of 0 5/8% Treasury Gilt 2035</a>

I am trying to adapt some code I found in another question, but the pages are coming back empty:

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, url, destination_path = task
    response = session.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    
    destination_path = "pgns"
    max_workers = 8

    if not os.path.exists(destination_path):
        os.makedirs(destination_path)
    
    with requests.Session() as session:
        response = session.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*Auction of\?.*"))
        tasks = [
            (session, host + page.get("href"), destination_path) 
            for page in pages
        ]

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)

Upvotes: 0

Views: 367

Answers (3)

MendelG

Reputation: 20018

The find_all() method accepts a function. You can pass a lambda to filter for all a tags whose text contains "Auction of":

for tag in soup.find_all(lambda t: t.name == "a" and "Auction of" in t.get_text()):
    print(tag.text)

Or, you can use an [attribute*=value] CSS selector:

# Find all `a` tags whose `aria-label` attribute contains `Auction of`
for tag in soup.select("a[aria-label*='Auction of']"):
    print(tag.text)
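
Either approach can be tried directly against the page from the question, for example (a sketch that prints rather than downloads):

import requests
from bs4 import BeautifulSoup

url = "https://www.dmo.gov.uk/publications/?offset=0&itemsPerPage=1000000&parentFilter=1433&childFilter=1433%7C1450&startMonth=1&startYear=2008&endMonth=6&endYear=2021"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# Each matching tag carries the relative path of its PDF in href
for tag in soup.select("a[aria-label*='Auction of']"):
    print(tag.text, "->", tag["href"])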

Upvotes: 0

alexis

Reputation: 50190

Check your regular expression syntax: \? matches a literal question mark, so the regex r".*Auction of\?.*" will only match titles that actually contain "of?".
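
A quick check in the interpreter shows the problem (using the title from the link in the question):

import re

title = "Auction of £2,500 million  of 0 5/8% Treasury Gilt 2035"
print(re.match(r".*Auction of\?.*", title))  # None - there is no literal "?"
print(re.match(r".*Auction of\b.*", title))  # <re.Match ...> - word boundary matches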

But the href= parameter matches against the URL in the link, not its text, so that won't help you much either. This will find the links with the matching titles:

links = soup.find_all("a", string=re.compile(r"Auction of\b"))

And this will extract their URLs so you can retrieve them:

[ file["href"] for file in links ]

Upvotes: 1

Alissa L.

Reputation: 103

This is what ended up working for me:

from bs4 import BeautifulSoup
import requests
import re

host = 'https://www.dmo.gov.uk/'
url = 'https://www.dmo.gov.uk/publications/?offset=0&itemsPerPage=1000000000&parentFilter=1433&childFilter=1433|1450&startMonth=1&startYear=2000&endMonth=6&endYear=2021'

# Collect the relative href of every link whose aria-label starts with "Auction of"
links = []
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
for a in soup.find_all("a", {"aria-label": re.compile(r"^Auction of\b")}, href=True):
    links.append(a['href'])

def download_file(url):
    # Name the local file after the last path segment, dropping any query string
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)

# The hrefs are site-relative ("/media/..."), so prefix the host;
# the resulting doubled slash is tolerated by the server
for link in links:
    download_file(host + link)

Upvotes: 0
